geonyeongkim opened this issue 1 year ago
The input element of the SortOperator should be a RowData, because the serializer is hard-coded to BinaryRowDataSerializer.
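For readers hitting the same cast error, here is a minimal sketch (my addition, not from the thread) of wiring Flink's stock JsonRowDataDeserializationSchema into the Kafka source so the stream carries RowData; the variable names are placeholders:

```kotlin
// Minimal sketch (assumption: flink-json and the new Kafka connector are on the classpath).
// Emitting RowData with explicit InternalTypeInfo avoids the Kryo fallback that breaks
// the bulk_insert SortOperator, which hard-codes BinaryRowDataSerializer.
val jsonDeserializer = JsonRowDataDeserializationSchema(
    rowType,                        // RowType of the target table, e.g. derived from the Avro schema
    InternalTypeInfo.of(rowType),   // RowData type info instead of a Kryo GenericTypeInfo
    false,                          // failOnMissingField
    true,                           // ignoreParseErrors
    TimestampFormat.ISO_8601
)
val kafkaSource: KafkaSource<RowData> = KafkaSource.builder<RowData>()
    .setBootstrapServers(bootstrapServers)   // placeholder
    .setTopics(topic)                        // placeholder
    .setDeserializer(KafkaRecordDeserializationSchema.valueOnly(jsonDeserializer))
    .build()
```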
Hello. I looked at HoodieFlinkStreamer on GitHub and used JsonRowDataDeserializationSchema to work around the SortOperator issue. I have a few questions about it.
With the operation set to BULK_INSERT there is no longer an error, but the job only consumes the Kafka messages and never actually creates Parquet files in HDFS.
My code simply writes Kafka messages to a Hudi table on HDFS:
```kotlin
@JvmStatic
fun main(args: Array<String>) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment()
    env.enableCheckpointing(5000)

    // Derive the RowType from the source Avro schema.
    val props = Configuration()
    props.setString(FlinkOptions.SOURCE_AVRO_SCHEMA, "avro schema")
    val rowType = AvroSchemaConverter.convertToDataType(StreamerUtil.getSourceSchema(props)).logicalType as RowType

    // Kafka source that deserializes JSON messages directly into RowData.
    val kafkaSource = KafkaSource.builder<RowData>()
        .setBootstrapServers(bootstrapServers)
        .setTopics(topic)
        .setGroupId(SampleHudiApp::class.java.name)
        .setClientIdPrefix(UUID.randomUUID().toString())
        .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
        .setDeserializer(CustomJsonRowDataDeserializationSchema(
            rowType,
            InternalTypeInfo.of(rowType),
            false,
            true,
            TimestampFormat.ISO_8601
        ))
        .build()

    // Hudi sink: COPY_ON_WRITE table partitioned by partition_path.
    HoodiePipeline.builder("hudi_test_table")
        .column("id BIGINT")
        .column("name STRING")
        .column("`partition_path` STRING")
        .column("ts BIGINT")
        .column("dc STRING")
        .column("op STRING")
        .pk("id")
        .partition("partition_path")
        .options(mapOf(
            FlinkOptions.PATH.key() to "hdfs:///user/geonyeong.kim/hudi_flink_test",
            FlinkOptions.TABLE_TYPE.key() to HoodieTableType.COPY_ON_WRITE.name,
            FlinkOptions.INDEX_GLOBAL_ENABLED.key() to "false"
        ))
        .sink(env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "hudi_source"), true)

    env.execute("HUDI STREAM SINK")
}
```
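For reference, the snippet above does not show where BULK_INSERT is selected. A hedged sketch of doing it in the same options map, using the real FlinkOptions.OPERATION key and WriteOperationType enum (the exact placement is my assumption):

```kotlin
.options(mapOf(
    FlinkOptions.PATH.key() to "hdfs:///user/geonyeong.kim/hudi_flink_test",
    FlinkOptions.TABLE_TYPE.key() to HoodieTableType.COPY_ON_WRITE.name,
    FlinkOptions.INDEX_GLOBAL_ENABLED.key() to "false",
    // assumption: bulk_insert is selected explicitly; this resolves to the string "bulk_insert"
    FlinkOptions.OPERATION.key() to WriteOperationType.BULK_INSERT.value()
))
```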
I looked at the org.apache.hudi.sink.utils.Pipelines class and confirmed that BulkInsertWriteFunction is used for bulk_insert mode and AppendWriteFunction for append mode.
However, if the index type is not BUCKET, BulkInsertWriteFunction uses BulkInsertWriterHelper, and AppendWriteFunction also uses BulkInsertWriterHelper. So if the index type is FLINK_STATE, will the two behave the same?
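For context, a hedged sketch of how that index type would be chosen through the pipeline options (FlinkOptions.INDEX_TYPE is the real option key and FLINK_STATE its default; whether BUCKET fits this job is an assumption):

```kotlin
.options(mapOf(
    // switch from the default FLINK_STATE index to the bucket index,
    // which routes bulk_insert to the bucket-aware writer per the observation above
    FlinkOptions.INDEX_TYPE.key() to HoodieIndex.IndexType.BUCKET.name
))
```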
I also want to apply compression to the COW table's Parquet writes in Flink.
I saw that the Flink write path creates a HoodieFlinkWriteClient in FlinkWriteClients based on the FlinkOptions values, and that each write function uses it.
So I overrode the FlinkWriteClients class and added the parquetCompressionCodec("gzip") setting.
However, compression was not applied. Is this not supported in Flink?
I would like to adopt Hudi, so I'd appreciate your help.
@danny0405 Hello. Could you answer the question above?
BULK_INSERT uses BulkInsertWriterHelper to write the parquet files directly; there are no UPSERTs. If FLINK_STATE is used, things are very different: the StreamWriteFunction would kick in.

The compression codec can be set through the HoodiePipeline#options you have used, e.g.

```sql
create table xxx(
) with (
  'connector' = 'hudi',
  'hoodie.parquet.compression.codec' = 'gzip'
);
```

or

```kotlin
HoodiePipeline.builder("xxx")
    .option("hoodie.parquet.compression.codec", "gzip")
```

The default codec is already gzip, probably that is the reason you do not perceive any difference.
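Since gzip is already the default, a quick hedged check (my suggestion, not from the thread) is to configure a visibly different codec and compare the resulting file sizes:

```kotlin
// Hedged sketch: use a non-default Parquet codec so the effect is observable.
HoodiePipeline.builder("hudi_test_table")
    .option("hoodie.parquet.compression.codec", "snappy")  // or "zstd"
```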
@danny0405 Thank you for your reply.
Ckp means checkpoint, right? As shown in the attached screenshots, checkpoints complete normally, but there are still no files in HDFS while the Kafka messages are being consumed.
Moreover, the problem is that the offsets are still being committed to the Kafka broker.

[screenshot: checkpoint]
[screenshot: hdfs directory]
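One hedged way to verify whether any Hudi commit actually completed (my addition; the table path is the one from the job config above, and the Hadoop FileSystem usage is a standard API, not something from the thread):

```kotlin
// Hedged sketch: list the Hudi timeline directory; completed COW commits end with ".commit".
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.net.URI

val fs = FileSystem.get(URI.create("hdfs:///"), org.apache.hadoop.conf.Configuration())
fs.listStatus(Path("/user/geonyeong.kim/hudi_flink_test/.hoodie"))
    .forEach { println(it.path.name) }
```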
Also, in the FLINK_STATE case, could you explain the difference between bulk_insert and append in detail?

I restarted the job after adding the setting below, as you suggested:

```kotlin
HoodiePipeline.builder("xxx")
    .option("hoodie.parquet.compression.codec", "gzip")
```

However, gzip compression is still not applied.
I know that compression can be difficult to apply in Hadoop-related streaming writes.
But it is very strange that bulk_insert doesn't work.
> Then, in case of FLINK_STATE, can you tell me the difference between bulk_insert and append in detail?

The Flink state index only works for the UPSERT operation, not BULK_INSERT.

> But it's very strange that bulk_insert doesn't work.

Bulk insert only works in batch execution mode.
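To make that concrete, here is a minimal hedged sketch (mine, reusing placeholder names from the code above such as bootstrapServers, topic and the RowData deserializer) of running the same pipeline in batch execution mode with a bounded Kafka source:

```kotlin
// Hedged sketch: bulk_insert needs batch execution mode, which in turn needs a bounded source.
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.setRuntimeMode(RuntimeExecutionMode.BATCH)

val boundedKafkaSource = KafkaSource.builder<RowData>()
    .setBootstrapServers(bootstrapServers)
    .setTopics(topic)
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setBounded(OffsetsInitializer.latest())  // read up to the current end offsets, then finish
    .setDeserializer(KafkaRecordDeserializationSchema.valueOnly(jsonDeserializer))
    .build()
```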
**Describe the problem you faced**

Hello. I'm reading log data in JSON format from Kafka and building an app that loads it into a Hudi table using the Hudi streaming API.
The operation has been set to BULK_INSERT to load the log data.
However, with BULK_INSERT a casting problem occurs:

`KryoSerializer cannot be cast to class org.apache.flink.table.runtime.typeutils.AbstractRowDataSerializer`

This happens while the SortOperator is opened, because Flink uses Kryo as its default serializer.
How can I use the SortOperator to perform BULK_INSERT? My code is shown above.
**Environment Description**

- Hudi version: 0.12.2
- Flink version: 1.15.1