confluentinc / kafka-connect-hdfs

Kafka Connect HDFS connector

Set GZIP compression for Parquet File #487

Open · zizake opened 4 years ago

zizake commented 4 years ago

Hello,

I have the following configuration for the sink connector. Is there any way to set a custom compression for Parquet files? By default it is Snappy; I would like to change it to GZIP for its better compression ratio.

In Hive, the equivalent command would be: SET parquet.compression=GZIP;

connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
hadoop.conf.dir=/etc/hadoop/conf
flush.size=10000
schema.compatibility=BACKWARD
tasks.max=1
topics=kafkaplayground
timezone=UTC
hdfs.url=hdfs://XXXXXXXXXXXXXx:8020
hive.metastore.uris=thrift://XXXXXXXXXXX:9083
locale=en-us
key.converter.schemas.enable=false
value.converter.schema.registry.url=http://XXXXXXXXXXXXXX:8081
hive.integration=true
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.HourlyPartitioner
value.converter=io.confluent.connect.avro.AvroConverter

Thanks!

levzem commented 4 years ago

@zizake unfortunately it looks like we don't support changing the compression yet, but that could be a good contribution if you are interested in opening a PR
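
For anyone picking this up: in parquet-avro, the codec is chosen when the writer is built, so a PR along these lines would read a new config value and pass it through at that point. Below is a minimal sketch using the standard parquet-avro `AvroParquetWriter` API; the helper and its wiring are illustrative, not the connector's actual code:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class CodecSketch {

    // Hypothetical helper: map a config string like "gzip" to a Parquet codec.
    static CompressionCodecName toCodec(String configured) {
        // "none" is the one name that doesn't match the enum constant directly.
        if ("none".equalsIgnoreCase(configured)) {
            return CompressionCodecName.UNCOMPRESSED;
        }
        return CompressionCodecName.valueOf(configured.toUpperCase()); // e.g. "gzip" -> GZIP
    }

    static ParquetWriter<GenericRecord> openWriter(Path path, Schema schema, String codec)
            throws java.io.IOException {
        return AvroParquetWriter.<GenericRecord>builder(path)
                .withSchema(schema)
                // This call decides the compression; the connector would pass
                // the configured codec here instead of a hard-coded constant.
                .withCompressionCodec(toCodec(codec))
                .build();
    }
}
```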

moeinxyz commented 1 year ago

@levzem I see the following in the documentation. If changing the compression isn't supported, does that mean the documentation is inaccurate?

parquet.codec
  The Parquet compression codec to be used for output files.
  Type: string
  Default: snappy
  Valid Values: [none, snappy, gzip, brotli, lz4, lzo, zstd]
  Importance: low
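
If that parquet.codec option is available in your connector version, setting GZIP should just be a matter of adding it to the sink configuration. A sketch against the config from the original post (hosts redacted as in the original; whether the property takes effect depends on the connector version you are running):

```properties
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
# Documented values: none, snappy, gzip, brotli, lz4, lzo, zstd (default: snappy)
parquet.codec=gzip
```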