confluentinc / kafka-connect-hdfs

Kafka Connect HDFS connector

Hive `avro.schema.literal` max size exceeded due to overly long Avro schema #331

Open daigorowhite opened 6 years ago

daigorowhite commented 6 years ago

Hi team, I have an issue with a long Avro schema when using the kafka-connect-hdfs Hive integration.

When I sink a table with a long schema into HDFS with kafka-connect-hdfs, the data is written to HDFS successfully, but I get this error when I run a query:

0: jdbc:hive2://localhost:10000/default> select * from db.table limit 1 ;
Error: Error while compiling statement: FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered exception determining schema. Returning signal schema to indicate problem: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was expecting closing quote for a string value
 at [Source: java.io.StringReader@2c87f90b; line: 1, column: 6001]) (state=42000,code=40000)

I investigated the root cause: the schema literal is truncated by the varchar size of the Hive metastore parameter column. https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/avro/AvroHiveUtil.java#L69 https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/avro/AvroHiveUtil.java#L95

mysql> describe TABLE_PARAMS;
+-------------+---------------+------+-----+---------+-------+
| Field       | Type          | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+-------+
| TBL_ID      | bigint(20)    | NO   | PRI | NULL    |       |
| PARAM_KEY   | varchar(256)  | NO   | PRI | NULL    |       |
| PARAM_VALUE | varchar(4000) | YES  |     | NULL    |       |
+-------------+---------------+------+-----+---------+-------+
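
For illustration, a hedged way to confirm the truncation, assuming direct access to a MySQL-backed metastore (depending on the Hive version and setup, the literal may be stored in TABLE_PARAMS or SERDE_PARAMS):

-- Sketch only: measure how much of avro.schema.literal was actually stored.
-- Table and column names follow the stock MySQL metastore schema shown above.
SELECT t.TBL_NAME, LENGTH(p.PARAM_VALUE) AS stored_length
FROM TABLE_PARAMS p
JOIN TBLS t ON t.TBL_ID = p.TBL_ID
WHERE p.PARAM_KEY = 'avro.schema.literal';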

Do you have any solution/idea for this in kafka-connect-hdfs?

rhauch commented 6 years ago

I've seen this workaround: log into the Hive metastore DB, then run alter table SERDE_PARAMS MODIFY PARAM_VALUE VARCHAR(40000);

Probably want to pick a realistic size, though.
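
A minimal sketch of that workaround, assuming a MySQL-backed metastore (back up the metastore database first; the exact width is a judgment call, and with multi-byte charsets a very wide VARCHAR can hit MySQL row-size limits, in which case a TEXT type is the usual alternative):

-- Hedged sketch only; verify column names and sizes against your own metastore.
ALTER TABLE SERDE_PARAMS MODIFY PARAM_VALUE VARCHAR(40000);
ALTER TABLE TABLE_PARAMS MODIFY PARAM_VALUE VARCHAR(40000);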

daigorowhite commented 6 years ago

It is one way to avoid this problem, but it depends on the Hive setup, and the column could be shortened again by a Hive upgrade. How about just using avro.schema.url to reference a schema file of unlimited size?
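
For context, a hedged illustration of what a Hive table looks like when the schema comes from a file rather than an embedded literal (the table name, location, and schema path below are hypothetical):

-- Illustration only; names and paths are made up.
CREATE EXTERNAL TABLE db.example_table
STORED AS AVRO
LOCATION 'hdfs://namenode/topics/example_table'
TBLPROPERTIES ('avro.schema.url'='hdfs://namenode/schemas/example_table.avsc');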

rhauch commented 6 years ago

@daigorowhite yes, using avro.schema.url is fine as long as the schema can be stored somewhere. HDFS is an obvious place, but that requires non-trivial code changes to handle all of the cases (e.g., the schema changes in a consumed topic and needs to be stored somewhere).

I only mentioned the workaround because it works today and requires no code changes. It may not be ideal, and it may not work for everyone.

OneCricketeer commented 6 years ago

See HIVE-12274.

You could manually apply the upgrade script for MySQL.
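
For reference, the MySQL part of that upgrade widens the metastore's PARAM_VALUE columns to MEDIUMTEXT; a hedged approximation (check the official upgrade script for your Hive version before touching the metastore):

-- Approximation only; the real upgrade script may cover more tables.
ALTER TABLE TABLE_PARAMS MODIFY PARAM_VALUE MEDIUMTEXT;
ALTER TABLE SERDE_PARAMS MODIFY PARAM_VALUE MEDIUMTEXT;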

daigorowhite commented 6 years ago

Thanks for sharing it! 👍

OneCricketeer commented 6 years ago

Duplicates #145

Vincent-Zeng commented 4 years ago

Hi, team. With hive.integration=true, how can the kafka-connect sink use avro.schema.url instead of avro.schema.literal? Or do I need to alter the table manually in Hive?
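
For the manual route, a hedged sketch of pointing an existing table at a schema file (the table name and schema path are hypothetical; since the AvroSerDe gives avro.schema.literal precedence when both are set, the literal should be unset as well):

-- Sketch only; the table name and schema location are made up.
ALTER TABLE db.example_table SET TBLPROPERTIES ('avro.schema.url'='hdfs://namenode/schemas/example_table.avsc');
ALTER TABLE db.example_table UNSET TBLPROPERTIES ('avro.schema.literal');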