confluentinc / kafka-connect-bigquery

A Kafka Connect BigQuery sink connector
Apache License 2.0

batch loading writes very few records per file (usually 1) #130

Open ideasculptor opened 3 years ago

ideasculptor commented 3 years ago

When I enable batch loading and send 500 messages to the topic, I get a directory in the storage bucket that contains 469 separate files, even though all 500 messages arrive within the 2-minute batch load window. For a high-volume topic I definitely want to batch load so that I don't run into the BigQuery quotas, but writing one record per file in GCS just means I'll hit the Cloud Storage quotas instead. A few files contain as many as 10 records, but the vast majority hold only a single record. In my simple test case, the producer just spits out 500 very small messages as fast as it can, so the vast majority surely land within the same second.
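For anyone reproducing this, a quick way to tally how many records each staged file holds looks something like the sketch below. The bucket and prefix are placeholders for whatever gcsBucketName and gcsFolderName are set to, and it assumes the staging files are newline-delimited with one record per line.

```python
from collections import Counter

from google.cloud import storage  # pip install google-cloud-storage

# Placeholders: substitute whatever gcsBucketName / gcsFolderName are set to.
BUCKET = "my-batch-bucket"
PREFIX = "mydomain_events/batch/"

client = storage.Client()
files_by_record_count = Counter()

for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    # Assumes the staging files are newline-delimited, one record per line.
    records = sum(1 for line in blob.download_as_text().splitlines() if line.strip())
    files_by_record_count[records] += 1

for records, files in sorted(files_by_record_count.items()):
    print(f"{files} file(s) containing {records} record(s)")
```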

Also, I was a little surprised to see that the batch data is written in a schemaless format; I was expecting Avro data files or similar. How will schema evolution be handled correctly when batch loading if the staged files carry no schema information?
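For context on what a load from schemaless files implies: a plain newline-delimited JSON load job takes its schema from the destination table (optionally relaxed by schema-update options), so something along these lines presumably has to happen downstream. This is only a sketch using the google-cloud-bigquery client, not necessarily what the connector itself does, and the URI, dataset, and table names are invented.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # With no schema embedded in the files, new fields only land if the load
    # job is allowed to widen the destination table's existing schema.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

# Placeholder URI and table; the point is that the schema comes from the
# table, not from the staged files themselves.
load_job = client.load_table_from_uri(
    "gs://my-batch-bucket/mydomain_events/batch/*",
    "my_dataset.MyDomainEvent",
    job_config=job_config,
)
load_job.result()
```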

ideasculptor commented 3 years ago

Once I got the permissions fixed up so that the connector could insert data into BigQuery (it would be VERY good if the documentation listed the set of roles that are needed), it started generating this error:

connect            | [2021-09-03 00:58:39,215] ERROR Found blob mydomain_events/events/MyDomainEvent/dt=2021-09-02/hr=18/MyDomainEvent+0+0000001000.avro with no metadata. (com.wepay.kafka.connect.bigquery.GCSToBQLoadRunnable)

But that directory is NOT the directory it was writing the batch data to. The batch data gets written to mydomain_events/batch.

I have a GCS sink connector writing to the same bucket, at mydomain_events/events, where the topic name is MyDomainEvent and the data is partitioned. That data is correctly written as Avro binary data. Why is the BigQuery sink connector reading from directories that it is not configured to write batch data to?
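For anyone hitting the same collision, one workaround to try is pointing the batch-load staging at a bucket that nothing else writes to, so the GCS-to-BQ load thread only ever scans its own blobs. Below is a rough sketch of such a connector config posted to the Connect REST API; it assumes the property names from the connector docs, and the project, dataset, bucket, and keyfile values are placeholders.

```python
import requests  # pip install requests

# All values below are placeholders; property names should be checked against
# the connector documentation for the version in use.
connector = {
    "name": "bigquery-batch-sink",
    "config": {
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "topics": "MyDomainEvent",
        "project": "my-gcp-project",
        "defaultDataset": "my_dataset",
        "keyfile": "/secrets/bigquery-writer.json",
        # Batch loading staged in a bucket no other connector writes to, so
        # the GCS-to-BQ load thread only ever sees blobs it created itself.
        "enableBatchLoad": "MyDomainEvent",
        "batchLoadIntervalSec": "120",
        "gcsBucketName": "mydomain-bq-batch-staging",
        "gcsFolderName": "batch",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```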