arickbro opened this issue 3 years ago (status: Open)
@arickbro are there any error or warning messages in the logs for the connector? Also, are you sure that the records are completely missing (i.e., irrecoverably skipped) as opposed to simply not written yet but due to be written at some point in the future?
I don't see any errors in the Docker logs. See below: at 2021-01-28 21:10:04 UTC there are 2 missing records for batch insert.
I don't see any error in the attached log either: log.txt
How are you calculating the numbers for total streaming insert records and total batch load records?
The data parsed from the files contains the filename. What I did is run count(*) on the sometopic and sometopic_stream tables, group by filename, and compare the result with the log table.
There are 3 tables in BigQuery; sometopic_stream and sometopic receive data from the same process.
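Roughly, the comparison looks like the sketch below, assuming each row carries a `filename` column and using the project/dataset names from the connector configs; the comparison against the log table works the same way:

```python
# Sketch: compare per-file record counts between the streaming-insert table
# and the batch-load table. Assumes each row has a `filename` column and
# that `some_project` / `somedataset` match the connector configuration.
from google.cloud import bigquery

client = bigquery.Client(project="some_project")

query = """
SELECT s.filename, s.stream_count, IFNULL(b.batch_count, 0) AS batch_count
FROM (
  SELECT filename, COUNT(*) AS stream_count
  FROM `some_project.somedataset.sometopic_stream`
  GROUP BY filename
) AS s
LEFT JOIN (
  SELECT filename, COUNT(*) AS batch_count
  FROM `some_project.somedataset.sometopic`
  GROUP BY filename
) AS b USING (filename)
WHERE s.stream_count != IFNULL(b.batch_count, 0)
"""

# Any row returned here is a file whose batch-loaded count does not match
# the streaming-insert count.
for row in client.query(query).result():
    print(row.filename, row.stream_count, row.batch_count)
```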
When are you doing the count(*)? It's possible that there's some lag in when data gets batch loaded from GCS into BigQuery. To ensure this isn't the case, you might try publishing a fixed amount of data upstream, then let both connectors run for a while, then check for data in BigQuery.
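As a quick sanity check on whether anything is simply still in flight rather than lost, one could also list pending/running jobs and any failed load jobs with the BigQuery Python client (a sketch; the project ID is taken from the connector config):

```python
# Sketch: list BigQuery jobs that are still pending or running, and any
# recent load jobs that finished with an error, in the project the
# connector writes to.
from google.cloud import bigquery

client = bigquery.Client(project="some_project")

for state in ("pending", "running"):
    for job in client.list_jobs(state_filter=state):
        print(state, job.job_type, job.job_id)

# Completed load jobs that failed would show an error_result here.
for job in client.list_jobs(state_filter="done", max_results=100):
    if job.job_type == "load" and job.error_result:
        print("failed load job:", job.job_id, job.error_result)
```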
I don't think that's it; during my observation over a couple of days, the files with missing records never got updated.
Hmm... it's difficult to say more about this without knowing more about your setup. I'm not too familiar with the batch loading logic but from what I can tell, anything that gets written to GCS should eventually get written to BigQuery, as long as the connector has permission to move it. Can you try to inspect the data in GCS just to be sure? Either data is never making its way into GCS, or it's getting stuck there somehow.
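As a sketch of that inspection, assuming the `some_bucket` / `some_project` names from the batch-load config, listing the staging blobs would look something like this:

```python
# Sketch: list what is currently sitting in the batch-load staging bucket.
# Blobs that never disappear would suggest data is written to GCS but never
# loaded into BigQuery; an empty listing would suggest it never got there.
from google.cloud import storage

client = storage.Client(project="some_project")

for blob in client.list_blobs("some_bucket"):
    print(blob.name, blob.size, blob.updated)
```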
It'd also be helpful to know what version of the connector you're running, and how exactly the log table is populated.
There are three things that need to be fixed, all parallel-processing bugs: (1) the connector uses the same bucket without separating blob names by table, (2) it does not wait for the BigQuery load job to complete, and (3) its temporary GCS file names can be duplicated.
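To illustrate the last two points, here is a minimal sketch (not the connector's actual Java code; the `stage_and_load` helper and the blob-naming scheme are assumptions) of per-table blob naming plus blocking on the load job, using the BigQuery and GCS Python clients:

```python
# Illustrative sketch only -- not the connector's actual implementation.
# Point: staging blob names should be unique per table/partition/offset
# range, and the load job should be waited on before the blob is deleted.
from google.cloud import bigquery, storage

bq = bigquery.Client(project="some_project")
gcs = storage.Client(project="some_project")

def stage_and_load(table, topic, partition, start_offset, end_offset, payload):
    # Include the destination table and offset range in the blob name so
    # two tables (or two tasks) never overwrite each other's staging files.
    blob_name = f"{table}/{topic}-{partition}-{start_offset}-{end_offset}.json"
    blob = gcs.bucket("some_bucket").blob(blob_name)
    blob.upload_from_string(payload, content_type="application/json")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    )
    job = bq.load_table_from_uri(
        f"gs://some_bucket/{blob_name}",
        f"some_project.somedataset.{table}",
        job_config=job_config,
    )
    job.result()   # block until the load job actually finishes
    blob.delete()  # only clean up the staging file after a successful load
```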
Hi,
I set up 2 connectors that sink data from the same topic into BigQuery; a producer publishes several messages per file to that topic.
This is the connector configuration using streaming inserts:
connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
autoUpdateSchemas=true
sanitizeTopics=true
autoCreateTables=true
tasks.max=1
topics=sometopic
schemaRegistryLocation=http://schema-registry:8081
topicsToTables=sometopic=sometopic_stream
project=some_project
maxWriteSize=10000
datasets=.*=somedataset
keyfile=somekey.json
name=sink-bigquery-stream
schemaRetriever=com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever
key.converter=org.apache.kafka.connect.storage.StringConverter
tableWriteWait=1000
bufferSize=100000
This is the connector configuration using batch load:

connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
gcsBucketName=some_bucket
autoUpdateSchemas=true
sanitizeTopics=true
autoCreateTables=true
tasks.max=4
topics=sometopic
schemaRegistryLocation=http://schema-registry:8081
project=some_project
maxWriteSize=10000
datasets=.*=somedataset
enableBatchLoad=sometopic
keyfile=somekey.json
name=sink-bigquery
schemaRetriever=com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever
key.converter=org.apache.kafka.connect.storage.StringConverter
tableWriteWait=1000
bufferSize=100000
I created another sink to BigQuery that records the number of messages I produce to Kafka for each given filename. As a result, 6 out of 336 files have missing records for batch load, while streaming insert has no missing records. Could someone give a hint on how to debug this, or is something wrong with my connector configuration? I'm using this version of the connector: kafka-connect-bigquery:1.6.6