arickbro opened this issue 3 years ago (status: Open)
@arickbro are there any error or warning messages in the logs for the connector? Also, are you sure that the records are completely missing (i.e., irrecoverably skipped) as opposed to simply not written yet but due to be written at some point in the future?
I don't see any errors in the Docker logs. See below: at 2021-01-28 21:10:04 UTC there are 2 missing records for batch insert.
I don't see any error in the attached log either: log.txt
How are you calculating the numbers for total streaming insert records and total batch load records?
The data parsed from the files contains the filename. What I did is run count(*) on the sometopic and sometopic_stream tables, group by filename, and compare the result with the log table.
There are 3 tables in BigQuery; sometopic_stream and sometopic receive data from the same process.
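Roughly, the comparison looks like the sketch below, assuming each row carries a `filename` column and using the project/dataset names from the connector configs; the comparison against the log table works the same way:

```python
# Sketch: compare per-file record counts between the streaming-insert table
# and the batch-load table. Assumes each row has a `filename` column and
# that `some_project` / `somedataset` match the connector configuration.
from google.cloud import bigquery

client = bigquery.Client(project="some_project")

query = """
SELECT s.filename, s.stream_count, IFNULL(b.batch_count, 0) AS batch_count
FROM (
  SELECT filename, COUNT(*) AS stream_count
  FROM `some_project.somedataset.sometopic_stream`
  GROUP BY filename
) AS s
LEFT JOIN (
  SELECT filename, COUNT(*) AS batch_count
  FROM `some_project.somedataset.sometopic`
  GROUP BY filename
) AS b USING (filename)
WHERE s.stream_count != IFNULL(b.batch_count, 0)
"""

# Any row returned here is a file whose batch-loaded count does not match
# the streaming-insert count.
for row in client.query(query).result():
    print(row.filename, row.stream_count, row.batch_count)
```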
When are you doing the count(*)? It's possible that there's some lag in when data gets batch loaded from GCS into BigQuery. To ensure this isn't the case, you might try publishing a fixed amount of data upstream, then let both connectors run for a while, then check for data in BigQuery.
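As a quick sanity check on whether anything is simply still in flight rather than lost, one could also list pending/running jobs and any failed load jobs with the BigQuery Python client (a sketch; the project ID is taken from the connector config):

```python
# Sketch: list BigQuery jobs that are still pending or running, and any
# recent load jobs that finished with an error, in the project the
# connector writes to.
from google.cloud import bigquery

client = bigquery.Client(project="some_project")

for state in ("pending", "running"):
    for job in client.list_jobs(state_filter=state):
        print(state, job.job_type, job.job_id)

# Completed load jobs that failed would show an error_result here.
for job in client.list_jobs(state_filter="done", max_results=100):
    if job.job_type == "load" and job.error_result:
        print("failed load job:", job.job_id, job.error_result)
```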
I don't think that's it; during my observation over a couple of days, the files with missing records never got updated.
Hmm... it's difficult to say more about this without knowing more about your setup. I'm not too familiar with the batch loading logic but from what I can tell, anything that gets written to GCS should eventually get written to BigQuery, as long as the connector has permission to move it. Can you try to inspect the data in GCS just to be sure? Either data is never making its way into GCS, or it's getting stuck there somehow.
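As a sketch of that inspection, assuming the `some_bucket` / `some_project` names from the batch-load config, listing the staging blobs would look something like this:

```python
# Sketch: list what is currently sitting in the batch-load staging bucket.
# Blobs that never disappear would suggest data is written to GCS but never
# loaded into BigQuery; an empty listing would suggest it never got there.
from google.cloud import storage

client = storage.Client(project="some_project")

for blob in client.list_blobs("some_bucket"):
    print(blob.name, blob.size, blob.updated)
```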
It'd also be helpful to know what version of the connector you're running, and how exactly the log table is populated.
There are three things that need to be fixed, all parallel-processing bugs: (1) the connector uses the same bucket without separating blob names by table, (2) it does not wait for the BigQuery load job to complete, and (3) its temporary GCS file names can be duplicated.
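To illustrate the last two points, here is a minimal sketch (not the connector's actual Java code; the `stage_and_load` helper and the blob-naming scheme are assumptions) of per-table blob naming plus blocking on the load job, using the BigQuery and GCS Python clients:

```python
# Illustrative sketch only -- not the connector's actual implementation.
# Point: staging blob names should be unique per table/partition/offset
# range, and the load job should be waited on before the blob is deleted.
from google.cloud import bigquery, storage

bq = bigquery.Client(project="some_project")
gcs = storage.Client(project="some_project")

def stage_and_load(table, topic, partition, start_offset, end_offset, payload):
    # Include the destination table and offset range in the blob name so
    # two tables (or two tasks) never overwrite each other's staging files.
    blob_name = f"{table}/{topic}-{partition}-{start_offset}-{end_offset}.json"
    blob = gcs.bucket("some_bucket").blob(blob_name)
    blob.upload_from_string(payload, content_type="application/json")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    )
    job = bq.load_table_from_uri(
        f"gs://some_bucket/{blob_name}",
        f"some_project.somedataset.{table}",
        job_config=job_config,
    )
    job.result()   # block until the load job actually finishes
    blob.delete()  # only clean up the staging file after a successful load
```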
Hi,
I set up 2 connectors that sink data from the same topic into BigQuery; a producer publishes several messages per file to that topic.
This is the connector configuration using streaming inserts:
connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
autoUpdateSchemas=true
sanitizeTopics=true
autoCreateTables=true
tasks.max=1
topics=sometopic
schemaRegistryLocation=http://schema-registry:8081
topicsToTables=sometopic=sometopic_stream
project=some_project
maxWriteSize=10000
datasets=.*=somedataset
keyfile=somekey.json
name=sink-bigquery-stream
schemaRetriever=com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever
key.converter=org.apache.kafka.connect.storage.StringConverter
tableWriteWait=1000
bufferSize=100000
This is the connector configuration using batch load:

connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
gcsBucketName=some_bucket
autoUpdateSchemas=true
sanitizeTopics=true
autoCreateTables=true
tasks.max=4
topics=sometopic
schemaRegistryLocation=http://schema-registry:8081
project=some_project
maxWriteSize=10000
datasets=.*=somedataset
enableBatchLoad=sometopic
keyfile=somekey.json
name=sink-bigquery
schemaRetriever=com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever
key.converter=org.apache.kafka.connect.storage.StringConverter
tableWriteWait=1000
bufferSize=100000
I created another sink to BigQuery that records the number of messages I produce to Kafka for each given filename. As a result, 6 out of 336 files have missing records for batch load, while streaming insert has no missing records. Could someone give a hint on how to debug this, or is something wrong with my connector configuration? I'm using this version of the connector: kafka-connect-bigquery:1.6.6