airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
16.27k stars 4.15k forks source link

Destination Bigquery: fail normalization when dataset_id start with number #6732

Closed marcosmarxm closed 2 years ago

marcosmarxm commented 3 years ago
## Enviroment - **Airbyte version**: 0.30.3-alpha - **OS Version / Instance**: macOS - **Deployment**: Docker - **Source Connector and version**: Hubspot - **Destination Connector and version**: Bigquery 0.4.0 - **Severity**: Very Low / Low / Medium / **High** / Critical - **Step where error happened**: Sync Job ## Current Behavior One SQL generated file example, the dataset is wrong where `_111testing_raw` should be only `111testing_raw`. ``` select id, name, type, enabled, updatedAt, insertedAt, personaTagIds, contactListIds, _airbyte_emitted_at, _airbyte_workflows_hashid from __dbt__CTE__workflows_ab3 -- workflows from `dataline-integration-testing`._111testing_raw._airbyte_raw_workflows ); ``` ## Expected Behavior *Tell us what should happen.* ## Logs [logs-3-0.txt](https://github.com/airbytehq/airbyte/files/7280851/logs-3-0.txt) ## Steps to Reproduce 1. 2. 3. ## Are you willing to submit a PR?

Remove this with your answer.

sherifnada commented 2 years ago

as part of acceptance criteria we should add a test case for this as well

ChristopheDuong commented 2 years ago

This exception comes from the destination-bigquery:

2021-10-04 19:20:23 INFO () DefaultAirbyteStreamFactory(lambda$create$0):53 - 2021-10-04 19:20:23 INFO i.a.i.d.b.BigQueryUtils(createTable):92 - {} - Table: GenericData{classInfo=[datasetId, projectId, tableId], {datasetId=111testing_raw, tableId=_airbyte_tmp_qny_campaigns}} created successfully
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 - Exception in thread "main" com.google.cloud.bigquery.BigQueryException: 404 Not Found
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 - POST https://www.googleapis.com/upload/bigquery/v2/projects/dataline-integration-testing/jobs?uploadType=resumable
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 - {
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -   "error": {
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     "code": 404,
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     "message": "Not found: Dataset dataline-integration-testing:111testing_raw",
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     "errors": [
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -       {
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -         "message": "Not found: Dataset dataline-integration-testing:111testing_raw",
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -         "domain": "global",
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -         "reason": "notFound"
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -       }
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     ],
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     "status": "NOT_FOUND"
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -   }
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 - }
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 - 
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.open(HttpBigQueryRpc.java:655)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.cloud.bigquery.TableDataWriteChannel$2.call(TableDataWriteChannel.java:87)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.cloud.bigquery.TableDataWriteChannel$2.call(TableDataWriteChannel.java:82)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.cloud.bigquery.TableDataWriteChannel.open(TableDataWriteChannel.java:81)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.cloud.bigquery.TableDataWriteChannel.<init>(TableDataWriteChannel.java:41)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.cloud.bigquery.BigQueryImpl.writer(BigQueryImpl.java:1388)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at io.airbyte.integrations.destination.bigquery.BigQueryDestination.getConsumer(BigQueryDestination.java:214)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at io.airbyte.integrations.base.IntegrationRunner.run(IntegrationRunner.java:127)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at io.airbyte.integrations.destination.bigquery.BigQueryDestination.main(BigQueryDestination.java:337)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 - Caused by: com.google.api.client.http.HttpResponseException: 404 Not Found
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 - POST https://www.googleapis.com/upload/bigquery/v2/projects/dataline-integration-testing/jobs?uploadType=resumable
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 - {
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -   "error": {
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     "code": 404,
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     "message": "Not found: Dataset dataline-integration-testing:111testing_raw",
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     "errors": [
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -       {
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -         "message": "Not found: Dataset dataline-integration-testing:111testing_raw",
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -         "domain": "global",
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -         "reason": "notFound"
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -       }
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     ],
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     "status": "NOT_FOUND"
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -   }
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 - }
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 - 
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1116)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.open(HttpBigQueryRpc.java:652)
2021-10-04 19:20:24 ERROR () LineGobbler(voidCall):65 -     ... 11 more
2021-10-04 19:21:47 ERROR () DefaultReplicationWorker(run):128 - Sync worker failed.

Do we know why this is being thrown? and what do/did we do about it?

The official docs from Google BigQuery also states:

Unquoted identifiers must begin with a letter or an underscore character. Subsequent characters can be letters, numbers, or underscores
These are invalid identifiers:
5Customers

https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical

That's why it was implemented to insert an extra underscore in normalization in these cases and the final tables would be created (or queried) in dataline-integration-testing._111testing_raw

Of course, this would assume that the destination-bigquery would also follow the same logic and output the raw table there in the first place too...

Here it seems to apply the logic to table names but maybe not to datasets?

https://github.com/airbytehq/airbyte/blob/2ddf0bcf6515686a638a62d62f91cc438a2b4e2f/airbyte-integrations/connectors/destination-bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQuerySQLNameTransformer.java#L15