airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

🐛 Mongo to BigQuery: Long (Int64) is serialized as float-like (scientific) number #9590

Closed: joelluijmes closed this issue 5 months ago

joelluijmes commented 2 years ago
## Environment

- **Airbyte version**: 0.35.5-alpha
- **OS Version / Instance**: Google Kubernetes
- **Deployment**: Kubernetes
- **Source Connector and version**: Mongo V2 0.1.11

Current Behavior

In our Mongo source, we have large numeric data (i.e., of type long), for instance the value 3241201784 (> 2.1B). However, somewhere during the sync (I'm not sure in which component), the value is serialized as a float-like number: 3.241201784E9. This is a problem because the value is actually a numeric id, so I need the exact value, not a float-like approximation.

For the records I checked, the value as-is preserves the same precision (i.e., the same number of significant digits is present), so theoretically I could convert them back to long/int64. However, I'm not confident that this will always be the case, and/or what will happen once we roll over to >10B records.
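To illustrate, here's a minimal Java sketch (plain Jackson, independent of Airbyte's actual pipeline; the class name is just for the demo) showing how the scientific notation appears as soon as a long detours through a double:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class LongVsDouble {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        long id = 3241201784L; // > 2^31, so it only fits in a long

        // Written as a long: exact integer notation.
        System.out.println(mapper.writeValueAsString(id));          // 3241201784

        // Written after a detour through double: Java renders doubles with a
        // magnitude >= 10^7 in scientific notation, which matches what we see
        // in BigQuery. (Above 2^53 the value itself would also lose precision.)
        System.out.println(mapper.writeValueAsString((double) id)); // 3.241201784E9
    }
}
```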

Expected Behavior

Instead, I'd expect one of the following behaviors:

Steps to Reproduce

  1. Have a large number in a Mongo collection
  2. Sync it through Airbyte to BigQuery
  3. See the result as a scientific, float-like value

Are you willing to submit a PR?

Sure, but I could use some guidance on how to debug the full flow to see in which part the serialization goes wrong.

joelluijmes commented 2 years ago

Triage:

Airbyte did recognize the Mongo schema as number (see screenshot). FYI, the field with the large values is _id (a custom _id field; we are not using ObjectId here). Looking at the source, INT32, INT64, DOUBLE and DECIMAL are all treated as JSON number, so that part seems fine.

*(screenshot: the discovered schema, showing the field typed as number)*

Subsequently, I hypothesize the issue might be in the serialization to JSON (source), i.e., an issue with the library, or that upon deserialization in the destination connector the value is assumed to be float-like because it is so large.

harshithmullapudi commented 2 years ago

Hey, does it make sense to change it to a string and then convert it to whatever format you need via custom dbt? Does this sound good?

joelluijmes commented 2 years ago

If I can instruct Airbyte to treat it as a string, that'd be great! Is that possible? If you mean on the Mongo side: no, that won't be viable, as we would then have to change our applications.

auyer commented 2 years ago

Hi! I had the same issue while loading a BIGSERIAL from Postgres to S3 (Parquet). No dbt transformations involved (or even supported, for that matter).

joelluijmes commented 1 year ago

I noticed a PR was merged which should fix this issue (#14362). After my report, two other issues (#12606 and #12057) were opened, and these were supposedly closed by the aforementioned PR.

I just tested this again, and unfortunately the data is still synced as floats instead of integers. I created a new connection to ensure it ran schema discovery again.

Can this issue be investigated again?


Meanwhile we are running:

edgao commented 1 year ago

Ah, that was my bad; I didn't notice mongodb in the list of sources. Unfortunately, Mongo isn't a JDBC source (it's not even an RDBMS), so it wasn't solved by https://github.com/airbytehq/airbyte/pull/14362.

I did a little digging: updating this mapping is a good starting point (i.e., mapping INT32 and INT64 to JsonSchemaType.INTEGER instead of NUMBER). I probably can't get to this in a reasonable timeframe, but I'm super happy to help workshop a pull request with anyone interested. (There's probably some additional work to update the tests etc. that I didn't find on my first pass.)
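For anyone who wants to pick this up, here's a rough, self-contained sketch of the shape of that change (the enum and method are illustrative stand-ins, not the actual source-mongodb code; the real mapping keys off the BSON driver's `BsonType`):

```java
import org.bson.BsonType;

public class BsonToJsonSchemaSketch {
    // Stand-in for Airbyte's JsonSchemaType; only the relevant members.
    enum JsonSchemaType { INTEGER, NUMBER, STRING }

    static JsonSchemaType toJsonSchemaType(BsonType bsonType) {
        switch (bsonType) {
            case INT32:
            case INT64:
                // The proposed fix: discover BSON integer types as JSON Schema
                // integers instead of numbers, so destinations create integer
                // columns rather than floating-point ones.
                return JsonSchemaType.INTEGER;
            case DOUBLE:
            case DECIMAL128:
                // Genuinely fractional types stay as numbers.
                return JsonSchemaType.NUMBER;
            default:
                // Simplified fallback for this sketch only.
                return JsonSchemaType.STRING;
        }
    }
}
```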

Under the hood, what's happening is that source-mongodb is discover-ing these fields as numbers, so the destination creates a floating-point column. If we update source-mongodb to discover them as integers, that will probably be sufficient to make the destination write them correctly. (Theoretically we'd also want to update how source-mongodb read-s these values, but I think it's already handling integers correctly.)
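As a quick sanity check on that last point (plain MongoDB Java driver, no Airbyte code involved), a BSON int64 does survive a read as an exact long:

```java
import org.bson.Document;

public class ReadInt64Check {
    public static void main(String[] args) {
        // Extended JSON for a BSON int64 value.
        Document doc = Document.parse("{\"_id\": {\"$numberLong\": \"3241201784\"}}");
        long id = doc.getLong("_id");
        System.out.println(id); // 3241201784, exact; no scientific notation
    }
}
```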

marcosmarxm commented 5 months ago

Normalization and custom dbt are deprecated features and will soon be removed from the project. For that reason I'm closing the issue, as it won't be implemented anymore.