airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Source Mongodb: using Amazon DocumentDB is replacing all fields called "identifier" with empty strings #17397

Closed: krosenst closed this issue 9 months ago

krosenst commented 2 years ago
## Environment

- **Airbyte version**: 0.39.41-alpha
- **OS Version / Instance**: macOS
- **Deployment**: Docker
- **Source Connector and version**: dev version of MongoDB (explained below)
- **Destination Connector and version**: Snowflake 0.4.34
- **Step where error happened**: Sync job

## Current Behavior

Note: I was using a dev version of the MongoDB connector because AWS requires the use of its own Certificate Authorities. A coworker of mine forked the Airbyte repo and added this support so we could see whether we could automate exports from one of our DocumentDB instances to Snowflake. The bug appeared after this fix was implemented in our fork.

The documents within some of our DocumentDB collections contain fields called "identifier". Some of these fields are arrays and others are objects. Regardless of the data type, if the field is called "identifier", its value is replaced with an empty string.

Example:

A document I see in DocumentDB:

```
{
  "_id": "",
  "active": true,
  "identifier": [{"system": "", "value": ""}],
  "managingOrganization": {
    "identifier": {"system": "", "value": ""}
  },
  "meta": {
    "lastUpdated": "",
    "source": ""
  }
}
```

The same document in Snowflake:

```
{
  "_id": "",
  "active": true,
  "identifier": "",
  "managingOrganization": {
    "identifier": ""
  },
  "meta": {
    "lastUpdated": "",
    "source": ""
  }
}
```

What I've tried so far:

1. I first set up my own standalone MongoDB instance (once using version 6.0 and again using version 4.2, since that is the MongoDB version our DocumentDB uses) and added documents that follow the schema above. I was unable to replicate the issue this way.
2. I then logged into our DocumentDB instance and added dummy data very similar to the above. Some documents I kept as-is; in others I renamed the "identifier" fields to something else (such as "org_identifier"). When I ran a sync after this, the "identifier" fields were replaced with empty strings, as expected, but the fields I had renamed to "org_identifier" and other names were retained.

Judging by the above, there is a chance that this bug comes from inconsistencies in DocumentDB's emulation of MongoDB's API, so that is worth noting. A support ticket has been submitted to Amazon.

## Expected Behavior

No data should be lost in the documents/JSONs.

## Logs

[https://drive.google.com/file/d/1vlXl_hAOoHOshq4NE0PTC8Spx6s9OeBY/view?usp=sharing](https://drive.google.com/file/d/1vlXl_hAOoHOshq4NE0PTC8Spx6s9OeBY/view?usp=sharing)

The warning says: "Schema validation errors found for stream bonfireresources. Error messages: [$.subject.identifier is of an incorrect type. Expected it to be object, $.informationSource.identifier is of an incorrect type. Expected it to be object]"

## Steps to Reproduce

1. Not sure how easily reproducible this is without doing all of the work to allow for importing custom authorities for AWS.

## Are you willing to submit a PR?

Possibly
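For anyone trying to narrow this down, the comparison described above can be sketched in Python as a small diff between a source document and its synced copy. This is only an illustrative helper, not part of Airbyte or the connector: the function name `collapsed_fields` and the variables `source_doc` and `dest_doc` are hypothetical, and the sample documents reuse the redacted example from this report.

```python
from typing import Any, Iterator


def collapsed_fields(source: Any, dest: Any, path: str = "$") -> Iterator[str]:
    """Yield JSON paths where a non-empty source value became "" in dest."""
    if dest == "" and source not in ("", None):
        # The destination holds an empty string but the source held real data.
        yield path
    elif isinstance(source, dict) and isinstance(dest, dict):
        for key, value in source.items():
            if key in dest:
                yield from collapsed_fields(value, dest[key], f"{path}.{key}")
    elif isinstance(source, list) and isinstance(dest, list):
        for index, (src_item, dst_item) in enumerate(zip(source, dest)):
            yield from collapsed_fields(src_item, dst_item, f"{path}[{index}]")


# Redacted example documents from this report (values are placeholders).
source_doc = {
    "_id": "",
    "active": True,
    "identifier": [{"system": "", "value": ""}],
    "managingOrganization": {"identifier": {"system": "", "value": ""}},
    "meta": {"lastUpdated": "", "source": ""},
}
dest_doc = {
    "_id": "",
    "active": True,
    "identifier": "",
    "managingOrganization": {"identifier": ""},
    "meta": {"lastUpdated": "", "source": ""},
}

collapsed = list(collapsed_fields(source_doc, dest_doc))
print(collapsed)
```

Running this over a sample of exported documents would show whether only paths ending in `identifier` are affected, which is the pattern reported here.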
marcosmarxm commented 2 years ago

@krosenst do you mind uploading the log file here? One possible cause is that the MongoDB connector is identifying those fields as strings and then failing to convert them because they are actually arrays or nested objects. Can you confirm that your custom connector is up to date with the latest version of the MongoDB connector? And, as you said, you aren't able to reproduce the issue using the native connector, correct?
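One way to check the type hypothesis above is to list which JSON types actually occur at each "identifier" path across a sample of documents. The sketch below is a hypothetical helper written for this thread, not connector code: `observed_types`, `json_type`, and the second sample document are assumptions for illustration, and it makes no claim about how the connector infers its schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def json_type(value: Any) -> str:
    """Map a Python value to a JSON type name."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    return type(value).__name__


def observed_types(docs: Iterable[Any], field_name: str = "identifier") -> Dict[str, Set[str]]:
    """Collect the set of JSON types seen at every path ending in field_name."""
    seen: Dict[str, Set[str]] = defaultdict(set)

    def walk(value: Any, path: str) -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                child_path = f"{path}.{key}"
                if key == field_name:
                    seen[child_path].add(json_type(child))
                walk(child, child_path)
        elif isinstance(value, list):
            for item in value:
                walk(item, f"{path}[]")

    for doc in docs:
        walk(doc, "$")
    return dict(seen)


sample_docs = [
    # Shape taken from the example document in this issue.
    {
        "identifier": [{"system": "", "value": ""}],
        "managingOrganization": {"identifier": {"system": "", "value": ""}},
    },
    # Hypothetical second document where the same field is an object.
    {"identifier": {"system": "", "value": ""}},
]

types_by_path = observed_types(sample_docs)
for path, types in sorted(types_by_path.items()):
    print(path, sorted(types))
```

If a path shows more than one type, that would be worth comparing against the schema the connector discovered for the stream.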

krosenst commented 2 years ago

@marcosmarxm I considered this too, but there are other nested objects/arrays throughout these documents, and those aren't losing their values (only the "identifier" fields are). And yes, I can confirm the custom connector is up to date with the latest version of the MongoDB connector, and yes, I couldn't replicate the issue using a local standalone MongoDB instance. The reason I say it would be difficult to replicate is that it would likely require standing up a DocumentDB instance rather than a MongoDB instance, and if you try that you will likely face the same Certificate Authority issues we faced. Attached: logs-14.txt

marcosmarxm commented 2 years ago

@krosenst unfortunately you are using the MongoDB connector with DocumentDB, and the team can't ensure it will work as expected. The team's suggestion is that you read from DocumentDB and debug the connector yourself to try to solve the issue. This issue won't be prioritized on the current roadmap.