airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Port MongoDB Source to Java #3428

Closed sherifnada closed 3 years ago

sherifnada commented 3 years ago

Tell us about the new integration you’d like to have

MongoDB is a critical source to support. Our current connector was contributed by a user, and while the implementation is generally high quality, it is written in Ruby, whereas the Airbyte Core team's proficiencies are Java and Python. This makes us much slower to implement features and bugfixes. We would therefore like to port the connector to one of our core languages in order to offer a better SLA and support.

Describe the alternative you are considering or using

Continue to use the current Ruby-based connector.

Implementation:

Test container to use: https://www.testcontainers.org/modules/databases/mongodb/
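As a starting point, here is a minimal sketch of how the linked Testcontainers module could back the integration tests, assuming the official `mongodb-driver-sync` client. The image tag and the database/collection names are placeholders:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;
import org.testcontainers.containers.MongoDBContainer;
import org.testcontainers.utility.DockerImageName;

public class MongoDbSourceTestHarness {

  public static void main(String[] args) {
    // Spins up a throwaway MongoDB in Docker; the image tag is a placeholder.
    try (MongoDBContainer mongo = new MongoDBContainer(DockerImageName.parse("mongo:4.4"))) {
      mongo.start();
      // The container hands us a ready-made connection string for the test run.
      try (MongoClient client = MongoClients.create(mongo.getReplicaSetUrl())) {
        client.getDatabase("test")
            .getCollection("acceptance")
            .insertOne(new Document("_id", 1).append("name", "fixture"));
        System.out.println("documents: "
            + client.getDatabase("test").getCollection("acceptance").countDocuments());
      }
    } // container is stopped and removed when the try-with-resources block exits
  }
}
```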

Todo:

  1. Investigate the possibility of using a JDBC driver for MongoDB. UnityJDBC (https://www.unityjdbc.com/mongojdbc/mongo_jdbc.php) appears to be the only JDBC driver, but it is paid, which is not an option for us. Another option to check is https://docs.mongodb.com/datalake/tutorial/jdbc-driver/ together with https://search.maven.org/search?q=a:mongodb-jdbc, but it is unclear how well that driver supports all MongoDB deployments; it seems to be specific to Atlas Data Lake, should we use JDBC at all.
  2. Generate the new connector and implement the connections.
  3. Create unit tests.
  4. Create integration tests.
  5. Comprehensive tests: use the existing mongo source.


Notes: It seems like the JDBC driver provided by UnityJDBC is paid, so we have the same situation here as we did for BigQuery. @DoNotPanicUA is currently working on refactoring the db sources implementation to make core handle such cases better, so there is no value in starting work on this ticket until #4024 and #1876 are completed. We would then also need to support the non-JDBC test basics. As this is a non-JDBC and even non-SQL DB, additional work in the core part will also be required.

sherifnada commented 3 years ago

Mongo is schemaless, which will be tricky to support. The most difficult problem we need to work out for the MVP release is how to support incremental sync, and to do that we need to discover the schema of each collection (table). I propose that we sample 10,000 records (or some other number) from each collection to discover the schema.

We should transform the discovered schema as follows (a sketch follows this list):

  1. If a column is a simple property (not an object or array), the schema should preserve its type.
  2. If a column is an object or an array, the schema should say that it's an object or array, but make no further attempt to describe its inner structure.
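As a rough illustration of both points, here is a minimal sketch assuming the official `mongodb-driver-sync` Java driver. It samples documents server-side with `$sample` and reduces each field to a JSON Schema type, preserving simple types and collapsing objects and arrays per the rules above; the fallback to `string` on type collisions is an assumption, not settled design:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.BsonDocument;
import org.bson.BsonType;
import org.bson.BsonValue;
import org.bson.Document;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class MongoSchemaSampler {

  private static final int SAMPLE_SIZE = 10_000; // the sample size proposed above

  // Infers a field -> JSON Schema type map for one collection by sampling documents.
  public static Map<String, String> discoverSchema(MongoCollection<BsonDocument> collection) {
    Map<String, String> schema = new HashMap<>();
    // $sample makes the server pick random documents, so we never scan the whole collection.
    for (BsonDocument doc : collection.aggregate(
        Collections.singletonList(new Document("$sample", new Document("size", SAMPLE_SIZE))))) {
      for (Map.Entry<String, BsonValue> field : doc.entrySet()) {
        schema.merge(field.getKey(), toJsonSchemaType(field.getValue().getBsonType()),
            // Assumption: if two samples disagree on a field's type, fall back to "string".
            (a, b) -> a.equals(b) ? a : "string");
      }
    }
    return schema;
  }

  // Rule 1: simple properties keep their type. Rule 2: objects and arrays are
  // labelled as such, with no attempt to describe their inner structure.
  private static String toJsonSchemaType(BsonType type) {
    switch (type) {
      case DOCUMENT:   return "object";
      case ARRAY:      return "array";
      case INT32:
      case INT64:      return "integer";
      case DOUBLE:
      case DECIMAL128: return "number";
      case BOOLEAN:    return "boolean";
      default:         return "string"; // ObjectIds, dates, etc. serialize as strings
    }
  }

  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<BsonDocument> users =
          client.getDatabase("demo").getCollection("users", BsonDocument.class);
      discoverSchema(users).forEach((field, type) -> System.out.println(field + ": " + type));
    }
  }
}
```

Sampling trades accuracy for speed: a field that appears only in rare documents can be missed entirely, which is one reason the sample size should probably stay configurable.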

This connector must support full refresh and incremental sync.

nathan5280 commented 3 years ago

Past MVP, it seems like we could do some additional normalization work:

1) If a field is an object, scan some number of them to discover their schema and break them out into a child table.
2) If a field is a list, scan some number of them to discover their schema and break them out into a child table. Preserve the index as a new column.

If we could do that recursively it would be awesome; see the sketch below. This would take the schemaless Mongo data and get it into a reasonably normalized schema in a relational model. For reasonably well-formed Mongo documents this would save a ton of custom dbt transforms.
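A rough sketch of what that recursive break-out could look like, operating on BSON documents from the Java driver. The child-table naming and the `_parent_id`/`_index` columns are illustrative conventions, not an existing Airbyte API:

```java
import org.bson.BsonArray;
import org.bson.BsonDocument;
import org.bson.BsonValue;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MongoNormalizer {

  // One emitted relational row: target table name plus column -> value.
  public record Row(String table, Map<String, Object> columns) {}

  // Recursively flattens one document. Nested objects become rows in a child
  // table named <parent>_<field>; array elements do the same and additionally
  // record their position in an _index column, as suggested above.
  public static List<Row> normalize(String table, String parentId, Integer index, BsonDocument doc) {
    List<Row> rows = new ArrayList<>();
    Map<String, Object> columns = new LinkedHashMap<>();
    columns.put("_parent_id", parentId); // hypothetical foreign key back to the parent row
    if (index != null) {
      columns.put("_index", index); // preserves the list position as a new column
    }
    for (Map.Entry<String, BsonValue> field : doc.entrySet()) {
      String childTable = table + "_" + field.getKey();
      BsonValue value = field.getValue();
      if (value.isDocument()) {
        rows.addAll(normalize(childTable, parentId, null, value.asDocument()));
      } else if (value.isArray()) {
        BsonArray array = value.asArray();
        for (int i = 0; i < array.size(); i++) {
          BsonValue element = array.get(i);
          BsonDocument elementDoc = element.isDocument()
              ? element.asDocument()
              : new BsonDocument("value", element); // wrap scalar elements in a one-column row
          rows.addAll(normalize(childTable, parentId, i, elementDoc));
        }
      } else {
        columns.put(field.getKey(), value); // simple property stays on this row
      }
    }
    rows.add(new Row(table, columns));
    return rows;
  }

  public static void main(String[] args) {
    BsonDocument doc = BsonDocument.parse(
        "{\"name\": \"Ada\", \"address\": {\"city\": \"London\"}, \"tags\": [\"a\", \"b\"]}");
    normalize("users", "id-1", null, doc).forEach(System.out::println);
  }
}
```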

tuliren commented 3 years ago

Many Mongo users use Mongoose to define their schemas and communicate with Mongo. It would be cool if users could just drop in the Mongoose schema for each collection and have the source connector convert it to JSON Schema.
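To make the idea concrete, here is a toy sketch mapping Mongoose-style type declarations to JSON Schema types. A real converter would have to parse actual Mongoose schema files (which are JavaScript); the field names and the supported type set here are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MongooseToJsonSchema {

  // Maps a Mongoose-style field -> type declaration (e.g. "String", "Number",
  // "[String]") to JSON Schema types. Illustrative subset only.
  public static Map<String, String> convert(Map<String, String> mongooseFields) {
    Map<String, String> jsonSchema = new LinkedHashMap<>();
    mongooseFields.forEach((field, type) -> jsonSchema.put(field, toJsonType(type)));
    return jsonSchema;
  }

  private static String toJsonType(String mongooseType) {
    if (mongooseType.startsWith("[")) return "array"; // e.g. "[String]"
    switch (mongooseType) {
      case "String":  return "string";
      case "Number":  return "number";
      case "Boolean": return "boolean";
      case "Date":
      case "ObjectId": return "string"; // serialized as strings
      default:         return "object";
    }
  }

  public static void main(String[] args) {
    Map<String, String> user = new LinkedHashMap<>();
    user.put("name", "String");
    user.put("age", "Number");
    user.put("tags", "[String]");
    System.out.println(convert(user)); // {name=string, age=number, tags=array}
  }
}
```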