Closed sherifnada closed 3 years ago
MongoDB is schemaless, which makes it tricky to support. The most difficult problem we need to work out for the MVP release is how to support incremental sync. To support incremental sync, we need to discover the schema of each collection (table). I propose that we sample 10,000 records (or some other number) from each collection to discover the schema.
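To make the sampling proposal concrete, here is a minimal sketch in Python. It is not the connector's implementation: the function names `type_name` and `discover_schema` and the type labels are assumptions made for illustration. It reads up to `sample_size` documents and records which value types were observed for each top-level field.

```python
from collections import defaultdict
from itertools import islice
from typing import Any, Dict, Iterable, List


def type_name(value: Any) -> str:
    """Map a Python value to a coarse, JSON-schema-like type label."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, (list, tuple)):
        return "array"
    return type(value).__name__  # e.g. ObjectId, datetime


def discover_schema(
    documents: Iterable[Dict[str, Any]], sample_size: int = 10_000
) -> Dict[str, List[str]]:
    """Infer field -> sorted list of observed types from at most sample_size documents."""
    observed = defaultdict(set)
    for doc in islice(documents, sample_size):
        for field, value in doc.items():
            observed[field].add(type_name(value))
    return {field: sorted(types) for field, types in observed.items()}


docs = [
    {"_id": 1, "name": "a"},
    {"_id": 2, "name": None, "tags": ["x"]},
]
schema = discover_schema(docs)
```

A field that shows up with more than one type (here `name` is both `string` and `null`) is reported with all of them, so the caller can decide how to widen the column. With a real collection, the iterable could be a driver cursor, for example the result of a `$sample` aggregation stage, rather than an in-memory list.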
We should transform the schema as follows:
This connector must support full refresh and incremental sync.
Past the MVP, it seems like we could do some additional normalization work:
1) If object, scan some number of them to discover their schema and break them out into a child table.
2) If list, scan some number of them to discover their schema and break them out into a child table. Preserve the index as a new column.
If we could do that recursively, it would be awesome. This would take schemaless MongoDB documents and get them into a reasonably normalized schema in a relational model. For reasonably well-formed MongoDB documents, this would save a ton of custom dbt transforms.
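The recursive break-out described above could look roughly like the following Python sketch. All names here are hypothetical (`normalize`, the `_row_id`, `_parent_id` and `_index` columns, and the `parent_field` naming of child tables); a real implementation would also need the schema-discovery step, which this sketch skips.

```python
from typing import Any, Dict, List, Optional

Tables = Dict[str, List[Dict[str, Any]]]


def normalize(
    doc: Dict[str, Any],
    table: str,
    tables: Tables,
    parent_id: Optional[str] = None,
    index: Optional[int] = None,
) -> Tables:
    """Flatten one document into rows, moving objects and lists to child tables."""
    rows = tables.setdefault(table, [])
    # Synthetic key (an assumption of this sketch) that child rows point back to.
    row: Dict[str, Any] = {"_row_id": f"{table}:{len(rows)}"}
    if parent_id is not None:
        row["_parent_id"] = parent_id
    if index is not None:
        row["_index"] = index  # preserve the position within the source list
    rows.append(row)

    for field, value in doc.items():
        child_table = f"{table}_{field}"
        if isinstance(value, dict):
            # 1) Object: break it out into a child table.
            normalize(value, child_table, tables, row["_row_id"])
        elif isinstance(value, list):
            # 2) List: one child row per element, keeping the index.
            for i, item in enumerate(value):
                item_doc = item if isinstance(item, dict) else {"value": item}
                normalize(item_doc, child_table, tables, row["_row_id"], i)
        else:
            row[field] = value  # scalar stays on the parent row
    return tables


doc = {"_id": 1, "name": "a", "address": {"city": "X"}, "tags": ["p", "q"]}
tables = normalize(doc, "users", {})
```

For this input, `tables` holds a `users` row with the scalar fields, one `users_address` row, and two `users_tags` rows with `_index` 0 and 1, each pointing back to `users:0` through `_parent_id`.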
Tell us about the new integration you’d like to have
MongoDB is a critical source to support. Our current connector was contributed by a user. However, while the implementation is generally high quality, it is written in Ruby, and the Airbyte Core team's proficiencies are Java & Python. This means that we are much slower to implement features & bugfixes due to a lack of proficiency in Ruby. So we'd like to port the connector over to one of our core languages in order to offer better SLA & support.
Describe the alternative you are considering or using
Continue to use the current Ruby-based connector.
Implementation:
test container to use: https://www.testcontainers.org/modules/databases/mongodb/
Todo:
use existing mongo source
Notes: It seems that the JDBC driver provided by unityjdbc is paid, so we have the same situation here as with BigQuery. @DoNotPanicUA is currently refactoring the db sources so that core handles such cases better, so there is no value in starting work on this ticket until #4024 and #1876 are completed. After that, we would also need to support the basics of non-JDBC tests. As this is a non-JDBC and even non-SQL database, additional work in core would also be required.