airbytehq / PyAirbyte

PyAirbyte brings the power of Airbyte to every Python developer.
https://docs.airbyte.com/pyairbyte

Add multi-source support for caches #3

Open aaronsteers opened 8 months ago

aaronsteers commented 8 months ago

We have logged this issue to add support for data from multiple sources to be saved within the same cache.

Our implementation might already support this, since our internal caches and streams tables are (in theory) able to support data from multiple source names.

Before investing development effort, we should probably prioritize a few tests to confirm whether this already works. As things stand, this is relatively low priority.

aaronsteers commented 6 months ago

@bindipankhudi - Here is the example notebook I was referring to earlier.

https://colab.research.google.com/drive/1YC_vCfrEwO7SzZFCN1X2PwevMLeGYDeC#scrollTo=Y-0YC-Qhl80W

Specifically, this part:

[image: screenshot from the notebook]

While I didn't explicitly declare or assign a cache, I believe these would all default to the equivalent of get_default_cache().

Also, I'm not sure what would happen if these had streams sharing the same name.

bindipankhudi commented 6 months ago

When the same stream name exists in multiple sources, things don't work. For instance, in this notebook: https://colab.research.google.com/drive/197-utzu1I0iMd5Gua0tyFUL2Gu_LFws1?usp=shari we are using source-faker and source-github, both of which have a "users" stream. We load from GitHub first, and then loading from faker fails because the cache expects the schema columns from GitHub.
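The failure mode can be illustrated with a toy model. This is not PyAirbyte code: it assumes, for illustration only, that a cache keeps one SQL table per stream name with no source qualifier. The `load_stream` helper and the sample columns are made up.

```python
import sqlite3


def load_stream(conn, stream_name, records):
    """Toy loader (hypothetical): one table per stream name, created on first load."""
    columns = list(records[0].keys())
    col_list = ", ".join(f'"{c}"' for c in columns)
    # The table is only created the first time this stream name is seen.
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{stream_name}" ({col_list})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{stream_name}" ({col_list}) VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in records],
    )


conn = sqlite3.connect(":memory:")

# Made-up sample records; the real streams have different columns.
github_users = [{"login": "octocat", "site_admin": 0}]
faker_users = [{"id": 1, "name": "Alice", "email": "alice@example.com"}]

load_stream(conn, "users", github_users)  # creates "users" with GitHub-style columns

collision_error = None
try:
    # The table already exists, so the faker-style columns are not found.
    load_stream(conn, "users", faker_users)
except sqlite3.OperationalError as exc:
    collision_error = str(exc)

print(collision_error)
```

In this sketch the second load fails with a low-level database error that says nothing about the two sources involved, which is roughly the experience described above.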

bindipankhudi commented 6 months ago

Let's see if we can fail with an accurate message.
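One possible shape for such a check, as a minimal sketch: compare the incoming stream's columns with those of the existing table before writing, and raise an error that names both sources. `StreamNameCollisionError` and `check_stream_compatible` are hypothetical names, not part of PyAirbyte, and the sketch assumes the cache can tell which source created a table.

```python
class StreamNameCollisionError(Exception):
    """Hypothetical error type for illustration; not part of PyAirbyte."""


def check_stream_compatible(
    stream_name,
    existing_source,
    existing_columns,
    incoming_source,
    incoming_columns,
):
    """Raise a descriptive error if the incoming columns do not fit the existing table."""
    missing = sorted(set(incoming_columns) - set(existing_columns))
    if missing:
        raise StreamNameCollisionError(
            f"Cannot load stream '{stream_name}' from '{incoming_source}': "
            f"the cache already holds a '{stream_name}' table created by "
            f"'{existing_source}', and it has no columns named {missing}."
        )


message = None
try:
    check_stream_compatible(
        stream_name="users",
        existing_source="source-github",
        existing_columns=["login", "site_admin"],  # made-up sample columns
        incoming_source="source-faker",
        incoming_columns=["id", "name", "email"],  # made-up sample columns
    )
except StreamNameCollisionError as exc:
    message = str(exc)

print(message)
```

Checking up front keeps the cache untouched when the load is rejected and gives the user both source names in the message.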

bindipankhudi commented 5 months ago

De-prioritizing and removing the iteration label for now. We will prioritize this if we hear related requests from customers.