airbytehq / PyAirbyte

PyAirbyte brings the power of Airbyte to every Python developer.
https://docs.airbyte.com/pyairbyte

💡 Feature request: Postgres source #82

Open WangCHEN9 opened 6 months ago

WangCHEN9 commented 6 months ago

It would be really nice if we could support more database source connectors :)

aaronsteers commented 6 months ago

@WangCHEN9 - Thanks for logging this. We're interested in learning more about your use case. Specifically:

  1. Do you want to replicate data from Postgres to another cache/destination, like Snowflake or a different Postgres DB? Or do you just want to get that data locally so it is available to your python code, in pandas/AI/etc.?
  2. For your use case, do you want to take advantage of built-in Postgres-native CDC features, such as auto-detecting new records with the WAL log (described here)? The alternative would be column-based incremental sync, for instance using an updated_at column or similar to detect new records.
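
To illustrate the column-based option in question 2, here is a minimal sketch of cursor-style incremental sync. It is not how PyAirbyte or the Postgres connector implements it: an in-memory SQLite database stands in for Postgres so the example runs anywhere, and the `users` table and the `fetch_new_rows` helper are hypothetical names chosen for illustration.

```python
import sqlite3


def fetch_new_rows(conn, table, cursor_column, last_seen):
    """Return rows whose cursor column is strictly greater than last_seen, oldest first."""
    # Identifiers are interpolated only because they are hard-coded and trusted in this sketch.
    query = (
        f"SELECT id, name, {cursor_column} FROM {table} "
        f"WHERE {cursor_column} > ? ORDER BY {cursor_column}"
    )
    return conn.execute(query, (last_seen,)).fetchall()


# Assumption: SQLite stands in for Postgres; timestamps are ISO-8601 strings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "ada", "2024-01-01T00:00:00"),
        (2, "bob", "2024-01-02T00:00:00"),
    ],
)

# First sync: no saved cursor yet; the empty string sorts before any ISO timestamp.
cursor_value = ""
first_batch = fetch_new_rows(conn, "users", "updated_at", cursor_value)
cursor_value = first_batch[-1][2]  # remember the highest updated_at seen so far

# Between syncs, one row is updated and one row is inserted.
conn.execute("UPDATE users SET name = 'bobby', updated_at = '2024-01-03T00:00:00' WHERE id = 2")
conn.execute("INSERT INTO users VALUES (3, 'cy', '2024-01-04T00:00:00')")

# Second sync: only rows changed after the saved cursor are returned.
second_batch = fetch_new_rows(conn, "users", "updated_at", cursor_value)
print(second_batch)
```

One trade-off worth keeping in mind: a cursor column only surfaces rows that still exist, so hard deletes are invisible to this approach, whereas WAL-based CDC can report them.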
WangCHEN9 commented 6 months ago

Hi @aaronsteers,

I have 2 main use cases in mind:

To answer your questions:

  1. Yes, I am interested in replicating data from Postgres to S3 (with the help of the DuckDB COPY function).
  2. I would prefer to use an updated_at column for incrementally loading new records (it is easier for ingestion later on, when you want to load the data as files).
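
The two answers above could be combined roughly as follows. This is a hedged sketch, not something PyAirbyte provides: `build_copy_statement` and `run_export` are hypothetical helpers, the bucket, table, and alias names are made up, and `run_export` assumes the third-party `duckdb` package with its `httpfs` and `postgres` extensions plus S3 credentials configured separately. Only the statement builder is exercised here; the export itself has not been run against a real database.

```python
def build_copy_statement(source_table: str, cursor_column: str, last_seen: str, target: str) -> str:
    """Build a DuckDB COPY statement that exports rows newer than last_seen to Parquet."""

    def quote(value: str) -> str:
        # Double any single quotes so the value is safe inside a SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    # Identifiers are assumed to be trusted, hard-coded names in this sketch.
    return (
        f"COPY (SELECT * FROM {source_table} "
        f"WHERE {cursor_column} > {quote(last_seen)}) "
        f"TO {quote(target)} (FORMAT PARQUET)"
    )


def run_export(statement: str, postgres_dsn: str) -> None:
    """Attach Postgres in DuckDB as 'pg' and run the COPY statement (needs duckdb installed)."""
    import duckdb  # imported lazily so the builder above has no third-party dependency

    con = duckdb.connect()
    for extension in ("httpfs", "postgres"):
        con.execute(f"INSTALL {extension}")
        con.execute(f"LOAD {extension}")
    # Assumption: postgres_dsn is a libpq-style string such as 'dbname=app host=127.0.0.1'.
    con.execute(f"ATTACH '{postgres_dsn}' AS pg (TYPE postgres)")
    con.execute(statement)


# Hypothetical names: table pg.public.users and bucket example-bucket.
sql = build_copy_statement(
    "pg.public.users",
    "updated_at",
    "2024-01-02T00:00:00",
    "s3://example-bucket/users/batch_0001.parquet",
)
print(sql)
```

Writing one Parquet file per batch keeps each incremental load self-contained, which matches the point that files are easier to ingest later.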

Thanks, Wang

aaronsteers commented 6 months ago

@WangCHEN9 - Thanks very much for this explanation.

I've logged a couple of different paths forward. None of these approaches are trivial, unfortunately...

The most direct/obvious solution would be #87, but there are some technical barriers to us implementing this. There's another path forward in #85, which might be a smoother path for your use case. This 'cache-to-cache' implementation also has its own challenges, but those are more about us designing a good developer experience and less about actual technical hurdles.

In #87 I noted a workaround, which would be to pre-install the Java connector. Would love your thoughts and upvotes on any of those approaches. Thanks! 🙏

WangCHEN9 commented 6 months ago

Hi @aaronsteers,

I will definitely upvote #85, because it will be able to unlock many more use cases, especially with the power of DuckDB.

As for #87, personally I don't like it. Asking users to install Java or Docker is too much work; we would kind of lose the advantage of PyAirbyte.

Wang

aaronsteers commented 6 months ago

@WangCHEN9 - This feedback is very helpful. Thank you!

Will keep you posted.