apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Ability to pickle the `Catalog` #514

Open Fokko opened 8 months ago

Fokko commented 8 months ago

Feature Request / Improvement

Making the `Catalog` picklable would allow the object to be distributed to workers within Ray.

kevinjqliu commented 8 months ago

Copied from the other thread: https://github.com/apache/iceberg-python/issues/513#issuecomment-2009875953

I was looking at the Pydantic docs, and it looks like there's a helpful function that supports pickling and unpickling: https://docs.pydantic.dev/latest/concepts/serialization/#pickledumpsmodel

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

kevinjqliu commented 2 months ago

@dev-goyal I see that you've merged ray-project/ray#46889. I'm curious whether you think pickling the Catalog or Table is still valuable to the Ray project.

kevinjqliu commented 2 months ago

More context: https://github.com/ray-project/ray/pull/42235#discussion_r1520929199

dev-goyal commented 1 month ago

> @dev-goyal I see that you've merged ray-project/ray#46889. I'm curious whether you think pickling the Catalog or Table is still valuable to the Ray project.

I think it could help, but I did not find that recreating the catalog each time added much overhead. So it would likely just make the code cleaner, is all.

kevinjqliu commented 1 month ago

> So, it will likely just make the code cleaner is all

Makes sense. The difference is between directly pickling the catalog object so that it can be distributed to Ray workers, and the current implementation, which passes the catalog_kwargs so that the catalog is reconstructed on the worker node.
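
To make the two approaches concrete, here is a minimal sketch using only the standard library. `DemoCatalog` and `_rebuild_catalog` are hypothetical stand-ins invented for illustration, not PyIceberg or Ray APIs, and the `uri` value is a placeholder. The sketch assumes the catalog holds some live state that cannot be pickled (modelled here with a `threading.Lock`) and that the object can be fully rebuilt from its constructor arguments.

```python
import pickle
import threading


class DemoCatalog:
    """Hypothetical stand-in for a catalog; not the PyIceberg API."""

    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties
        # Stand-in for live state (connection, session) that cannot be pickled.
        self._lock = threading.Lock()

    def __reduce__(self):
        # Pickle only the constructor arguments. Unpickling calls
        # _rebuild_catalog(name, properties), which creates fresh live state.
        return (_rebuild_catalog, (self.name, self.properties))


def _rebuild_catalog(name, properties):
    # Hypothetical helper: reconstruct the catalog from its arguments.
    return DemoCatalog(name, **properties)


# Approach 1: ship plain kwargs and rebuild the catalog on the worker.
catalog_kwargs = {"name": "demo", "uri": "http://localhost:8181"}
rebuilt = DemoCatalog(**catalog_kwargs)

# Approach 2: ship the object itself; pickle performs the rebuild.
original = DemoCatalog("demo", uri="http://localhost:8181")
restored = pickle.loads(pickle.dumps(original))
```

In both cases the worker ends up constructing a new catalog from the same arguments, which is consistent with the observation above that direct pickling mainly makes the calling code cleaner rather than cheaper.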