apache / opendal

Apache OpenDAL: One Layer, All Storage.
https://opendal.apache.org
Apache License 2.0
3.46k stars 486 forks source link

feat(bindings/python): support pickle [de]serialization for Operator #5324

Closed TennyZhuang closed 6 days ago

TennyZhuang commented 1 week ago

Which issue does this PR close?

Partially finished #5316

Rationale for this change

Now an Operator can be serialized and deserialized via pickle, as a result, the Operator class can be used in multiprocessing context.

What changes are included in this PR?

In the PR, we store the arguments for Operator constructor, so that we can recover the Operator from pickle deserialization.

Are there any user-facing changes?

@Zheaoli I'm not very familiar with Python binding, PTAL for the PR, thanks!

TennyZhuang commented 1 week ago

It seems like we should add a new capability MultiProcessingAccess to guard the behavior.

Zheaoli commented 6 days ago

It seems like we should add a new capability MultiProcessingAccess to guard the behavior.

SGTM, maybe Pickable would be better?

TennyZhuang commented 6 days ago

Pickable

IMO all operators can be serialized as pickle.

In most Python libraries, all objects can be serialized as pickle, but it may not be identical when deserialized. The behavior is acceptable.

We need the capability only for tests. We can ignore backends which doesn't support the capability in the test.

Xuanwo commented 6 days ago

I'm considering adding a shared capability to indicate whether this storage service can be shared across different processes or even different instances.

Based on this definition:

I believe this capability is better than the following:


I don't like the idea that:

We need the capability only for tests

I view it as a design failure for OpenDAL to include something in the public API solely for testing purposes. shared is somewhat useful for users who want to understand what will happen if they try to share the same storage services across processes or instances.

Zheaoli commented 6 days ago

I'm considering adding a shared capability to indicate whether this storage service can be shared across different processes or even different instances.

SGTM

TennyZhuang commented 6 days ago

I don't like the idea that:

Please note that pickle doesn’t mean shared. The user can pickle the operator, close the operator, pass the bytes to another process and unpickle the operator. So we should implement pickle for every operator, both shared and unique.

What we only can do is disable the tests for unique operator.

Xuanwo commented 6 days ago

Please note that pickle doesn’t mean shared. The user can pickle the operator, close the operator, pass the bytes to another process and unpickle the operator. So we should implement pickle for every operator, both shared and unique.

That's correct. As you mentioned, we just need to avoid testing it in a shared manner (which implies that we can access the same storage service after unpickling) if this storage service doesn't shared.

I prefer to add the shared capability, which I believe is a comprehensive solution. However, I'm also open to the python binding just testing against specified services like s3, gcs, and fs if you feel that's sufficient.

TennyZhuang commented 6 days ago

In fact, there are two types of capabilities in all backends.

  1. shared Access rights can be shared, all services backends implement this, while embedded engines don't.

  2. send Access rights can be transferred Almost all backends implement this, except for memory backend such as dashmap.

Of course, we don't need to be so strict and complicated, I support temporarily adding only shared capability.

Xuanwo commented 6 days ago

I support temporarily adding only shared capability.

Would you like to add this capability? I believe we can include them in the next 0.51 release.

TennyZhuang commented 6 days ago

I support temporarily adding only shared capability.

Would you like to add this capability? I believe we can include them in the next 0.51 release.

I can add it.

Please note that shared only indicates shared between processes but not shared between threads, it may be a little confusing if we will add backends which only support single-thread later.

However, I guess we will not support that in a while.

Xuanwo commented 6 days ago

Please note that shared only indicates shared between processes but not shared between threads, it may be a little confusing if we will add backends which only support single-thread later.

However, I guess we will not support that in a while.

Currently, we don't offer any services that aren't Send because Access requires services to be both Send and Sync. There are no plans to add such services, but we can revisit this decision if necessary.

TennyZhuang commented 6 days ago

Will test another PR use the CI flow, just ignore it.