Closed: TennyZhuang closed this 6 days ago
It seems like we should add a new capability `MultiProcessingAccess` to guard the behavior.
> It seems like we should add a new capability `MultiProcessingAccess` to guard the behavior.
SGTM, maybe `Pickable` would be better?
IMO all operators can be serialized as pickle. In most Python libraries, all objects can be serialized with pickle, but the object may not be identical when deserialized; that behavior is acceptable.

We need the capability only for tests. We can ignore backends that don't support the capability in the tests.
I'm considering adding a `shared` capability to indicate whether this storage service can be shared across different processes or even different instances.

Based on this definition:

- `memory` is not shared: it cannot be accessed by other processes.
- `sled` is not shared: it has a unique lock that only one process can use.
- `fs` is shared: different processes can access the same fs as long as they have it mounted.
- `s3` is shared: obviously.

I believe this capability is better than the following:

- `multi_processing_access`: too long and doesn't align with the existing capability naming style.
- `pickable`: Python only and doesn't reflect the root cause.

I don't like the idea that:

> We need the capability only for tests

I view it as a design failure for OpenDAL to include something in the public API solely for testing purposes. `shared` is somewhat useful for users who want to understand what will happen if they try to share the same storage services across processes or instances.
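As an illustration of the user-facing value, a caller could consult such a flag before fanning an operator out to worker processes. This is only a sketch: the `Capability` class shown here and its `shared` attribute are stand-ins for the proposal, not the released OpenDAL API.

```python
class Capability:
    """Illustrative subset of a capability struct: just the proposed shared flag."""
    def __init__(self, shared: bool):
        self.shared = shared

def safe_for_multiprocessing(scheme: str, capability: Capability) -> bool:
    # fs/s3-like services point at storage every process can reach;
    # memory/sled-like services hold state that a single process owns.
    return capability.shared

assert safe_for_multiprocessing("s3", Capability(shared=True))
assert not safe_for_multiprocessing("memory", Capability(shared=False))
```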
> I'm considering adding a `shared` capability to indicate whether this storage service can be shared across different processes or even different instances.

SGTM
Please note that `pickle` doesn't mean `shared`. The user can pickle the operator, close the operator, pass the bytes to another process, and unpickle the operator. So we should implement pickle for every operator, both shared and unique.

All we can do is disable those tests for unique operators.
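The distinction can be demonstrated with a toy in-memory service (`MemoryOp` is a hypothetical stand-in, not the real binding): pickling always succeeds, but for a non-shared backend the unpickled copy is a fresh, empty store rather than a handle to the same data.

```python
import pickle

class MemoryOp:
    """Toy stand-in for a non-shared (in-memory) operator."""
    def __init__(self):
        self.data = {}

    def __reduce__(self):
        # Pickle only the construction recipe: unpickling calls MemoryOp()
        # again, so the clone is a brand-new empty store, not the same data.
        return (MemoryOp, ())

op = MemoryOp()
op.data["path"] = b"hello"

blob = pickle.dumps(op)     # these bytes could be sent to another process
clone = pickle.loads(blob)  # pickling succeeds: pickle != shared
assert clone.data == {}     # ...but the contents did not travel with it
```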
> Please note that `pickle` doesn't mean `shared`. The user can pickle the operator, close the operator, pass the bytes to another process and unpickle the operator. So we should implement pickle for every operator, both shared and unique.
That's correct. As you mentioned, we just need to avoid testing it in a shared manner (which implies that we can access the same storage service after unpickling) if the storage service isn't `shared`.
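The skipping logic could look roughly like the sketch below. Everything here is illustrative: `FakeOperator`, `FakeCapability`, and the `capability.shared` attribute are assumptions standing in for the real test-suite API.

```python
import pickle

class FakeCapability:
    def __init__(self, shared: bool):
        self.shared = shared

class FakeOperator:
    """Stand-in for a service operator that exposes its capability."""
    def __init__(self, scheme: str, shared: bool):
        self.scheme = scheme
        self.capability = FakeCapability(shared)

def roundtrip_services(ops):
    """Run the pickle round-trip check, skipping non-shared backends."""
    tested, skipped = [], []
    for op in ops:
        if not op.capability.shared:
            # memory/sled-like backends cannot see the same storage after
            # unpickling in another process, so the shared check is skipped.
            skipped.append(op.scheme)
            continue
        clone = pickle.loads(pickle.dumps(op))
        assert clone.scheme == op.scheme
        tested.append(op.scheme)
    return tested, skipped

ops = [FakeOperator("s3", True), FakeOperator("fs", True),
       FakeOperator("memory", False)]
tested, skipped = roundtrip_services(ops)
assert tested == ["s3", "fs"]
assert skipped == ["memory"]
```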
I prefer to add the `shared` capability, which I believe is a comprehensive solution. However, I'm also open to the Python binding just testing against specified services like `s3`, `gcs`, and `fs` if you feel that's sufficient.
In fact, there are two types of capabilities in all backends:

- `shared`: access rights can be shared. All service backends implement this, while embedded engines don't.
- `send`: access rights can be transferred. Almost all backends implement this, except for memory backends such as dashmap.

Of course, we don't need to be so strict and complicated; I support temporarily adding only the `shared` capability.
> I support temporarily adding only shared capability.

Would you like to add this capability? I believe we can include it in the next 0.51 release.
> I support temporarily adding only shared capability.
>
> Would you like to add this capability? I believe we can include it in the next 0.51 release.
I can add it.
Please note that `shared` only indicates sharing between processes, not between threads; it may be a little confusing if we later add backends that only support a single thread. However, I guess we won't support that for a while.
> Please note that `shared` only indicates shared between processes but not shared between threads, it may be a little confusing if we will add backends which only support single-thread later. However, I guess we will not support that in a while.
Currently, we don't offer any services that aren't `Send`, because `Access` requires services to be both `Send` and `Sync`. There are no plans to add such services, but we can revisit this decision if necessary.
Will test the CI flow in another PR, just ignore it.
Which issue does this PR close?

Partially finished #5316

Rationale for this change

Now an `Operator` can be serialized and deserialized via pickle; as a result, the `Operator` class can be used in multiprocessing contexts.

What changes are included in this PR?

In this PR, we store the arguments of the `Operator` constructor, so that we can recover the `Operator` during pickle deserialization.

Are there any user-facing changes?

- Implement pickle for `Operator` in the Python binding.
- Allow `Operator` to be used in multiprocessing contexts.

@Zheaoli I'm not very familiar with the Python binding, PTAL at the PR, thanks!