IntelPython / DPPY-Spec

Draft specifications of DPPY

__partitioned__ protocol for partitioned and distributed data containers #3

fschlimb opened this issue 3 years ago

fschlimb commented 3 years ago

The current state of the specification can be found here: https://github.com/IntelPython/DPPY-Spec/blob/draft/partitioned/Partitioned.md

fschlimb commented 2 years ago

> Don't we lose the original positioning of the partitions in the grid in that case? That might be useful for consumers that are more concerned about the positioning than about mapping locations to lists of partitions.

I agree this must be possible, and I don't think we lose the information: we provide start and shape for each partition, which implicitly encodes its position in the grid. The question is which case we think is more common. My experience shows that getting partitions per rank/location is the common case.
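
For illustration, a consumer could recover a partition's grid position from those fields roughly as follows (a sketch only; the exact layout of the `__partitioned__` dict and the uniform tiling are assumptions):

```python
# Sketch: recover grid positions from per-partition 'start' and 'shape'.
# Assumes obj implements __partitioned__ and that the tiling is uniform,
# i.e. all partitions along a dimension have the same extent.
parts = obj.__partitioned__['partitions']
part_iter = parts.values() if isinstance(parts, dict) else parts

grid = {}
for part in part_iter:
    # grid coordinate = element offset // partition extent, per dimension
    pos = tuple(off // max(extent, 1)
                for off, extent in zip(part['start'], part['shape']))
    grid[pos] = part
```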

> As for uniform objects for futures and locations, consumers may or may not want to check the types of these futures and locations and call the respective APIs on them. A consumer may call future.result() (in a Dask environment) or ray.get(future) (in a Ray environment). Something similar applies to locations. However, if we require consumers to throw an exception when they do not support the information provided by the protocol (explicit checks), that would be ok.

Agree. That's what I also had in mind. If the consumer can deal with neither the handle nor the result of get(handle), it should throw an exception (or use whatever other error handling it prefers).
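
A consumer-side helper could look roughly like this (a sketch only; the handle types checked and the fallback order are assumptions, not part of the protocol):

```python
def materialize(handle, get=None):
    # Try the handle types this consumer knows about; otherwise fall back
    # to the producer-provided getter, and finally raise.
    try:
        import ray
        if isinstance(handle, ray.ObjectRef):
            return ray.get(handle)
    except ImportError:
        pass
    try:
        from distributed import Future
        if isinstance(handle, Future):
            return handle.result()
    except ImportError:
        pass
    if callable(get):
        # 'get' as provided by __partitioned__
        return get(handle)
    raise TypeError(f"unsupported partition handle: {type(handle)}")
```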

> Are you suggesting this to allow parallel execution?

> Something similar to ray.get(list_of_futures) and dask_client.gather(list_of_futures).

I am trying to understand why you suggest this. A list comprehension on the user side is easy to write: `[p['get'](x) for x in list_of_futures]`.

YarShev commented 2 years ago

> Don't we lose the original positioning of the partitions in the grid in that case? That might be useful for consumers that are more concerned about the positioning than about mapping locations to lists of partitions.

> I agree this must be possible, and I don't think we lose the information: we provide start and shape for each partition, which implicitly encodes its position in the grid. The question is which case we think is more common. My experience shows that getting partitions per rank/location is the common case.

Mapping partitions per rank/location also seems like the common case to me, but I am not sure what the protocol would look like in that case. Can you give an example? Since the protocol is going to expose exactly the same underlying partition structure (without any repartitioning), the information it currently provides seems more natural to me because consumers can see the partitioning at once.
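
If a per-location mapping simply means a derived view like the one below, consumers can already compute it from the current information (a sketch under the same assumed dict layout as above; a per-partition 'location' list of ranks/addresses is also an assumption):

```python
from collections import defaultdict

# Group partitions by the rank/worker that owns them, keeping the original
# grid information ('start'/'shape') intact in each entry.
by_location = defaultdict(list)
for part in obj.__partitioned__['partitions'].values():
    for loc in part['location']:
        by_location[loc].append(part)
```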

> Are you suggesting this to allow parallel execution?

> Something similar to ray.get(list_of_futures) and dask_client.gather(list_of_futures).

> I am trying to understand why you suggest this. A list comprehension on the user side is easy to write: `[p['get'](x) for x in list_of_futures]`.

Passing futures one at a time to .get(...) instead of passing a list of futures can make a dramatic performance difference.

Ray

```
import ray
ray.init()

@ray.remote
def foo():
    from time import sleep
    sleep(10)
    return 1

%%time
ray.get([foo.remote() for _ in range(10)])
Wall time: 20.3 s
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

%%time
[ray.get(foo.remote()) for _ in range(10)]
Wall time: 1min 40s
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Dask

```
from distributed.client import Client
c = Client()

def foo():
    from time import sleep
    sleep(10)
    return 1

%%time
c.gather([c.submit(foo) for _ in range(10)])
Wall time: 10 s
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

%%time
[c.gather(c.submit(foo)) for _ in range(10)]
Wall time: 30.1 s
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

So we should provide a more general API that can accept multiple futures at once.
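
Such a more general get could look roughly like this (just a sketch; the factory names and the dispatch-on-list behavior are illustrative, not part of the draft spec):

```python
def make_ray_get():
    import ray
    def get(handles):
        if isinstance(handles, (list, tuple)):
            return ray.get(list(handles))   # one batched call, resolved concurrently
        return ray.get(handles)
    return get

def make_dask_get(client):
    def get(handles):
        # Client.gather accepts a single future or a collection of futures
        return client.gather(list(handles) if isinstance(handles, (list, tuple)) else handles)
    return get
```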