kkraus14 opened this issue 6 years ago
Hrm, so this works fine for me both on master and the latest release:
In [1]: from dask.distributed import Client, LocalCluster
...: import pandas as pd
...: import dask.dataframe as dd
...: cluster = LocalCluster(processes=False)
...: cpu_worker = cluster.workers[0]
...: cpu_worker.name = 'cpu'
...: cpu_worker.set_resources(CPU=80)
...: client = Client(cluster)
...: pdf = pd.DataFrame({"a": [1,2,3], "b": [4,5,6]})
...: test_df = dd.from_pandas(pdf, npartitions=2)
...: test_df.compute(resources = {tuple(test_df.__dask_keys__()): {'CPU': 1}})
...:
Out[1]:
   a  b
0  1  4
1  2  5
2  3  6
I might also suggest the following test, which sets up resources and names when creating the workers and verifies that tasks are allocated appropriately by checking the structured log:
from dask.distributed import Client, LocalCluster
import pandas as pd
import dask.dataframe as dd
cluster = LocalCluster(n_workers=0, processes=False)
client = Client(cluster)
alice = cluster.start_worker(resources={'CPU': 80}, name='alice')
bob = cluster.start_worker(name='bob')
pdf = pd.DataFrame({"a": [1,2,3], "b": [4,5,6]})
ddf = dd.from_pandas(pdf, npartitions=2)
ddf.compute(resources = {tuple(ddf.__dask_keys__()): {'CPU': 1}})
assert alice.log
assert not bob.log
The exception is odd. If you were using something other than LocalCluster I would guess that you had a version mismatch between your workers or between your workers and client, but given that everything is local I don't see how this could be. How did you install Dask? I don't suppose you can provide a conda environment.yml or something similar that reproduces the problem? (My guess is that this would be challenging, but I thought I'd ask anyway.)
I was on Dask 0.17.2 and just confirmed the exception issue is resolved when I upgraded to Dask 0.18.1. Thanks!
I'm planning on chaining together a number of functions; is there any way to specify the resources when calling the functions, as opposed to when calling .compute?
I agree that that would be valuable, but currently no: resources are specific to the distributed scheduler, while collections like dask.delayed and dask.dataframe are scheduler-agnostic. This is something that could be improved, though. I don't know how at the moment, but there is likely a better way around this.
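For chaining individual functions, the closest thing today is the client/futures API, where resources can be attached per call. Below is a rough sketch of that pattern, not code from this thread: the scheduler address, the CPU/GPU resource names, and the load_data/train functions are all assumptions.

from dask.distributed import Client

client = Client("scheduler-address:8786")  # address is a placeholder

def load_data(path):   # placeholder for a CPU-bound step
    ...

def train(data):       # placeholder for a GPU-bound step
    ...

# resources can be attached per call through the futures API,
# assuming the workers were started with matching CPU/GPU resources
data = client.submit(load_data, "/path/to/some/file", resources={'CPU': 1})
model = client.submit(train, data, resources={'GPU': 1})
result = model.result()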
So for dd.read_csv, if I call __dask_keys__() it only returns the from-delayed tasks, while it looks like there are also pandas_read_text and read-block tasks which end up getting scheduled on the GPU workers. Is there a different function or a snippet which, given an object, returns every key that we need to define the resources for?
I.e.:
test = dd.read_csv("/path/to/some/file")
resources = {tuple(test.getallkeys()): {'CPU': 1}}
test.compute()
Hrm, short term, list(test.dask) would probably serve your needs. This would include all keys that are used to create this dataset.
Hmm, I'd expect the following to work, but it's still scheduling tasks on the GPU workers, including the from-delayed tasks as well:
test = dd.read_csv("/path/to/some/file")
resources = {tuple(test.dask): {'CPU': 1}}
test.compute()
Hrm, can you try passing compute(optimize_graph=False)?
Still the same behavior. (Note: my example above forgot to specify resources in the compute call, but I am in fact setting it while testing.)
I'll take a look sometime today.
OK, it looks like this is failing to support tuple-based keys in the .get path. Should be an easy fix.
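For context, an illustrative sketch (not code from the fix) of the two key shapes involved: dask graph keys may be plain strings or tuples, and dataframe partitions use the tuple form, which is what the .get path was tripping over here. The hashed names below are made up.

string_key = "my_function-9f8e7d2c"        # typical key for a delayed/future task
tuple_key = ("from-delayed-1a2b3c4d", 0)   # typical key for a dataframe partition
# resource restrictions expressed against tuple keys like the second form were
# not being handled on the .get code path mentioned above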
Short term you could do this as a workaround:
result = client.compute(df).result()
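Presumably the resource restrictions would be passed along there as well; a sketch of that combination, reusing the resource spec from the earlier snippets (client and df are assumed to be defined as above):

# sketch: same workaround, with the resources argument from the earlier examples
future = client.compute(df, resources={tuple(df.__dask_keys__()): {'CPU': 1}})
result = future.result()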
My apologies for the dust here. Most users of resources historically have been doing more custom computations (delayed, futures) and have been using the client API. The code paths around using them with the standard collections (array, dataframe) have not been as well travelled. I'll push a fix for this in a bit.
If you use optimize_graph=False then https://github.com/dask/distributed/pull/2131 should solve your immediate issue. There is still a bit of work to clear up this situation generally though and make it more usable.
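Once that lands, the combination would look something like the sketch below (the path is a placeholder and the resource spec mirrors the earlier snippets). Skipping graph optimization keeps the keys in test.dask aligned with the keys the scheduler actually receives, so the restrictions can match up.

# sketch, assuming the fix above is applied
test = dd.read_csv("/path/to/some/file")
test.compute(resources={tuple(test.dask): {'CPU': 1}}, optimize_graph=False)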
@mrocklin Unfortunately I have some pretty hard time constraints for what I'm working on where creating 8 dask workers with a single GPU visible is working well enough for my needs currently, but I'll hopefully have time to revisit this late next week to continue troubleshooting with you towards a solution. Apologies for the delay!
It's just fine. This has been a useful exercise to flush out some bugs, both technical bugs and usability bugs, with using resources with collections.
Good luck!
I'm trying to specify resources for built-in dask functions such as dd.read_csv, with the end goal of running certain functions on "CPU workers" and other functions on "GPU workers". Here's a minimal example of trying to force dd.read_csv to run only on my "CPU worker":
This returns the following:
It would be great if you could specify resources as you create tasks, as opposed to when computing them, similar to how you can with client.submit. I.e.:
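A purely hypothetical sketch of the kind of call being asked for; dd.read_csv does not accept a resources= keyword, this only illustrates the request:

# HYPOTHETICAL API, for illustration only; not something dask provides
test = dd.read_csv("/path/to/some/file", resources={'CPU': 1})
test.compute()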