If you were willing to add to the documentation, that would be highly appreciated (although I am aware of other efforts to make a more seamless python/yarn experience).
In https://github.com/dask/knit/issues/123 you have a nice working example of how to use the client for a cluster started in the same process. To reuse that cluster from other python processes, you can find the address to connect to from cluster.local_cluster.scheduler_address, and use this in your call to Client().
>>> cluster = DaskYARNCluster(env='/home/hadoop/reqs/dvss.zip')
>>> cluster.start(n_workers=3, memory=500, cpus=5, checks=True)
>>> cluster.local_cluster.scheduler_address
"tcp://10.56.73.28:8786"
In other sessions:
>>> client = distributed.Client("tcp://10.56.73.28:8786")
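Any work submitted through that client then runs on the workers in the YARN containers. A minimal sketch (the toy task is just for illustration):
>>> future = client.submit(lambda x: x + 1, 41)  # runs on a remote worker
>>> future.result()
42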
@martindurant Well, my example is not quite working, so I'm sure I'm doing something wrong. I have this problem: after my application completes, it is being killed, with some scary error messages:
                        url              referrer session_id  \
0  http://a.com/articles/15    http://google.com/        yyy
1  http://a.com/articles/17    http://google.com/        xxx
2   http://a.com/articles/7  http://facebook.com/        yyy

                   ts  hour customer
0 2017-09-17 00:09:00     0    a.com
1 2017-09-17 00:21:00     0    a.com
2 2017-09-17 00:00:00     0    a.com
18/04/14 22:46:44 INFO impl.YarnClientImpl: Killed application application_1523740467851_0005
2018-04-14 22:46:45,108 - asyncio - ERROR - Future exception was never retrieved
future: <Future finished exception=CommClosedError('in <closed TCP>: Stream is closed',)>
Traceback (most recent call last):
File "/home/hadoop/conda/envs/dvss/lib/python3.6/site-packages/distributed/comm/tcp.py", line 179, in read
n_frames = yield stream.read_bytes(8)
File "/home/hadoop/conda/envs/dvss/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/hadoop/conda/envs/dvss/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/home/hadoop/conda/envs/dvss/lib/python3.6/site-packages/distributed/comm/tcp.py", line 200, in read
convert_stream_closed_error(self, e)
File "/home/hadoop/conda/envs/dvss/lib/python3.6/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
raise CommClosedError("in %s: %s" % (obj, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
2018-04-14 22:46:45,109 - asyncio - ERROR - Future exception was never retrieved
future: <Future finished exception=CommClosedError('in <closed TCP>: Stream is closed',)>
Traceback (most recent call last):
File "/home/hadoop/conda/envs/dvss/lib/python3.6/site-packages/distributed/comm/tcp.py", line 179, in read
n_frames = yield stream.read_bytes(8)
File "/home/hadoop/conda/envs/dvss/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/hadoop/conda/envs/dvss/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/home/hadoop/conda/envs/dvss/lib/python3.6/site-packages/distributed/comm/tcp.py", line 200, in read
convert_stream_closed_error(self, e)
File "/home/hadoop/conda/envs/dvss/lib/python3.6/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
raise CommClosedError("in %s: %s" % (obj, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
You see it prints out those 3 records from my dataframe, and then 💣. Ouch. This is not what I want to see.
And yes, once I get things right, I'd be happy to PR a documentation example!
"KILLED" sounds like an action on behalf of yarn. I'm afraid you will need to look in the resourcemanager logs to see why that happened. Was there any message from the process that launched the dask_yarn cluster?
The comm-closed errors you see above are all in response to having the worker disappear while, apparently, communication was in progress.
Well, once my script ends, the cluster exits, right? And this is something that the cluster does when it exits. It kills the application:
https://github.com/dask/knit/blob/24a5e33ab5d6b4235cadf79f7e4244eb9935e9b4/dask_yarn/core.py#L73
and
https://github.com/dask/knit/blob/24a5e33ab5d6b4235cadf79f7e4244eb9935e9b4/dask_yarn/core.py#L173
It looks like I just don't know the proper way to finish this application.
Ah, so you need your "master" script, i.e., the one that starts the cluster, to hang around. A while True: time.sleep(1) loop would do... We don't currently have a way to put the scheduler process into a persistent state detached from the process that launches it.
This is why I need an example. It's not obvious how I should do things.
So, ok, now I have this script that starts a YARN cluster:
from time import sleep
import dask_yarn
cluster = dask_yarn.DaskYARNCluster(env='/home/hadoop/reqs/dvss.zip', lang='en_US.UTF-8')
cluster.start(n_workers=3, memory=5*1024, cpus=5)
try:
    while True:
        print('Cluster scheduler: {}.'.format(cluster.scheduler_address))
        sleep(20)
except KeyboardInterrupt:
    print('Cluster is done.')
Now, how do I make my original script use this cluster?
Presumably this prints out a string of the form "tcp://ip.address:port"? That's the thing that you want to pass to Client() in your other script.
Ok, so the other script becomes just this:
import sys
import dask.dataframe as dd
from dask.distributed import Client
if __name__ == '__main__':
    address = sys.argv[1]
    print('Connecting to address: {}.'.format(address))
    client = Client(address=address)
    df = dd.read_parquet('s3://parsely-public/jbennet/daskvsspark/events/100-24/*/*/*/*/*/*.parquet')
    print('-' * 20)
    print(df.head(3))
and I run it providing my scheduler address: python t2.py tcp://10.21.0.219:35225.
This seems to work. But I still have a ton of questions!
1. A YARN application is created when I start the DaskYARNCluster, and killed when I stop the cluster. So this is completely different from how things are done in Spark, where one script run would correspond to one YARN application. Is that correct?
2. Is there a way to have the application end up FINISHED, not KILLED?
Indeed, this method is not the same as the Spark "submit" one - the scheduler is not even running within yarn, only the workers. However, it is still useful. Whether you keep the cluster going, or start a new cluster per user or per script, that is a design question - you could keep the cluster up indefinitely if you wanted.
The dask workers do not have a convenient way to exit. They can be stopped, but there is a controlling "nanny" process tasked with creating a new worker process when that happens, and there is no way to tell the "nanny" to stop doing that. Perhaps we could circumvent the nanny, but killing the container or the application seems to achieve the stated purpose. There is a method in the ApplicationMaster for it to cleanly shut itself down, but calling it proved problematic.
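In practice, then, teardown from the launching script is just an explicit call on the cluster object when you are done with it. A rough sketch, assuming DaskYARNCluster exposes a close() method that ends up at the kill call linked above (the exact name is in dask_yarn/core.py):
import atexit
import dask_yarn

cluster = dask_yarn.DaskYARNCluster(env='/home/hadoop/reqs/dvss.zip')
cluster.start(n_workers=3, memory=5 * 1024, cpus=5)

# Assumed API: close() stops the workers and kills the YARN application,
# which is why the application is reported as KILLED rather than FINISHED.
atexit.register(cluster.close)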
A few small comments:
The knit repo includes the dask_yarn (python) package https://github.com/dask/knit/tree/master/dask_yarn
@mrocklin @jcrist Replacement to knit? What will it be?
It is taking shape at https://github.com/jcrist/skein, a re-imagining of knit from the ground up. Hopefully all dask-on-yarn users will become happy.
Oh, I see! I think I visited that repo before in my stumbling around, and was turned away by the readme.
There is no promised timeline yet, but there is definitely development there, unlike, for the most part, here on knit :)
I looked all over Knit documentation, but I could not find an example of running a dask/python script on yarn.
Let's say I already have an application that runs with a dask.distributed client:
t3.py
How do I run this with the YARN scheduler?
P.S. I am aware that this is a dummy question, sorry.