KatharineShapcott closed this issue 3 years ago
Hi! Thanks for the detailed bug report! This is really strange - running your code locally on my workstation hangs and then crashes the entire session:
<ParallelMap> INFO: Log information available at http://127.0.0.1:8787/info/main/workers.html
0% 0/200 [00:19<?]Killed
$ /opt/software/conda/envs/acme/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 24 leaked semaphore objects to clean up at shutdown
So the problem does not seem to be SLURM-specific. I'll do some digging into what goes wrong here...
It's definitely something to do with that "large array", which isn't actually that big - that's why I was sending it in the first place. I wonder if it depends on how many jobs you have running? Maybe it's multiplying the size of the array by the number of jobs in local memory or something? Worst-case scenario, you could detect the input sizes and tell users to load arrays over size X from disk.
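Something as simple as this kind of check might already help (just a sketch - the 1 GB cutoff and the helper name `check_arg_sizes` are made up):

```python
import numpy as np

MAX_ARG_BYTES = 1 * 1024**3   # hypothetical 1 GB cutoff

def check_arg_sizes(*args, **kwargs):
    """Refuse NumPy-array arguments that are large enough to cause trouble."""
    all_args = {**{f"arg{i}": a for i, a in enumerate(args)}, **kwargs}
    for name, value in all_args.items():
        if isinstance(value, np.ndarray) and value.nbytes > MAX_ARG_BYTES:
            raise ValueError(
                f"Argument '{name}' is {value.nbytes / 1e9:.1f} GB; "
                "please save it to disk and load it inside your function instead."
            )
```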
But what's also odd is that for my code at least this wasn't happening a few months ago. It worked fine to send this size of array.
Hm, this is really strange. The offending part is in `compute` (in `backend.py`): `firstArg = db.from_sequence(...)`. This takes forever and eventually crashes if the number of workers is large enough.
Initially I thought that maybe enforcing `npartitions=ncalls` in the creation of the dask bag might be the issue (cf. #6), but letting dask partition the bag on its own does not change anything. I've also tried using generators in the `from_sequence` call (same result) and started playing around with `from_delayed` (which behaves very unintuitively for 2d arrays...). What puzzles me the most is that it worked before, as you said. Do you happen to know which version of ACME you used (commit hash)?
No concrete solution for now, but one thing is clear: copying input args `ncalls` times and shoving them into lists is less than optimal. This might be okay for scalars (which were my one and only test case...), but even moderately sized NumPy arrays can overflow the memory of the caller if copied often enough (take a 4 MB array and scale it to 5000 workers; suddenly the ACME caller has to deal with a 20 GB input...). I will definitely modify this to not have those lists hanging around. The dask-bag situation requires some additional meditating, though...
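To make the failure mode concrete, a toy sketch of the pattern (not the actual ACME code; `user_func` is a stand-in for the user-provided function):

```python
import numpy as np
import dask.bag as db

arr = np.zeros((1000, 500))     # ~4 MB float64 array used as an input argument
ncalls = 5000                   # number of parallel calls (one per worker slot)

# The problematic pattern: the list only holds 5000 references to the *same*
# array, but when the bag is turned into a task graph each one-element
# partition is serialized separately - roughly 4 MB x 5000 (~20 GB) of
# pickled data has to pass through the caller/scheduler.
args = [arr] * ncalls
bag = db.from_sequence(args, npartitions=ncalls)

def user_func(a):
    # stand-in for the user-provided function
    return a.sum()

mapped = bag.map(user_func)     # calling .compute() here is what takes forever / gets killed
```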
Huh, that's really odd... Okay, good to know - I was kinda thinking dask might have done some clever memory-sharing thing between its jobs, but I guess it doesn't.
My version of acme is old. Did you change something in the cluster that could cause an issue with fewer jobs than before? Because I would have sworn this was working just before I went on holiday at the beginning of February with up to 400 jobs, and now 26 doesn't even work...
ACME:
commit 95403e4632fd878e69c6fdd0ff82f0b787765044 (HEAD -> main, origin/main, origin/HEAD)
Merge: 1d24c6f 3979789
Author: Stefan Fuertinger pantaray@users.noreply.github.com
Date: Tue Dec 29 14:18:14 2020 +0100
My dask stuff is:
Hi Katharine!
I just pushed a commit that uses a completely revamped argument distribution mechanic: instead of relying on bone-headed copies of variables that are needed by all workers, we now broadcast the respective variable to the entire worker cluster and create a list of references to the resulting future that gets distributed when calling a user-provided function. I've tested the new mechanic with the minimal (not-)working example you provided (thank you again for cooking this up - that was tremendously helpful!): ACME processes the example both with SLURM and using a `LocalCluster` without hiccups or memory flooding. If you have a chance, please feel free to test-drive the new version (commit 9791cb7 in the `dev` branch). If this works for you, I'll merge into master.
Hi Katharine! Did you have a chance already to test the new version of ACME? I'm also almost done with the emergency pickling patch. As soon as we're sure everything works, I'll merge into main :)
Sorry I haven't yet because I rewrote all my code to avoid sending arrays! I'll try to squeeze it in ASAP.
All good, thank you for testing!
Hi Stefan,
My jobs crashed on the dev branch with the same code that works when I use it on the main branch. It looks like the first job of a worker returns successfully and then it crashes on each job called after that. I'm using the modified code where I load a different subset of the data each time (i.e., the arrays aren't being sent via SLURM) and it doesn't seem to like it.
Are you maybe making some assumptions about the size of the output now?
First error message in slurm log:
distributed.worker - WARNING - Compute Failed
Function: func_wrapper
args: (MFProbabilistic())
kwargs: {'dataset': 'reuters', 'train_size': 0.7935033498304951, 'outDir': '/mnt/hpx/home/shapcottk/ACME_20210421-094302-881385', 'outFile': 'comparison_classifier_157.h5', 'taskID': 157, 'userFunc': <function comparison_classifier at 0x7feb9731d5e0>}
Exception: ValueError('operands could not be broadcast together with shapes (1296,) (18368,) ')
This is then repeated, as if the first (1296,) array was never cleared:
Exception: ValueError('operands could not be broadcast together with shapes (1296,) (12678,) ')
Exception: ValueError('operands could not be broadcast together with shapes (1296,) (6987,) ')
Exception: ValueError('operands could not be broadcast together with shapes (1296,) (10174,) ')
On another worker it looks like this (also repeated):
Exception: ValueError('operands could not be broadcast together with shapes (1979,) (7670,) ')
ACME error:
<esi_cluster_setup> SLURM workers ready: 0/35 [elapsed time 00:00 | timeout at 01:00]
<esi_cluster_setup> Requested job-count 50 exceeds `n_jobs_startup`: waiting for 35 jobs to come online, then proceed
<esi_cluster_setup> SLURM workers ready: 50/None [elapsed time 00:07 | timeout at 01:00]
<ParallelMap> INFO: Attaching to global parallel computing client <Client: 'tcp:// .4:45339' processes=50 threads=50, memory=400.00 GB>
<esi_cluster_setup> Cluster dashboard accessible at http:// .4:8787/status
<ParallelMap> INFO: Preparing 200 parallel calls of `comparison_classifier` using 50 workers
<ParallelMap> INFO: Log information available at /mnt/hpx/slurm/shapcottk/shapcottk_20210421-094254
25% |██▌ | 50/200 [01:20<04:00]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~/python/filter_net_paper/scripts/filter_net_figure7.py in <module>
11 with ParallelMap(comparison_classifier, clf, dataset=dataset, train_size=train_sizes,
12 n_inputs=n_trys, write_worker_results=write) as pmap:
---> 13 results = pmap.compute()
/mnt/pns/home/shapcottk/python/filter_net_paper/scripts/acme/backend.py in compute(self, debug)
422
423 # Finally, raise an error and get outta here
--> 424 raise RuntimeError(msg)
425
426 # If wanted (not recommended) collect computed results in local memory
RuntimeError: <ACMEdaemon> Parallel computation failed: 150/200 tasks failed or stalled.
Concurrent computing scheduler log below:
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp:// :45339
distributed.scheduler - INFO - dashboard at: :8787
distributed.scheduler - INFO - Register worker <Worker 'tcp:// .22:43819', name: 19, memory: 3, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp:// .22:43819
...
Another possibility is that the MFProbabilistic() arg persists between calls; that would also cause a crash when it is used with different input sizes.
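For example, something with cached state like this toy class (obviously not what MFProbabilistic really does) would produce exactly those broadcast errors:

```python
import numpy as np

class MFProbabilistic:
    """Toy stand-in: caches a weight vector sized to the first training set."""
    def fit(self, X):
        if not hasattr(self, "w"):
            self.w = np.zeros(X.shape[0])   # state from the first call sticks around
        return self.w + X                   # later calls with a different size blow up

clf = MFProbabilistic()
clf.fit(np.ones(1296))      # fine
clf.fit(np.ones(18368))     # ValueError: operands could not be broadcast together
```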
Hi Katharine!
Thanks so much for testing this! I think your hunch is right here: the `MFProbabilistic()` arg might be the culprit. The updated code does not use dask bags any more for argument distribution (to avoid memory overflow) but instead broadcasts (presumably identical) args to the worker cluster for use by all workers (i.e., by reference, no copies are generated). The relevant code is:
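In essence (a condensed sketch of the idea, not the verbatim source; `user_func` stands in for the user-provided function):

```python
import numpy as np
from dask.distributed import Client

def user_func(call_id, data):
    # stand-in for the user-provided function
    return float(data.sum()) + call_id

if __name__ == "__main__":
    client = Client()                     # in ACME: the SLURM or LocalCluster client
    ncalls = 200
    big_arg = np.random.rand(60000, 784)  # e.g., the large array from your example

    # scatter once: each worker holds one copy, the caller only keeps a future
    arg_future = client.scatter(big_arg, broadcast=True)

    # the future is passed by reference to every call and resolved on the workers
    futures = client.map(user_func, range(ncalls), data=arg_future)
    results = client.gather(futures)
```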
I assume `MFProbabilistic()` is called repeatedly to randomly select data? The snippet above would not do that.
Hi Stefan,
Yes, MFProbabilistic() was the problem. It's a classifier that was then being fit multiple times (I need to handle that correctly in the future).
Looks like it's working! I sent a very large array of 60000x784 to both 50 and 189 workers from the 8GBVIS queue and had no problems at all. Thanks so much for fixing that! Best, Katharine
Hey Katharine!
Thank you for testing this! Great to hear the changes work :) I think I'll include a warning in the code: if `ParallelMap` is launched with a single callable object as input arg to the user-provided function, the callable will only be executed once for all workers. I think there are scenarios where this makes sense (e.g., loading a file with global parameters), but there are other circumstances (like the one you ran into) where this behavior leads to problematic side effects. I don't think there is a "right" or "wrong" way here - unless you have any objections, I'd leave the default as it is (a single callable is called once). If it is to be called separately by each worker, it has to be provided explicitly, like `[MFProbabilistic] * n_jobs` (although I'm not sure if NumPy would then use the same seed for all workers...).
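To make the distinction concrete, roughly (toy stand-ins for your classifier and user function; the exact argument-splitting rules of `ParallelMap` may differ in details):

```python
import numpy as np
from acme import ParallelMap

class MFProbabilistic:                        # toy stand-in for the real classifier
    def fit(self, X):
        self.w = np.zeros(X.shape[1])
        return self

def comparison_classifier(clf, train_size):   # toy stand-in for the user function
    X = np.random.rand(int(1000 * train_size), 10)
    clf.fit(X)
    return clf.w.size

n_jobs = 200
train_sizes = np.linspace(0.1, 0.9, n_jobs)

# A single callable object is broadcast: every call works on the *same* instance
with ParallelMap(comparison_classifier, MFProbabilistic(),
                 train_size=train_sizes, n_inputs=n_jobs) as pmap:
    results = pmap.compute()

# A list of callables: each call receives (and fits) its own fresh instance
with ParallelMap(comparison_classifier,
                 [MFProbabilistic() for _ in range(n_jobs)],
                 train_size=train_sizes, n_inputs=n_jobs) as pmap:
    results = pmap.compute()
```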
Hello! My code executes fine when running from an interactive node, but from a SLURM interactive session in the 8GBVIS queue it doesn't work. Here's the output in my terminal when it fails:
It hangs and then, after some minutes, I see the 'Killed' message and the jobs end. If I try to do Ctrl+C while it's hanging (e.g. #23), the jobs do not actually end. This repeatedly happens if I try to reuse the same client for a second batch of jobs (as I did in the code above).
I now managed to get the same behaviour using some test code where I'm sending a large array as an input to the function. This causes issues from both SLURM and the interactive node:
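The test code boils down to something like this (a stripped-down sketch, not my actual script):

```python
import numpy as np
from acme import ParallelMap

def use_big_array(idx, data):
    # stand-in for my real function; `data` is the large array sent along with every call
    return float(data[idx % data.shape[0]].sum())

if __name__ == "__main__":
    data = np.random.rand(60000, 784)        # the ~376 MB "large array"

    with ParallelMap(use_big_array, list(range(200)), data=data,
                     n_inputs=200) as pmap:
        results = pmap.compute()
```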
Here's the output this code causes:
Here's an example slurm output: slurm-3350545.out.txt