Closed: KatharineShapcott closed this issue 3 years ago
Hi Stefan,
Since we were talking yesterday about jobs crashing after a long runtime, I thought I should report this. I didn't even realise that some classifiers output sparse predictions and others don't! Unfortunately I got a very unhelpful error message about this, so it took me some time to figure out:
In my case 'issparse' isn't even enough to catch it, because sometimes they're sparse within a list! Of course this doesn't happen without writing to disk, so my code ran fine whenever I tested it without acme. Maybe there could be a more informative error message if this is happening? Or we could return that part of the data so the user can see what's causing the crash?
Thanks! Katharine
PS I'm happy to update the readme with more details about what can and can't be used as an output.
Hey Katharine!
That is indeed frustrating and extremely difficult to debug. I agree, this should definitely be remedied somehow. The first step is catching the failed HDF5 write and logging the error, so the user immediately sees the problem. So, I guess we might want to try/except saving the results in HDF5: if that does not work, log the error and attempt to pickle them; if that does not work either, return them in memory. Would that make sense?
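In code, the idea would be roughly the following (a sketch only; save_result and the chosen exception types are illustrative, not the eventual implementation):

import logging
import pickle

import h5py

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("save-sketch")

def save_result(result, h5name, pklname):
    # 1) preferred: HDF5
    try:
        with h5py.File(h5name, "w") as h5f:
            h5f.create_dataset("result", data=result)
        return None
    except (TypeError, ValueError) as h5exc:
        log.error("HDF5 write failed (%s), falling back to pickle", h5exc)
    # 2) fallback: pickle
    try:
        with open(pklname, "wb") as pkl:
            pickle.dump(result, pkl)
        return None
    except Exception as pklexc:
        log.error("Pickling failed too (%s), returning result in memory", pklexc)
    # 3) last resort: hand the result back to the caller
    return result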
PS I'm happy to update the readme with more details about what can and can't be used as an output.
That would be very much appreciated. Thank you!
Try/except with a better error message sounds like a great idea. Maybe you can also return which of the results (if it's a list or tuple) the crash occurred on? I'm not sure how the user would handle occasional pickling, because they'd have to load some data differently than the rest. I think for usability the user should be able to set a flag to write either all HDF5 or all pickle. Then, if I know my outputs don't work with HDF5, I can switch my code to pickle. But an emergency pickle would really help with debugging a weird problem like this.
Try/except with a better error message sounds like a great idea. Maybe you can also return which of the results (if it's a list or tuple) the crash occurred on?
Yes, absolutely, good point!
I'm not sure how the user would handle occasional pickling, because they'd have to load some data differently than the rest. I think for usability the user should be able to set a flag to write either all HDF5 or all pickle. Then, if I know my outputs don't work with HDF5, I can switch my code to pickle. But an emergency pickle would really help with debugging a weird problem like this.
Hm, yes, that's right. Having (potentially) hundreds of HDF5 files with the occasional pickle-dump mixed in between does not sound too pleasant from a data collecting perspective... I think the flag is a great idea! But then I'm wondering if an emergency in-memory return might be better/easier to understand than an unsolicited pickle-dump. What do you think?
Hm, yes, that's right. Having (potentially) hundreds of HDF5 files with the occasional pickle-dump mixed in between does not sound too pleasant from a data collecting perspective... I think the flag is a great idea! But then I'm wondering if an emergency in-memory return might be better/easier to understand than an unsolicited pickle-dump. What do you think?
Tried that; if the data is too big, the job never returns properly, which is also hard to debug. "Too big" is also hard to define, seems to be dependent on how busy the cluster is! I think a pickle is safer in this case. Also, pickle might give a nicer error message if there's still a problem there?
Okay, good point - pickle it is then! If the output is large, returning things always runs the risk of killing the parent session anyway. I guess we have a plan, then!
"Too big" is also hard to define, seems to be dependent on how busy the cluster is!
I'm still banging my head against the desk over these "too big" dask bags (cf. #23). I thought I had a solution yesterday, but it only worked for one array in the input spec of the user function. Which would be fine, but I don't know why it works for one array and not for more. From what I'm seeing, dask starts expanding the constructed bags before dispatching them to the workers, flooding the caller's memory, which in turn causes SLURM to unceremoniously kill the parent before the workers have received any data.
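A toy version of the pattern I mean (a sketch with deliberately small arrays so it actually runs; scale them up and SLURM kills the parent session):

import numpy as np
import dask.bag as db

# All arrays backing the bag live in the caller's memory and are only
# shipped to the workers once compute() serializes the partitions
arrays = [np.zeros((1000, 1000)) for _ in range(50)]  # ~400 MB held by the caller
bag = db.from_sequence(arrays, npartitions=50)
means = bag.map(lambda arr: arr.mean()).compute()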
Hi Katharine!
I just pushed the (hopefully) final commit to the pickle_save branch (d524867). It includes the discussed emergency pickling mechanism plus a new write_pickle keyword. In addition, I included a custom exception handler that should catch CTRL + C keyboard interrupts and perform a graceful shutdown of any running dask client and workers, to avoid detaching SLURM jobs from the managing computing client (cf. #23). Please feel free to test-drive the changes whenever you have time; then I'll merge into main.
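For reference, using the new keyword would look roughly like this (a sketch; the exact semantics may still change before the merge):

from acme import ParallelMap

def identity(x):
    return x

# write_pickle=True forces every worker result into a pickle file instead
# of HDF5 (illustration only; see the pickle_save branch for the real thing)
with ParallelMap(identity, list(range(4)), write_pickle=True) as pmap:
    results = pmap.compute()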
Thank you again for all your help!
Just tested the keyboard interrupt thing; it doesn't seem to work, from Jupyter at least. Jobs are still running 2 minutes later.
#%% try multilabel 200 times...
<esi_cluster_setup> Requested job-count 50 exceeds `n_jobs_startup`: waiting for 35 jobs to come online, then proceed
<esi_cluster_setup> SLURM workers ready: 0/35 [elapsed time 00:00 | timeout at 01:00]
<esi_cluster_setup> SLURM workers ready: 50/None [elapsed time 00:07 | timeout at 01:00]
<ParallelMap> INFO: Attaching to global parallel computing client <Client: 'tcp:// .4:33113' processes=50 threads=50, memory=400.00 GB>
<ParallelMap> INFO: Preparing 200 parallel calls of `comparison_classifier` using 50 workers
<ParallelMap> INFO: Log information available at /mnt/hpx/slurm/shapcottk/shapcottk_20210421-110115
<esi_cluster_setup> Cluster dashboard accessible at http:// .4:8787/status
  0% |          | 0/200 [00:00<?]
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
~/python/filter_net_paper/scripts/filter_net_figure7.py in <module>
11 with ParallelMap(comparison_classifier, clf, dataset=dataset, train_size=train_sizes,
12 n_inputs=n_trys, write_worker_results=write) as pmap:
---> 13 results = pmap.compute()
/mnt/pns/home/shapcottk/python/filter_net_paper/scripts/acme/backend.py in compute(self, debug)
363 cnt = 0
364 while any(f.status == "pending" for f in futures):
--> 365 time.sleep(self.sleepTime)
366 new = max(0, sum([f.status == "finished" for f in futures]) - cnt)
367 cnt += new
KeyboardInterrupt:
Hmm, I tried to use a classifier that I know returns sparse data and got some really odd errors (logs below). It looks like sparse data can't be pickled either. Shall I try to make a minimum working example, or is this enough for you?
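To make the nested case from the original post concrete: scipy's issparse only sees the top level, so even detecting the offending outputs needs something like this throwaway recursive helper (contains_sparse is not part of acme):

from scipy.sparse import issparse

def contains_sparse(obj):
    # True if obj is a sparse matrix, or a list/tuple hiding one at any depth
    if issparse(obj):
        return True
    if isinstance(obj, (list, tuple)):
        return any(contains_sparse(item) for item in obj)
    return False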
...
distributed.scheduler - INFO - Remove worker <Worker 'tcp:// :43993', name: 24, memory: 0, processing: 145>
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Register worker <Worker 'tcp:// :46241', name: 14, memory: 0, processing: 7>
distributed.scheduler - INFO - Starting worker compute stream, tcp:// :46241
distributed.scheduler - INFO - Register worker <Worker 'tcp:// :33629', name: 31, memory: 0, processing: 1>
distributed.scheduler - INFO - Starting worker compute stream, tcp:// :33629
...
distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: <Worker 'tcp:// :37137', name: 23, memory: 0, processing: 1>, Got: <Worker 'tcp:// :39263', name: 8, memory: 0, processing: 0>, Key: ('from_sequence-b24b1b22c80e07407f4b29e957d0a408', 0)
...
distributed.scheduler - INFO - Register worker <Worker 'tcp:// :36881', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp:// .26:36881
distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing. Expected: <Worker 'tcp:// .22:37137', name: 23, memory: 0, processing: 1>, Got: <Worker 'tcp:// .27:46241', name: 14, memory: 0, processing: 0>, Key: ('from_sequence-b24b1b22c80e07407f4b29e957d0a408', 0)
...
Please consult the following SLURM log files for details:
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982346.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982321.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982324.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982316.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982330.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982358.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982314.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982320.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982328.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982333.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982338.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982356.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982312.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982318.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982326.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982331.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982354.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982315.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982325.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982329.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982344.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982313.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982319.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982327.out
/mnt/hpx/slurm/shapcottk/shapcottk_20210421-111042/slurm-3982317.out
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
/mnt/pns/home/shapcottk/python/filter_net_paper/scripts/acme/shared.py in ctrlc_catcher(*excargs, **exckwargs)
338
339 import IPython
--> 340 IPython.core.interactiveshell.InteractiveShell.showtraceback(*excargs)
341
342 # Relay exception handling back to system tools
~/.conda/envs/acme/lib/python3.8/site-packages/IPython/core/interactiveshell.py in showtraceback(self, exc_tuple, filename, tb_offset, exception_only, running_compiled_code)
2021 try:
2022 try:
-> 2023 etype, value, tb = self._get_exc_info(exc_tuple)
2024 except ValueError:
2025 print('No traceback available to show.', file=sys.stderr)
~/.conda/envs/acme/lib/python3.8/site-packages/IPython/core/interactiveshell.py in _get_exc_info(self, exc_tuple)
1969 etype, value, tb = sys.exc_info()
1970 else:
-> 1971 etype, value, tb = exc_tuple
1972
1973 if etype is None:
TypeError: cannot unpack non-iterable type object
The original exception:
Hi!
Hm, that's the (apparently dysfunctional) exception handler crashing. However, the line IPython.core.interactiveshell.InteractiveShell.showtraceback(*excargs) should not be there; that was a WIP commit. Could you do a git pull in the pickle_save branch and try again?
Hi!
Just tested the CTRL + C catcher with the latest version of the pickle_save branch - it does what it's supposed to do in my notebook:
Here's the code:
# Add acme to Python search path
import os
import sys
acme_path = os.path.abspath(".." + os.sep + "..")
if acme_path not in sys.path:
    sys.path.insert(0, acme_path)

from acme import ParallelMap
import time
import dask.distributed as dd

def long_running(dummy):
    time.sleep(30)
    return

with ParallelMap(long_running, [None]*10, setup_interactive=False, write_worker_results=False) as pmap:
    pmap.compute()
Hm, that's the (apparently dysfunctional) exception handler crashing. However, the line IPython.core.interactiveshell.InteractiveShell.showtraceback(*excargs) should not be there; that was a WIP commit. Could you do a git pull in the pickle_save branch and try again?
All seems fine now, this doesn't crash anymore and everything returns.
Thanks for the test code, that also works for me. It seems to be because I'm using my own client. This reproduces the issue:
from acme import ParallelMap, esi_cluster_setup
import time

def long_running(dummy):
    time.sleep(30)
    return

n_jobs = 10
client = esi_cluster_setup(partition="8GBXS", n_jobs=n_jobs)

with ParallelMap(long_running, [None]*n_jobs, setup_interactive=False, write_worker_results=False) as pmap:
    pmap.compute()
Ha, interesting. Thank you for the example! Same for me both locally and on the cluster. I'll look into this.
Hi Katharine!
I think I've figured it out (finally). Following the official IPython/Jupyter guidelines for creating custom exception handlers (via get_ipython().set_custom_exc((Exception,), custom_exc)) did not work; hacking it apparently did ;) I've just pushed the changes to the pickle_save branch (a git pull should bring you up to speed). I've tried the code in Python, IPython and Jupyter - whenever you have time, please feel free to test it again. Thank you!
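For reference, this is the officially documented pattern that kept misbehaving for us (a simplified sketch, not the code that actually landed in the branch):

from IPython import get_ipython

def custom_exc(shell, etype, evalue, tb, tb_offset=None):
    # a real handler would run the graceful dask client/worker shutdown here,
    # then defer to IPython's standard traceback machinery
    shell.showtraceback((etype, evalue, tb), tb_offset=tb_offset)

get_ipython().set_custom_exc((Exception,), custom_exc)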
Hi Stefan, Jupyter is so much fun ;) Whatever you did it seems to work now! Thanks so much for fixing that. Best, Katharine
Btw, when you Ctrl+C and then rerun your code without restarting the kernel, the acme printout is broken. This doesn't happen when the run completes successfully. Not very important, but maybe an easy fix?
<ParallelMap> INFO: <esi_cluster_setup> Requested job-count 50 exceeds `n_jobs_startup`: waiting for 10 jobs to come online, then proceed
<esi_cluster_setup> SLURM workers ready: 0/10 [elapsed time 00:00 | timeout at 01:00]A
<esi_cluster_setup> SLURM workers ready: 0/10 [elapsed time 00:01 | timeout at 01:00]A
...
<esi_cluster_setup> SLURM workers ready: 5/10 [elapsed time 00:04 | timeout at 01:00]A
<esi_cluster_setup> SLURM workers ready: 18/None [elapsed time 00:05 | timeout at 01:00]A
<esi_cluster_setup> SLURM workers ready: 18/None [elapsed time 00:05 | timeout at 01:00]
<ParallelMap> INFO: <esi_cluster_setup> Cluster dashboard accessible at http:// :8787/status
<ParallelMap> INFO: Attaching to global parallel computing client <Client: 'tcp:// :38687' processes=18 threads=18, memory=144.00 GB>
<ParallelMap> INFO: Preparing 50 parallel calls of `comparison_classifier` using 50 workers
<ParallelMap> INFO: Log information available at /mnt/hpx/slurm/shapcottk/shapcottk_20210428-094258
  0% |          | 0/50 [00:00<?]A
  0% |          | 0/50 [00:00<?]A
  0% |          | 0/50 [00:01<?]A
...
  0% |          | 0/50 [00:19<?]A
  2% |▏         | 1/50 [00:19<00:04]A
  4% |▍         | 2/50 [00:25<01:34]A
...
 96% |█████████▌| 48/50 [00:34<00:00]A
 98% |█████████▊| 49/50 [00:34<00:00]A
100% |██████████| 50/50 [00:35<00:00]
<ParallelMap> INFO: SUCCESS! Finished parallel computation. Results have been saved to /mnt/hpx/home/shapcottk/ACME_20210428-094303-486253
Hey Katharine! This looks "interesting"... I just pushed a new commit to the pickle_save branch that tries to force tqdm to stay on the line it initially printed to. Works on my machine - please feel free to try it whenever you have time :)
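In spirit, the fix amounts to something along these lines (a sketch of the idea, not the literal commit):

import time
from tqdm import tqdm

# Pinning the bar to its first line (position=0) keeps repeated redraws from
# stacking up in Jupyter after an interrupted run
for _ in tqdm(range(50), position=0, leave=True):
    time.sleep(0.05)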
Nice! Seems to work now, thanks!
Cool - thanks for the quick test-drive! Will merge into main then :)