fractal-analytics-platform / fractal-client

Command-line client for Fractal
https://fractal-analytics-platform.github.io/fractal-client
BSD 3-Clause "New" or "Revised" License

Crashes with 9 pyramid levels #90

Closed · jluethi closed this 2 years ago

jluethi commented 2 years ago

I've been trying to rerun the 23 well dataset with 9 pyramid levels and it failed with the following error message.

```
Traceback (most recent call last):
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/parsl/dataflow/dflow.py", line 776, in sanitize_and_wrap
    new_inputs.extend([dep.result()])
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/parsl/dataflow/dflow.py", line 287, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/parsl/dataflow/dflow.py", line 474, in _unwrap_remote_exception_wrapper
    result.reraise()
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/parsl/app/errors.py", line 138, in reraise
    reraise(t, v, v.__traceback__)
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/parsl/app/errors.py", line 176, in wrapper
    return func(*args, **kwargs)  # type: ignore
  File "../fractal/fractal_cmd.py", line 577, in app
    return dict_tasks[task](zarrurl, **kwargs_)
  File "/net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/mwe_fractal/fractal/tasks/yokogawa_to_zarr.py", line 153, in yokogawa_to_zarr
    f_matrices[level] = da.coarsen(
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/dask/array/core.py", line 2689, in rechunk
    return rechunk(self, chunks, threshold, block_size_limit, balance)
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/dask/array/rechunk.py", line 309, in rechunk
    chunks = tuple(_balance_chunksizes(chunk) for chunk in chunks)
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/dask/array/rechunk.py", line 309, in <genexpr>
    chunks = tuple(_balance_chunksizes(chunk) for chunk in chunks)
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/dask/array/rechunk.py", line 764, in _balance_chunksizes
    new_chunks = [
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/dask/array/rechunk.py", line 765, in <listcomp>
    _get_chunks(sum(chunks), chunk_len)
  File "/data/homes/jluethi/.conda/envs/fractal_38/lib/python3.8/site-packages/dask/array/rechunk.py", line 735, in _get_chunks
    leftover = n % chunksize
ZeroDivisionError: integer division or modulo by zero
```

According to my calculations, 9 pyramid levels shouldn't lead to a chunk size of 0 anywhere. The smallest pyramid level of a whole well would be roughly 80x76 pixels (approximately 76, with some rounding, because the dimension isn't evenly divisible by 2).
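For context, a minimal sketch of this arithmetic; the base dimensions below are assumptions chosen only so that level 8 lands near the 80x76 figure above, and may not match the real well size:

```python
# Illustrative sketch (base dimensions are assumed, not taken from the dataset):
# size of one dimension at each pyramid level, rounding down.
def pyramid_sizes(base_size, coarsening, num_levels):
    return [base_size // coarsening**level for level in range(num_levels)]

# With 2x coarsening and 9 levels, the smallest level stays well above zero:
print(pyramid_sizes(20480, 2, 9))  # [20480, 10240, 5120, 2560, 1280, 640, 320, 160, 80]
print(pyramid_sizes(19440, 2, 9))  # [19440, 9720, 4860, 2430, 1215, 607, 303, 151, 75]
```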

I'm trying to rerun with 8 levels. If that doesn't work, I'll try with 5 levels where the pyramid sizes remain integers without rounding to see if the issue is there.

jluethi commented 2 years ago

Also, @tcompa, when are running workflows supposed to appear on the parsl monitor? Neither the failed one nor the currently running workflow shows up in the monitoring for me. I restarted the monitoring service and nothing changed. The monitoring service was already running when I submitted the jobs, though. Does that interfere with things?

jluethi commented 2 years ago

Hmm, weird. With 8 pyramid levels, the experiment ran through. I don't really understand why it fails with 9 but works with 8...

tcompa commented 2 years ago

I'll have a look now at the many-levels error.

> Also, @tcompa when are running workflows supposed to appear on the parsl monitor?

As soon as `workflow_apply` submits jobs, i.e. as soon as you see them on `squeue`.

> Neither the failed on nor the currently running workflow shows up on the monitoring for me. I restarted the monitoring service and this stays the same. The monitoring service was already running though when I submitted the jobs. Does that interfere with things?

You are likely hitting a known error, which was fixed in a recent PR (https://github.com/Parsl/parsl/pull/2324) but is not yet available in the parsl version Fractal uses. FYI, the bug is that parsl-visualize creates a wrong db: https://github.com/Parsl/parsl/issues/2266.

Quick workaround:

More robust solution: we should maintain our own parsl fork with the patches we need. At the moment we are installing Jacopo's fork, which however branches off their dev branch rather than their stable 1.2 version.

tcompa commented 2 years ago

Quick check: are you sure you are using 2x coarsening? If it were 3x, 8 or 9 levels would already be close to the maximum possible value; see e.g. the level sizes below:

```
level 0: 2160*8 = 17280
level 1: 5760
level 2: 1920
level 3: 640
level 4: 213
level 5: 71
level 6: 23
level 7: 7
level 8: 2
```

Anyway, I'm testing this and I am adding an explicit check during pyramid creation.
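A rough sketch of what such a check could look like; the function and parameter names here are hypothetical and not the actual code added in the commits referenced below:

```python
# Hypothetical sketch of an explicit pre-flight check (names are illustrative,
# not the actual Fractal implementation).
def validate_pyramid_params(shape, num_levels, coarsening_xy):
    """Raise if any dimension would shrink below 1 pixel at some pyramid level."""
    for level in range(num_levels):
        coarsened = [s // coarsening_xy**level for s in shape]
        if min(coarsened) < 1:
            raise ValueError(
                f"Level {level} would have shape {coarsened} "
                f"(base shape {list(shape)}, coarsening factor {coarsening_xy}); "
                "reduce num_levels or the coarsening factor."
            )

# With the numbers from the table above (3x coarsening, base dimension 17280):
validate_pyramid_params((17280, 17280), num_levels=9, coarsening_xy=3)    # passes, level 8 is 2 px
# validate_pyramid_params((17280, 17280), num_levels=10, coarsening_xy=3) # raises, level 9 would be 0 px
```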

jluethi commented 2 years ago

My bad @tcompa! I mistakenly had the coarsening set to 4 (which is actually quite a bad default, as I think it hurts visualization performance quite a bit, but that remains to be tested => I'll report back once I've rerun it with an actual coarsening factor of 2).
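For reference, the arithmetic with a coarsening factor of 4 (assuming the 17280-pixel dimension from the table above) is consistent with 8 levels running through while 9 levels crash:

```python
# Assumed 17280-pixel base dimension, 4x coarsening:
print([17280 // 4**level for level in range(8)])  # ends at 17280 // 4**7 = 1
print([17280 // 4**level for level in range(9)])  # ends at 17280 // 4**8 = 0, a zero-sized level
```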

Looking forward to having a unified pipeline file, because right now I sometimes forget to change some parameters in one of the settings files after pulling in changes from the repo again.

So the error is correct then. We could think about whether there is a way to check for this early on, but I think the pipeline fails "fast enough" that it isn't a huge loss of time.

jluethi commented 2 years ago

> You are likely hitting a known error

Ok, it's not urgent to create workarounds for me at the moment, I'm looking forward to this fix then :)

tcompa commented 2 years ago

I added more explicit checks in b009393bc8588ad3f195d4731a54ad1e24761321 and ed9c62ea4c408ef4990b654d08a54cbffb18e54d. Closing this issue.