dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

RecursionError when using PerformanceReport context manager #8578

Open jinmannwong opened 8 months ago

jinmannwong commented 8 months ago

Describe the issue:

When executing certain custom task graphs with the PerformanceReport context manager I get log warnings like the following:

2024-03-13 14:43:10,263 - distributed.sizeof - WARNING - Sizeof calculation failed. Defaulting to -1 B
Traceback (most recent call last):
  File ".../site-packages/distributed/sizeof.py", line 17, in safe_sizeof
    return sizeof(obj)
  File ".../site-packages/dask/utils.py", line 773, in __call__
    return meth(arg, *args, **kwargs)
  File ".../site-packages/dask/sizeof.py", line 96, in sizeof_python_dict
    + sizeof(list(d.values()))
  File ".../site-packages/dask/utils.py", line 773, in __call__
    return meth(arg, *args, **kwargs)
  File ".../site-packages/dask/sizeof.py", line 59, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))
  File ".../site-packages/dask/utils.py", line 773, in __call__
    return meth(arg, *args, **kwargs)

which repeats until it finally ends with

  File ".../site-packages/dask/sizeof.py", line 59, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))
RecursionError: maximum recursion depth exceeded

The computation still completes correctly and this problem doesn't arise when executing without the performance report.
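To illustrate the failure mode in the traceback, here is a minimal sketch of one way a recursive size estimate can exhaust the stack: a container that contains itself. This is not dask's code. `naive_sizeof` and `safe_naive_sizeof` are hypothetical stand-ins that only loosely mirror the shape of `sizeof_python_dict`, `sizeof_python_collection`, and `safe_sizeof` above, and it is an assumption, not an established fact, that a self-referencing object is involved in this issue.

```python
import sys


def naive_sizeof(obj):
    """Hypothetical recursive size estimate (not dask's implementation)."""
    if isinstance(obj, dict):
        # Size of the dict itself plus the sizes of its keys and values.
        return (
            sys.getsizeof(obj)
            + naive_sizeof(list(obj.keys()))
            + naive_sizeof(list(obj.values()))
        )
    if isinstance(obj, (list, tuple, set, frozenset)):
        total = sys.getsizeof(obj)
        for item in obj:
            total += naive_sizeof(item)
        return total
    return sys.getsizeof(obj)


def safe_naive_sizeof(obj, default=-1):
    """Fall back to a default instead of propagating the failure."""
    try:
        return naive_sizeof(obj)
    except RecursionError:
        return default


# A dict that contains itself has no finite nesting depth.
cyclic = {"name": "example"}
cyclic["self"] = cyclic

flat = {"a": [1, 2, 3], "b": "text"}
flat_size = safe_naive_sizeof(flat)      # a positive byte count
cyclic_size = safe_naive_sizeof(cyclic)  # recursion never terminates, so the default is returned
```

A deeply nested but finite structure would fail the same way once its depth exceeds the interpreter's recursion limit, so a cycle is only one possible explanation.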

Minimal Complete Verifiable Example:

This is a small example that reproduces the problem, using the xarray data from https://github.com/pydata/xarray-data/blob/master/rasm.nc.

from dask.distributed import Client, performance_report
import xarray as xr

dask_graph = {"source": (xr.load_dataset, "rasm.nc")}
with Client() as client:
    with performance_report(filename="dask-report.html"):
        client.get(dask_graph, "source")

Environment:

jrbourbeau commented 8 months ago

Thanks for the report @jinmannwong. Unfortunately I'm not able to reproduce with the following steps:

# Create a fresh software environment with the specified version of `dask`
$ mamba create -n test python=3.11 dask=2024.2.0 xarray netcdf4
$ mamba activate test
$ python test.py

where test.py is:

from dask.distributed import Client, performance_report
import xarray as xr

dask_graph = {"source": (xr.load_dataset, "rasm.nc")}
if __name__ == "__main__":
    with Client() as client:
        with performance_report(filename="dask-report.html"):
            client.get(dask_graph, "source")

I also tried with the latest dask + distributed release and things work as expected.

Are you doing something different than what I described above? What's the output of running $ dask info versions in your terminal? Also, what version of xarray are you using?

jinmannwong commented 8 months ago

Thanks for looking into this. I was running in a virtual environment that had a lot of other dependencies installed, and indeed when I ran with just the required dependencies the problem didn't arise. I combed through the other dependencies and realised that the problem arises when cupy-cuda11x=13.0.0 and jax=0.4.25 are installed together. When I run with the dependencies you listed plus only one of cupy or jax, there is no problem.

The output of $ dask info versions is:

{
  "Python": "3.10.10",
  "Platform": "Linux",
  "dask": "2024.2.0",
  "distributed": "2024.2.0",
  "numpy": "1.26.4",
  "pandas": "2.2.0",
  "cloudpickle": "3.0.0",
  "fsspec": "2024.2.0",
  "bokeh": "3.3.4",
  "pyarrow": null,
  "zarr": null
}

and I am using xarray version 2024.2.0.
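For anyone wanting to check whether their own environment has the same combination, a small sketch using only the standard library is shown here. The helper name `installed_version` is hypothetical, and the distribution names are taken from this comment (other CUDA builds of cupy are published under different distribution names, which this check would not catch).

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names taken from the comment above.
suspects = ["cupy-cuda11x", "jax"]
found = {name: installed_version(name) for name in suspects}

# True only when every suspect distribution is installed.
both_present = all(version is not None for version in found.values())
print(found, "both installed:", both_present)
```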

jrbourbeau commented 8 months ago

Hmm, that's interesting. I am able to reproduce when I install cupy-cuda11x=13.0.0 and jax=0.4.25. I'm not sure what the problematic dictionary is here that sizeof can't handle.

cc @crusaderky @charlesbluca @quasiben in case someone has bandwidth to dig in a bit
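As a possible starting point for digging in, the sketch below walks a nested dict/list/tuple structure and reports the path to the first self-reference it finds. `find_cycle` is a hypothetical helper, not part of dask or distributed, and it assumes the offending object is built from plain Python containers; that a cycle is the cause here is unconfirmed. It uses an explicit stack so that it does not hit the recursion limit itself.

```python
CONTAINERS = (dict, list, tuple)


def _children(obj):
    """Iterate over (key_or_index, child) pairs of a container."""
    if isinstance(obj, dict):
        return iter(list(obj.items()))
    return iter(list(enumerate(obj)))


def find_cycle(root):
    """Return the key/index path to the first self-reference, or None."""
    if not isinstance(root, CONTAINERS):
        return None
    on_path = {id(root)}  # ids of containers between the root and the current node
    # Each stack entry: (container, path to it, iterator over its children).
    stack = [(root, (), _children(root))]
    while stack:
        obj, path, children = stack[-1]
        for key, child in children:
            if not isinstance(child, CONTAINERS):
                continue
            if id(child) in on_path:
                return path + (key,)
            on_path.add(id(child))
            stack.append((child, path + (key,), _children(child)))
            break  # descend into the child before visiting its siblings
        else:
            # All children visited: this container leaves the current path.
            on_path.discard(id(obj))
            stack.pop()
    return None


# Usage: a dict whose nested value points back at the outer dict.
outer = {"a": [1, {"b": None}]}
outer["a"][1]["b"] = outer
cycle_path = find_cycle(outer)
```

If this returns `None` for the suspect object, the structure may instead be nested deeper than the recursion limit, or the recursion may pass through object types this sketch does not inspect.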

tuckerbuchy commented 7 months ago

For what it's worth, I'm experiencing this same issue when dealing with a large number of GeoJSON-formatted dictionaries. I'm not sure whether that is a specific cause here, but I have started seeing the same error as in the original post.

alessioarena commented 6 months ago

I'm also experiencing this issue. In my case the computation is much slower when requesting the performance report, but I suspect that has something to do with Jupyter struggling to handle the size of the output stream.
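If the volume of log output is the concern, one thing that might be worth trying is raising the level of the `distributed.sizeof` logger (the name comes from the warning line in the original report) using the standard `logging` module. This is an untested sketch with respect to this issue: it only hides the warning rather than fixing the underlying failure, and it only affects the process in which it runs, so warnings emitted in separate worker processes would need their own configuration.

```python
import logging

# Logger name taken from the warning line in the original report.
sizeof_logger = logging.getLogger("distributed.sizeof")

# Hide WARNING-level records from this logger; ERROR and above still get through.
sizeof_logger.setLevel(logging.ERROR)
```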

My dask info versions:

{
  "Python": "3.10.12",
  "Platform": "Linux",
  "dask": "2023.10.0",
  "distributed": "2023.10.0",
  "numpy": "1.23.4",
  "pandas": "2.1.1",
  "cloudpickle": "3.0.0",
  "fsspec": "2023.9.2",
  "bokeh": "3.2.2",
  "fastparquet": "2023.8.0",
  "pyarrow": "13.0.0",
  "zarr": "2.16.1"
}

enrico-mi commented 4 days ago

FWIW, I encounter the same bug when testing my code like this:

with performance_report(filename="test_performance.html"):
    my_script.my_method("file.parquet")

It disappears if I call my script multiple times on multiple files as in:

with performance_report(filename="test_performance.html"):
    for s in range(10):
        my_script.my_method("file_"+str(s)+".parquet")

My dask info versions:

{
  "Python": "3.9.10",
  "Platform": "Linux",
  "dask": "2024.7.1",
  "distributed": "2024.7.1",
  "numpy": "1.26.4",
  "pandas": "2.2.2",
  "cloudpickle": "3.1.0",
  "fsspec": "2023.10.0",
  "bokeh": "3.4.3",
  "pyarrow": "16.1.0",
  "zarr": null
}