dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

excessive CPU time spent on gc (even after manually adjusting gc thresholds) #2801

Open olly-writes-code opened 5 years ago

olly-writes-code commented 5 years ago

During the process I'm running, I very quickly get this warning (for pretty much every worker):

distributed.utils_perf - WARNING - full garbage collections took 36% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)

However, the percentage of memory used by each worker is low:

[screenshot: per-worker memory usage]

I have plenty of memory to work with, so my question is: is it possible that Dask is doing GC too aggressively? If so, is it possible to change the GC threshold so it collects less aggressively?

I've tried to manually adjust the gc thresholds, but this seems to have no effect:

import gc

# Multiply each generation's collection threshold by 10
g0, g1, g2 = gc.get_threshold()
gc.set_threshold(g0 * 10, g1 * 10, g2 * 10)

(using distributed 1.28.1 and dask 1.2.2)
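
One likely reason the snippet above has no visible effect (a guess, not confirmed in the thread) is that it only changes the thresholds in the client process, while the warnings come from the worker processes. A minimal sketch of applying the same change on every worker via Client.run, with the Client construction hypothetical:

import gc
from dask.distributed import Client

client = Client()  # hypothetical: in practice, connect to your existing cluster

def raise_gc_thresholds(factor=10):
    # Multiply each generation's collection threshold in this (worker) process
    g0, g1, g2 = gc.get_threshold()
    gc.set_threshold(g0 * factor, g1 * factor, g2 * factor)

client.run(raise_gc_thresholds)  # executes the function on every worker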

mrocklin commented 5 years ago

I'm curious, what kind of workload are you running? That's a large amount of time spent in garbage collection.

If so is it possible to change the gc threshold/collect less aggressively?

I personally don't know much about how we handle GC, but you might want to look through the following file, which I think includes the code that's active here.

https://github.com/dask/distributed/blob/master/distributed/utils_perf.py

olly-writes-code commented 5 years ago

@mrocklin - sorry for the slow response. I'm running lots of parallel simulations. The task loops through a large array of objects, adjusting the objects and storing the simulation state. At the end of the simulation, an overall metric is stored, and the large dataset used is discarded (probably garbage collected).

olly-writes-code commented 5 years ago

In distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%), is the threshold the warning threshold or the threshold at which GC occurs?

mrocklin commented 5 years ago

The warning threshold. Python's GC runs fairly frequently.
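
For reference (not stated in the thread): the thresholds that control when collection actually occurs are CPython's own and can be inspected directly, independently of Dask's 10% warning threshold:

import gc

# CPython's per-generation collection thresholds; the defaults are (700, 10, 10)
print(gc.get_threshold())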

muammar commented 4 years ago

I am experiencing the same on my workstation:

distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
[... the same warning repeated dozens of times ...]
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP  local=tcp://127.0.0.1:58788 remote=tcp://127.0.0.1:33619>
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
[... and so on ...]

It's been at least 20 minutes printing these warnings, and no computation has been done as far as I can tell.

Edit 1: After about 30 min it started doing stuff:

[screenshot: Screen Shot 2019-09-27 at 18 56 09]

Edit 2: I forgot to mention that I am creating a list of delayed objects that are then passed to dask.persist. See here.
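
A minimal sketch of that pattern (the linked code is not reproduced here, so simulate is a hypothetical stand-in for the real per-task computation):

import dask

@dask.delayed
def simulate(seed):
    # hypothetical stand-in for the real simulation
    return seed * 2

lazy_results = [simulate(i) for i in range(10_000)]
(persisted,) = dask.persist(lazy_results)  # with a distributed client, computes in the background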

mrocklin commented 4 years ago

You have lots of tasks relative to how many workers you have. You might want to look through https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs

mrocklin commented 4 years ago

It's also worth noting that this warning message

distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)

is most often (but not exclusively) the fault of the code that you're running, and not anything to do with Dask. Dask is just in a nice position to let you know if things like that are going on.

jklen commented 3 years ago

Hi,

I am starting out with Dask on my laptop, going through the dask-tutorial on GitHub, but I have a similar issue with "distributed.utils_perf - WARNING - full garbage collections took xxx% CPU time recently (threshold: 10%)" whenever I start a local cluster with the distributed scheduler.

Is there any progress on this, please, or any way to solve it?

arnabbiswas1 commented 3 years ago

@jklen Which version of dask are you using? Is it 2020.12.0? If so, could you please downgrade to 2.30.0?

I was encountering this issue after upgrading to the latest version (2020.12.0), but after downgrading, the warning messages didn't appear. Of course, I am not sure that is the underlying cause.

andrii29 commented 3 years ago

Hi, I have a similar problem. I use a daskdev/dask:2.30.0 cluster with 1 scheduler and 3 workers (deployed via Helm chart). The cluster is currently idle, with just some test tasks from time to time. Within a few days the scheduler uses a lot of resources (in an idle cluster...) and starts spamming warnings:

distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
distributed.core - INFO - Event loop was unresponsive in Scheduler for 12.38s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
distributed.core - INFO - Event loop was unresponsive in Scheduler for 12.11s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)

I also use web checks for the dashboard; they too start to fail a few days after the scheduler starts.

scheduler:
  image:
    repository: "daskdev/dask"
    tag: 2.30.0
  resources:
    limits:
      cpu: 1.8
      memory: 6G
    requests:
      cpu: 1.8
      memory: 6G

[screenshot: scheduler utilization in idle state]
[screenshot: scheduler CPU usage graph]
[screenshot: scheduler memory usage graph]

benjaminvatterj commented 3 years ago

Hi, just to add: I'm getting the same errors while repartitioning a parquet file. I suspect it has more to do with the computation being very quick and garbage collection simply being the more burdensome task. Maybe the solution is to only warn if GC takes a large fraction of a long task? That seems to be the idea behind the first comment in _gc_callback in distributed.utils_perf, which emits this warning:

    def _gc_callback(self, phase, info):
        # Young generations are small and collected very often,
        # don't waste time measuring them
        if info["generation"] != 2:
            return

For reference, my code is:

from dask.distributed import Client
client = Client()
# outputs: <Client: 'tcp://127.0.0.1:45451' processes=7 threads=28, memory=48.32 GB>
import dask.dataframe as dataframe
df = dataframe.read_parquet('data_simulated_partitioned.parquet')
df.npartitions
# 3941
df = df.repartition(partition_size='100MB')
# This is where hundreds of warnings arise
df.npartitions
# 137
df = df.persist()
# Again, hundreds of warnings

mjspeck commented 3 years ago

I get the same messages, and I don't think it can be the fault of my code, as it happens when just using .unique() on a dask.dataframe.core.Series object. All of a sudden I'm continually bombarded with this message, and over time the CPU time percentage slowly creeps up.

my_df["my_column"].unique().compute()

gives this output:

full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)

The task doesn't seem difficult, as it completes in a matter of seconds.

RayAtUofT commented 3 years ago

I got the same problem while processing large sets of data. In my case, the warning did not affect my output.

I somehow got around it by allocating more workers to the job (it used to be only one worker).

I am not sure about the implications and consequences (I also haven't tested the speed before and after).

Hope that helps.

LiutongZhou commented 2 years ago

Same problem

acnash commented 2 years ago

I'm running into the same problem whilst running modin.pandas. I've installed dask version 2022.1.1 for Python 3.9. It seems like dask grinds my calculations to a slow trickle.

kahemker commented 1 year ago

Just disable garbage collection. You may need to restart workers as memory fills:

import gc

gc.disable()  # turn off automatic garbage collection in this process
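
Note that gc.disable() in the client or notebook process does not reach the worker processes, where many of these collections happen. A sketch of disabling GC on every worker as well, with the Client construction hypothetical:

from dask.distributed import Client
import gc

client = Client()  # hypothetical: in practice, connect to your existing cluster

def disable_gc():
    gc.disable()  # turn off automatic collection in this worker process

client.run(disable_gc)  # executes on every worker; restart workers as memory fills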

sudarshan85 commented 1 year ago

Hi,

I'm having the same problem. I'm on a system with 64 cores and around 1 TB of memory. I'm only using 32 cores, but I'm working with a VERY LARGE dataset (~1.3 TB) of text and doing some preprocessing using spaCy. My dask version is 2022.7.0. I'm working in a Jupyter notebook, and I just get this warning; even after more than an hour, I don't see any progress on the dashboard. I tried disabling garbage collection, but that didn't work for me. Any other workarounds or possible solutions?

fjetter commented 1 year ago

I don't see any progress on the dashboard.

@sudarshan85 It's very likely that you are encountering a different issue and the GC warning log is unrelated. I suggest trying a more recent version of dask and opening a new issue if this still persists. If there is no progress at all, that is a critical issue we'd like to get to the bottom of.

Danferno commented 1 year ago

Is there a way to change the threshold? Or just disable this specific warning? I don't want to set logging to errors only, because other warnings might be relevant.
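
One way to silence just this warning while keeping other warnings (not suggested elsewhere in the thread; it assumes the logger is named distributed.utils_perf, as the log prefix indicates):

import logging

# Raise only this logger's level so the GC warning is suppressed while
# other distributed warnings still come through.
logging.getLogger("distributed.utils_perf").setLevel(logging.ERROR)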

DeusNexus commented 11 months ago

The solution I found is to restart the client before every compute.

# Clean up old garbage
client.restart()
# ... your code

ebo commented 8 months ago

I have been bitten by this as well. My use case is processing pairs of ICESat-2 and CryoSat-2 data files to extract overlapping sensor data that occur within a narrow time window. The task is treated as embarrassingly parallel, and the processing function is given one CryoSat-2 filename and one or more ICESat-2 filenames to process, but only one IS2 file and one CS2 file are open at any given time.

One thing I have done to help is to set threads_per_worker=1. With threads_per_worker=2 it gives me these warnings after about 2% of the processing. Even with threads_per_worker=1 it happens, but it appears to happen much further along.
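
A minimal sketch of that workaround on a local cluster (n_workers=8 is an arbitrary choice, not from the comment):

from dask.distributed import Client

# One thread per worker process, as described above
client = Client(n_workers=8, threads_per_worker=1)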

It would be nice to find a proper resolution for this.

I will also add that I just created a new, clean environment and reinstalled dask, etc., which installed version 2023.12.1.

DanyaLearning commented 7 months ago

Same issue here

ebo commented 7 months ago

The solution I found is to restart the client before every compute.

# Clean up old garbage
client.restart()
# ... your code

@DeusNexus, are you saying that you are doing this inside the function you map over futures? Is that safe? I think I am misunderstanding your solution.