Open olly-writes-code opened 5 years ago
I'm curious, what kind of workload are you running? That's a large amount of time spent in garbage collection.
If so is it possible to change the gc threshold/collect less aggressively?
I personally don't know much about how we handle GC, but you might want to look through the following file, which I think includes the code that's active here.
https://github.com/dask/distributed/blob/master/distributed/utils_perf.py
@mrocklin - sorry for the slow response. I'm running lots of parallel simulations. The task is looping through a large array of objects. Adjusting the objects and storing the simulation state. At the end of the simulation, an overall metric is stored, and the large dataset used is disregarded (probably garbage collected).
in distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
is the threshold the warning threshold or the threshold for which GC occurs?
warning. Python's GC runs fairly frequently.
I am experiencing the same on my workstation:
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://127.0.0.1:58788 remote=tcp://127.0.0.1:33619>
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
cdistributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
It's been at least 20 min printing these warnings and no computation has been done as far as I can say.
Edit 1: After about 30 min it started doing stuff:
Edit 2: I forgot to mention that I am creating a list of delayed objects that are then called by dask.persist
. See here.
You have lots of tasks relative to how many workers you have. You might want to look through https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
It's also worth noting that this error message
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
is most often (but not exclusively) the fault of the code that you're running, and not anything to do with Dask. Dask is just in a nice position to let you know if things like that are going on.
Hi,
I am starting with dask on my laptop, going through the dask-tutorial which is on github, but I have similar issue with "distributed.utils_perf - WARNING - full garbage collections took xxx% CPU time recently (threshold: 10%)" always when I start local cluster with the distributed scheduler
Is there any progress on this pls,, or any way to solve this?
@jklen Which version of the dask are you using? Is it 2020.12.0? If yes, could you please downgrade to 2.30.0?
I was encountering this issue after upgrading to the latest version ("2020.12.0"), but after down grading these warning messages didn't appear. But, of course, I am not sure.
hi, have similar problem - use cluster daskdev/dask:2.30.0 with 1 scheduler and 3 workers (deployed via helm chart). Cluster currently is in idle state - just some test tasks form time to time. In few days scheduler use a lot of resources (in idle cluster...) and start spamming warnings
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
distributed.core - INFO - Event loop was unresponsive in Scheduler for 12.38s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
distributed.core - INFO - Event loop was unresponsive in Scheduler for 12.11s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.utils_perf - WARNING - full garbage collections took 33% CPU time recently (threshold: 10%)
Also i use web checks for dashboard - they start to fail in few days after scheduler start too.
scheduler:
image:
repository: "daskdev/dask"
tag: 2.30.0
resources:
limits:
cpu: 1.8
memory: 6G
requests:
cpu: 1.8
memory: 6G
Scheduler utilization in idle state Scheduler cpu usage graph Scheduler memory usage graph
Hi, just to add, I'm getting the same errors while repartitioning a parquet file. I suspect that it has more to do with the computation being very quick and the garbage collection just being the more burdensome task. Maybe the solution is to add to only warn if GC takes a long fraction of a long task? It seems to be the idea behind the first comment in _gc_callback in distributed.utils_perf which emits this warning:
def _gc_callback(self, phase, info):
# Young generations are small and collected very often,
# don't waste time measuring them
if info["generation"] != 2:
return
For reference, my code is:
from dask.distributed import Client
client = Client()
# outputs: <Client: 'tcp://127.0.0.1:45451' processes=7 threads=28, memory=48.32 GB>
import dask.dataframe as dataframe
df = dataframe.read_parquet('data_simulated_partitioned.parquet')
df.npartitions
# 3941
df = df.repartition(partition_size='100MB')
# This is were hundreds of warnings arise
df.npartitions
# 137
df = df.persist()
# Again, hundreds of warnings
I get the same messages, and I don't think it could be the fault of my code, as it happens when just using .unique()
on a dask.dataframe.core.Series
object. All of a sudden I'm continually bombarded with this message, and over time the CPU time percentage slowly trickles up.
my_df["my_column"].unique().compute()
gives this output:
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 16% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
full garbage collections took 17% CPU time recently (threshold: 10%)
The task doesn't seem difficult, as it completes in a matter of seconds.
I have get the same problem while processing large sets of data. For my case, the warning did not affect my output.
I somehow get away by allocating more workers to the job (used to be only one worker).
I am not sure about the implications and consequences (I also haven't test the speed before and after).
Hope that helps.
Same problem
I'm running into the same problem whilst running modin.pandas. I've installed dask version 2022.1.1 for python 3.9. Seems like dask grinds my calculations to a slow trickle.
Just disable garbage collection. You may need to restart workers as memory fills
gc.disable()
Hi,
I'm having the same problem. I'm on a system with 64 cores around 1TB of memory. I'm only using 32 cores but I'm working with a VERY LARGE dataset (~1.3TB) of text and doing some preprocessing using SpaCy. My dask version is 2022.7.0. I'm working on a Jupyter notebook and I just get this warning and even after more than hour, I don't see any progress on the dashboard. I tried disabling the garbage collection, but that didn't work for me. Any other workarounds or possible solutions?
I don't see any progress on the dashboard.
@sudarshan85 It's very likely that you are encountering a different issue and the GC warning log is unrelated. I suggest to try a more recent version of dask and open a new issue if this still persists. If there is no progress at all, this is a critical issue we'd like to get to the bottom of
Is there a way to change the threshold? Or just disable this specific warning? I don't want to set logging to errors only (because other warnings might be relevant)
The solution I found is to restart the client before every compute.
# Clean up old garbage
client.restart()
...
Your code
I have been bitten by this as well. My use case is processing pairs of ICESat-2 and CryoSat-2 data files to extract overlapping sensor data that occur within a narrow time window. The task is treated as embarrassingly parallel, and the processing function is given one CryoSat-2 filename and one or more ICESat-2 filenames to process, but only one IS2 and CS2 files are open at any given time.
One thing I have done to help is to set threads_per_worker=1. With threads_per_worker=2 it gives me these warnings after about 2% processing. Even with threads_per_worker=1, it happens, but appear to happen much further along.
It would be nice to find a proper resolution for this.
I will also add that I just created a new clean environment and reinstalled dask, etc., which installed version 2023.12.1
Same issue here
The solution I found is to restart the client before every compute.
Clean up old garbage
client.restart()
... Your code
@DeusNexus, Are you saying that you are doing this inside of the futures function map? Is that safe? I think I am misunderstanding your solution.
During the process I'm running I very quickly get this warning (for pretty much each worker)
However, the percent of memory used is by each worker is low -
I have plenty of memory to work with so my question is - is it possible dask is aggressively doing gc? If so is it possible to change the gc threshold/collect less aggressively?
I've tried to manually adjust the
gc.threshold
but this seems to have no effect(using distributed 1.28.1 and dask 1.2.2)