I think dask and kubernetes are unlikely to be the culprits here; rather, I suspect your CreateSyncJob (or the RunNamespacedJob) task accumulates memory over time. Since these tasks are mostly IO bound, you might try running with a LocalDaskExecutor instead to eliminate dask.distributed from consideration.
from prefect.executors import LocalDaskExecutor

# Swapping out the executor for this should do it
executor = LocalDaskExecutor(num_workers=6)
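For context, wiring that into a flow looks roughly like this (a minimal sketch with made-up task and flow names, not your actual sync tasks):

from time import sleep

from prefect import Flow, task
from prefect.executors import LocalDaskExecutor

@task
def sync_one(name):
    sleep(1)  # stand-in for the IO-bound sync work
    return name

with Flow("sync-databases") as flow:
    sync_one.map(["db_a", "db_b", "db_c"])

# Thread-based workers are fine for IO-bound tasks
flow.executor = LocalDaskExecutor(num_workers=6)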
I just tried it with the LocalDaskExecutor, and the initial results are much more promising. Although the job has only been running for 10 minutes so far, the memory usage is much lower and actually constant.
Here is another screenshot of a flow which used the DaskExecutor and ran for many hours until it was finally OOMKilled:
So it does seem something may be off with dask?
--
Anyhow, for reference, here is my implementation of CreateSyncJob. It takes in a templated Kubernetes job spec, which is filled in by template_job_spec (i.e. setting the name of the job, environment variables, etc.), and then calls RunNamespacedJob.
from prefect import Task
from prefect.tasks.kubernetes import RunNamespacedJob


class CreateSyncJob(Task):
    """
    Task which runs a Kubernetes job that syncs a database from CloudSQL to BigQuery.

    Args:
        - job_spec_template (str): content of sql_to_bigquery.job.yaml.
        - spec (pd.Series): row of parse_specs_to_pandas containing sync variables.
    """

    def __init__(self, job_spec_template, spec, **kwargs):
        """
        Task which runs a Kubernetes job that syncs a database from CloudSQL to BigQuery.

        Args:
            - job_spec_template (str): content of sql_to_bigquery.job.yaml.
            - spec (pd.Series): row of parse_specs_to_pandas containing sync variables.
        """
        self.job_spec_template = job_spec_template
        self.spec = spec
        super().__init__(
            name=spec["sourceDatabase"],
            **kwargs,
        )

    def run(self, cloudsql_name: Task):
        """
        Templates the job spec and creates the Kubernetes job using Prefect's RunNamespacedJob.

        Args:
            - cloudsql_name (Task): output of the CloudSQLProxyManager (or generate_cloudsql_name).
        """
        # template_job_spec is defined elsewhere; it fills in the job name, env vars, etc.
        job_spec = template_job_spec(cloudsql_name, self.job_spec_template, self.spec)
        self.logger.info(f"Templated job_spec for {self.name}")
        self.logger.debug(job_spec)
        RunNamespacedJob(job_spec, namespace="prefect", kubernetes_api_key_secret=None).run()
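And roughly how it is used in the flow (a simplified sketch; the real flow is generated from a config file, so the spec loading and the CloudSQL proxy task here are just placeholders):

import pandas as pd

from prefect import Flow, task
from prefect.executors import LocalDaskExecutor

# Placeholders for the real config-driven inputs
job_spec_template = open("sql_to_bigquery.job.yaml").read()
specs = pd.DataFrame([{"sourceDatabase": "db_a"}, {"sourceDatabase": "db_b"}])

@task
def generate_cloudsql_name():
    # Stand-in for the CloudSQLProxyManager resource
    return "cloudsql-proxy-12345"

with Flow("sync-cloudsql-to-bigquery") as flow:
    cloudsql_name = generate_cloudsql_name()
    for _, spec in specs.iterrows():
        CreateSyncJob(job_spec_template, spec)(cloudsql_name)

flow.executor = LocalDaskExecutor(num_workers=6)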
Is there any benefit to using the DaskExecutor? Otherwise I'm fine with using the LocalDaskExecutor, which uses a thread pool instead of dask.distributed, right? If you need more information to diagnose this, I'm willing to research it 👍
Hmmm, that's interesting. I'd still be surprised if this is a bug in dask. A larger reproducible example I could run locally would definitely help. If you drop the with CloudSQLProxyManager bit and only run some dummy jobs (say, a long sleep task), are you still able to reproduce the issue? If so, that would help me debug locally.
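Something along these lines would be a good starting point (a sketch of what I mean, with placeholder task bodies and the executor settings from your flow):

from time import sleep

from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def dummy_job(i):
    # Stand-in for the Kubernetes sync job: just block for a while
    sleep(300)
    return i

with Flow("dask-memory-repro") as flow:
    dummy_job.map(list(range(15)))

# Same shape as your setup: 6 workers, 1 thread each
flow.executor = DaskExecutor(cluster_kwargs={"n_workers": 6, "threads_per_worker": 1})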
Is there any benefit to using the DaskExecutor? Otherwise I'm fine with using the LocalDaskExecutor, which uses a thread pool instead of dask.distributed, right?
For your flow above, I'd expect using a LocalDaskExecutor with threads to be sufficient. The tasks you're running are IO bound (so they will run fine in threads) and should be low in memory usage. See https://docs.prefect.io/orchestration/flow_config/executors.html#choosing-an-executor for our docs on selecting an executor to use.
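Concretely, the two local options look like this (a sketch; the scheduler argument controls whether work runs in threads or in separate processes):

from prefect.executors import LocalDaskExecutor

# IO-bound tasks: threads are cheap and share memory within one process
io_executor = LocalDaskExecutor(scheduler="threads", num_workers=6)

# CPU-bound tasks would want separate processes instead
cpu_executor = LocalDaskExecutor(scheduler="processes", num_workers=6)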
👋 We have a similar setup, running Prefect in Kubernetes and using the DaskExecutor for our flow with 8 workers. All workers eat more and more memory until they all OOM. After a bit of trial and error, we added these env vars to the Dask workers' YAML, which slows down the leak considerably. Credits to https://stackoverflow.com/questions/63680134/growing-memory-usage-leak-in-dask-distributed-profiler/63680548#63680548
DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL=10000ms
DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE=1000000ms
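For anyone who would rather set this from Python, the equivalent dask config keys are below (a sketch; the settings have to be in effect in the worker processes themselves, so the env vars on the worker pods remain the most reliable route):

import dask

# Slow down the dask.distributed profiler, which is the suspected source of the leak
dask.config.set({
    "distributed.worker.profile.interval": "10000ms",
    "distributed.worker.profile.cycle": "1000000ms",
})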
Other related links:
This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.
This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add feel free to re-open it.
Hello,
I think I am seeing this issue, and it does not seem to be solved by using the environment variables DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL and DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE, nor by using the LocalDaskExecutor.
We are using Prefect 1.4.
A minimal working example:
from prefect import task, Flow, run_configs, storage, executors
from time import sleep
from random import randint


@task()
def mem_n_slp(x):
    sleep(x)
    # Make the task fail randomly with 5% chance
    if randint(0, 100) < 5:
        raise ValueError("Random failure")
    return x


@task()
def just_slp(x):
    sleep(5 - x)
    return x


if __name__ == "__main__":
    project_name = "test-flows"
    flow_name = "test-ram-release"

    with Flow(
        name=flow_name,
        storage=storage.Azure(
            container="container-name",
            stored_as_script=False,
            connection_string_secret="AZURE_STORAGE_CONNECTION_STRING",
        ),
        run_config=run_configs.KubernetesRun(
            # We are actually using another image with the same prefect version,
            # so apologies if this image does not work
            image="prefecthq/prefect:1.4.0",
            env={
                "PREFECT__CLOUD__USE_LOCAL_SECRETS": True,
                "DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL": "10000ms",
                "DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE": "1000000ms",
            },
            image_pull_policy="Always",
            cpu_limit="0.5",
            cpu_request="0.5",
            memory_request="0.5Gi",
            memory_limit="0.5Gi",
            job_template={
                "apiVersion": "batch/v1",
                "kind": "Job",
                "spec": {
                    "ttlSecondsAfterFinished": 600,
                    "template": {
                        "metadata": {
                            "annotations": {
                                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                            },
                            "labels": {
                                "project-name": project_name,
                                "flow-name": flow_name,
                            },
                        },
                        "spec": {
                            "containers": [{"name": "flow"}],
                            "nodeSelector": {"prefect_label": "job_pod_node"},
                        },
                    },
                },
            },
        ),
    ) as current_flow:
        A = mem_n_slp.map(1000 * [1] + 1000 * [2])
        B = mem_n_slp.map(A)
        C = mem_n_slp.map(B)
        D = mem_n_slp.map(C)
        E = mem_n_slp.map(D)
        F = just_slp.map(E)

    current_flow.executor = executors.dask.LocalDaskExecutor(num_workers=250)
    current_flow.register(project_name=project_name)
I tried doubling and halving the values of DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL and DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE, but no luck.
We are observing a steady increase in memory up to the point that the container gets OOMKilled.
When we run the flow without the KubernetesRun run config, the flow completes.
Hi @charalamm - Prefect 1.x is no longer under active development; please upgrade to 2.x or 3.x.
@cicdw Thanks for your response. Is this a known problem with Prefect 1.x, though, and is there any known solution?
Description
I'm running Prefect in Kubernetes. What I have is a flow which spawns 15 CPU-intensive Kubernetes jobs. To get some parallelism, the flow is configured with a DaskExecutor (6 workers, 1 thread each).
What I see is that the prefect-job created by the Kubernetes agent uses quite a lot of resources, and that usage grows over time. See the attached screenshot.
Note: I'm talking about the job created by the Prefect agent. The actual job executing the code is only using 188 MiB.
It seems like there is a possible memory leak in either Prefect or Dask. Is there a better alternative for deploying this, one that uses fewer resources?
Expected Behavior
Well, first of all, I wouldn't expect this high memory usage. But most of all, I wouldn't expect it to keep growing indefinitely. In this instance the flow wasn't killed (yet) by Kubernetes due to high memory usage, but that is something I was running into.
Reproduction
Reproduction is a bit tricky as this flow is generated from a config file when running the script. I think I got all the relevant bits below:
Visualizing one of the flows shows:
Environment
Running Prefect Server 0.14.0 on Kubernetes 1.18.x in Google Cloud.
Originally reported on Slack: https://prefect-community.slack.com/archives/CL09KU1K7/p1610706804087800