dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

`dagster-webserver` memory leak #18997

Open aaaaahaaaaa opened 11 months ago

aaaaahaaaaa commented 11 months ago

Dagster version

1.5.13

What's the issue?

dagster-webserver 1.5.13 seems to have some kind of memory leak. Since we updated to that version, we can observe a steady increase in memory usage over the last couple of weeks.

image image

What did you expect to happen?

No response

How to reproduce?

No response

Deployment type

Dagster Helm chart

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

alangenfeld commented 11 months ago

I don't see any notable commits in 1.5.13 on initial inspection

Reverting to 1.5.12 resolves the issue.

How exactly did you do this? Can you report the Python environments in the two containers (pip list / pip freeze)? Trying to discern if it's possible that the leak is from a dependency that also changed between the two container images.

aaaaahaaaaa commented 11 months ago

How exactly did you do this?

We changed the helm chart version. We literally just reverted the Renovate bot commit.

1.5.12

pip list

Package                     Version
--------------------------- ------------
alembic                     1.13.0
amqp                        5.2.0
aniso8601                   9.0.1
annotated-types             0.6.0
anyio                       4.1.0
async-timeout               4.0.3
azure-core                  1.29.5
azure-identity              1.15.0
azure-storage-blob          12.19.0
azure-storage-file-datalake 12.14.0
backoff                     2.2.1
billiard                    4.2.0
boto3                       1.33.12
botocore                    1.33.12
cachetools                  5.3.2
celery                      5.3.6
certifi                     2023.11.17
cffi                        1.16.0
charset-normalizer          3.3.2
click                       8.1.7
click-didyoumean            0.3.0
click-plugins               1.1.1
click-repl                  0.3.0
coloredlogs                 14.0
croniter                    2.0.1
cryptography                41.0.7
dagster                     1.5.12
dagster-aws                 0.21.12
dagster-azure               0.21.12
dagster-celery              0.21.12
dagster-celery-k8s          0.21.12
dagster-gcp                 0.21.12
dagster-graphql             1.5.12
dagster-k8s                 0.21.12
dagster-pandas              0.21.12
dagster-pipes               1.5.12
dagster-postgres            0.21.12
dagster-webserver           1.5.12
db-dtypes                   1.1.1
docstring-parser            0.15
exceptiongroup              1.2.0
flower                      2.0.1
fsspec                      2023.12.2
google-api-core             2.15.0
google-api-python-client    2.110.0
google-auth                 2.25.2
google-auth-httplib2        0.1.1
google-cloud-bigquery       3.13.0
google-cloud-core           2.4.1
google-cloud-storage        2.13.0
google-crc32c               1.5.0
google-resumable-media      2.6.0
googleapis-common-protos    1.62.0
gql                         3.4.1
graphene                    3.3
graphql-core                3.2.3
graphql-relay               3.2.0
greenlet                    3.0.2
grpcio                      1.60.0
grpcio-health-checking      1.60.0
grpcio-status               1.60.0
h11                         0.14.0
httplib2                    0.22.0
httptools                   0.6.1
humanfriendly               10.0
humanize                    4.9.0
idna                        3.6
isodate                     0.6.1
Jinja2                      3.1.2
jmespath                    1.0.1
kombu                       5.3.4
kubernetes                  28.1.0
Mako                        1.3.0
MarkupSafe                  2.1.3
msal                        1.26.0
msal-extensions             1.1.0
multidict                   6.0.4
numpy                       1.26.2
oauth2client                4.1.3
oauthlib                    3.2.2
packaging                   23.2
pandas                      2.1.4
pendulum                    2.1.2
pip                         23.0.1
portalocker                 2.8.2
prometheus-client           0.19.0
prompt-toolkit              3.0.41
proto-plus                  1.23.0
protobuf                    4.25.1
psycopg2-binary             2.9.9
pyarrow                     14.0.1
pyasn1                      0.5.1
pyasn1-modules              0.3.0
pycparser                   2.21
pydantic                    2.5.2
pydantic_core               2.14.5
PyJWT                       2.8.0
pyparsing                   3.1.1
python-dateutil             2.8.2
python-dotenv               1.0.0
pytz                        2023.3.post1
pytzdata                    2020.1
PyYAML                      6.0.1
redis                       5.0.1
requests                    2.31.0
requests-oauthlib           1.3.1
requests-toolbelt           0.10.1
rsa                         4.9
s3transfer                  0.8.2
setuptools                  65.5.1
six                         1.16.0
sniffio                     1.3.0
SQLAlchemy                  2.0.23
starlette                   0.33.0
tabulate                    0.9.0
tomli                       2.0.1
toposort                    1.10
tornado                     6.4
tqdm                        4.66.1
typing_extensions           4.9.0
tzdata                      2023.3
universal-pathlib           0.1.4
uritemplate                 4.1.1
urllib3                     1.26.18
uvicorn                     0.24.0.post1
uvloop                      0.19.0
vine                        5.1.0
watchdog                    3.0.0
watchfiles                  0.21.0
wcwidth                     0.2.12
websocket-client            1.7.0
websockets                  12.0
wheel                       0.42.0
yarl                        1.9.4

pip freeze

alembic==1.13.0
amqp==5.2.0
aniso8601==9.0.1
annotated-types==0.6.0
anyio==4.1.0
async-timeout==4.0.3
azure-core==1.29.5
azure-identity==1.15.0
azure-storage-blob==12.19.0
azure-storage-file-datalake==12.14.0
backoff==2.2.1
billiard==4.2.0
boto3==1.33.12
botocore==1.33.12
cachetools==5.3.2
celery==5.3.6
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
coloredlogs==14.0
croniter==2.0.1
cryptography==41.0.7
dagster==1.5.12
dagster-aws==0.21.12
dagster-azure==0.21.12
dagster-celery==0.21.12
dagster-celery-k8s==0.21.12
dagster-gcp==0.21.12
dagster-graphql==1.5.12
dagster-k8s==0.21.12
dagster-pandas==0.21.12
dagster-pipes==1.5.12
dagster-postgres==0.21.12
dagster-webserver==1.5.12
db-dtypes==1.1.1
docstring-parser==0.15
exceptiongroup==1.2.0
flower==2.0.1
fsspec==2023.12.2
google-api-core==2.15.0
google-api-python-client==2.110.0
google-auth==2.25.2
google-auth-httplib2==0.1.1
google-cloud-bigquery==3.13.0
google-cloud-core==2.4.1
google-cloud-storage==2.13.0
google-crc32c==1.5.0
google-resumable-media==2.6.0
googleapis-common-protos==1.62.0
gql==3.4.1
graphene==3.3
graphql-core==3.2.3
graphql-relay==3.2.0
greenlet==3.0.2
grpcio==1.60.0
grpcio-health-checking==1.60.0
grpcio-status==1.60.0
h11==0.14.0
httplib2==0.22.0
httptools==0.6.1
humanfriendly==10.0
humanize==4.9.0
idna==3.6
isodate==0.6.1
Jinja2==3.1.2
jmespath==1.0.1
kombu==5.3.4
kubernetes==28.1.0
Mako==1.3.0
MarkupSafe==2.1.3
msal==1.26.0
msal-extensions==1.1.0
multidict==6.0.4
numpy==1.26.2
oauth2client==4.1.3
oauthlib==3.2.2
packaging==23.2
pandas==2.1.4
pendulum==2.1.2
portalocker==2.8.2
prometheus-client==0.19.0
prompt-toolkit==3.0.41
proto-plus==1.23.0
protobuf==4.25.1
psycopg2-binary==2.9.9
pyarrow==14.0.1
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.5.2
pydantic_core==2.14.5
PyJWT==2.8.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
pytzdata==2020.1
PyYAML==6.0.1
redis==5.0.1
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rsa==4.9
s3transfer==0.8.2
six==1.16.0
sniffio==1.3.0
SQLAlchemy==2.0.23
starlette==0.33.0
tabulate==0.9.0
tomli==2.0.1
toposort==1.10
tornado==6.4
tqdm==4.66.1
typing_extensions==4.9.0
tzdata==2023.3
universal-pathlib==0.1.4
uritemplate==4.1.1
urllib3==1.26.18
uvicorn==0.24.0.post1
uvloop==0.19.0
vine==5.1.0
watchdog==3.0.0
watchfiles==0.21.0
wcwidth==0.2.12
websocket-client==1.7.0
websockets==12.0
yarl==1.9.4

1.5.13

pip list

Package                     Version
--------------------------- ------------
alembic                     1.13.0
amqp                        5.2.0
aniso8601                   9.0.1
annotated-types             0.6.0
anyio                       4.1.0
async-timeout               4.0.3
azure-core                  1.29.5
azure-identity              1.15.0
azure-storage-blob          12.19.0
azure-storage-file-datalake 12.14.0
backoff                     2.2.1
billiard                    4.2.0
boto3                       1.34.0
botocore                    1.34.0
cachetools                  5.3.2
celery                      5.3.6
certifi                     2023.11.17
cffi                        1.16.0
charset-normalizer          3.3.2
click                       8.1.7
click-didyoumean            0.3.0
click-plugins               1.1.1
click-repl                  0.3.0
coloredlogs                 14.0
croniter                    2.0.1
cryptography                41.0.7
dagster                     1.5.13
dagster-aws                 0.21.13
dagster-azure               0.21.13
dagster-celery              0.21.13
dagster-celery-k8s          0.21.13
dagster-gcp                 0.21.13
dagster-graphql             1.5.13
dagster-k8s                 0.21.13
dagster-pandas              0.21.13
dagster-pipes               1.5.13
dagster-postgres            0.21.13
dagster-webserver           1.5.13
db-dtypes                   1.2.0
docstring-parser            0.15
exceptiongroup              1.2.0
flower                      2.0.1
fsspec                      2023.12.2
google-api-core             2.15.0
google-api-python-client    2.111.0
google-auth                 2.25.2
google-auth-httplib2        0.2.0
google-cloud-bigquery       3.14.1
google-cloud-core           2.4.1
google-cloud-storage        2.14.0
google-crc32c               1.5.0
google-resumable-media      2.7.0
googleapis-common-protos    1.62.0
gql                         3.4.1
graphene                    3.3
graphql-core                3.2.3
graphql-relay               3.2.0
greenlet                    3.0.2
grpcio                      1.60.0
grpcio-health-checking      1.60.0
h11                         0.14.0
httplib2                    0.22.0
httptools                   0.6.1
humanfriendly               10.0
humanize                    4.9.0
idna                        3.6
isodate                     0.6.1
Jinja2                      3.1.2
jmespath                    1.0.1
kombu                       5.3.4
kubernetes                  28.1.0
Mako                        1.3.0
MarkupSafe                  2.1.3
msal                        1.26.0
msal-extensions             1.1.0
multidict                   6.0.4
numpy                       1.26.2
oauth2client                4.1.3
oauthlib                    3.2.2
packaging                   23.2
pandas                      2.1.4
pendulum                    2.1.2
pip                         23.0.1
portalocker                 2.8.2
prometheus-client           0.19.0
prompt-toolkit              3.0.43
protobuf                    4.25.1
psycopg2-binary             2.9.9
pyarrow                     14.0.1
pyasn1                      0.5.1
pyasn1-modules              0.3.0
pycparser                   2.21
pydantic                    2.5.2
pydantic_core               2.14.5
PyJWT                       2.8.0
pyparsing                   3.1.1
python-dateutil             2.8.2
python-dotenv               1.0.0
pytz                        2023.3.post1
pytzdata                    2020.1
PyYAML                      6.0.1
redis                       5.0.1
requests                    2.31.0
requests-oauthlib           1.3.1
requests-toolbelt           0.10.1
rsa                         4.9
s3transfer                  0.9.0
setuptools                  65.5.1
six                         1.16.0
sniffio                     1.3.0
SQLAlchemy                  2.0.23
starlette                   0.33.0
tabulate                    0.9.0
tomli                       2.0.1
toposort                    1.10
tornado                     6.4
tqdm                        4.66.1
typing_extensions           4.9.0
tzdata                      2023.3
universal-pathlib           0.1.4
uritemplate                 4.1.1
urllib3                     1.26.18
uvicorn                     0.24.0.post1
uvloop                      0.19.0
vine                        5.1.0
watchdog                    3.0.0
watchfiles                  0.21.0
wcwidth                     0.2.12
websocket-client            1.7.0
websockets                  12.0
wheel                       0.42.0
yarl                        1.9.4

pip freeze

alembic==1.13.0
amqp==5.2.0
aniso8601==9.0.1
annotated-types==0.6.0
anyio==4.1.0
async-timeout==4.0.3
azure-core==1.29.5
azure-identity==1.15.0
azure-storage-blob==12.19.0
azure-storage-file-datalake==12.14.0
backoff==2.2.1
billiard==4.2.0
boto3==1.34.0
botocore==1.34.0
cachetools==5.3.2
celery==5.3.6
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
coloredlogs==14.0
croniter==2.0.1
cryptography==41.0.7
dagster==1.5.13
dagster-aws==0.21.13
dagster-azure==0.21.13
dagster-celery==0.21.13
dagster-celery-k8s==0.21.13
dagster-gcp==0.21.13
dagster-graphql==1.5.13
dagster-k8s==0.21.13
dagster-pandas==0.21.13
dagster-pipes==1.5.13
dagster-postgres==0.21.13
dagster-webserver==1.5.13
db-dtypes==1.2.0
docstring-parser==0.15
exceptiongroup==1.2.0
flower==2.0.1
fsspec==2023.12.2
google-api-core==2.15.0
google-api-python-client==2.111.0
google-auth==2.25.2
google-auth-httplib2==0.2.0
google-cloud-bigquery==3.14.1
google-cloud-core==2.4.1
google-cloud-storage==2.14.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
gql==3.4.1
graphene==3.3
graphql-core==3.2.3
graphql-relay==3.2.0
greenlet==3.0.2
grpcio==1.60.0
grpcio-health-checking==1.60.0
h11==0.14.0
httplib2==0.22.0
httptools==0.6.1
humanfriendly==10.0
humanize==4.9.0
idna==3.6
isodate==0.6.1
Jinja2==3.1.2
jmespath==1.0.1
kombu==5.3.4
kubernetes==28.1.0
Mako==1.3.0
MarkupSafe==2.1.3
msal==1.26.0
msal-extensions==1.1.0
multidict==6.0.4
numpy==1.26.2
oauth2client==4.1.3
oauthlib==3.2.2
packaging==23.2
pandas==2.1.4
pendulum==2.1.2
portalocker==2.8.2
prometheus-client==0.19.0
prompt-toolkit==3.0.43
protobuf==4.25.1
psycopg2-binary==2.9.9
pyarrow==14.0.1
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.5.2
pydantic_core==2.14.5
PyJWT==2.8.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
pytzdata==2020.1
PyYAML==6.0.1
redis==5.0.1
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rsa==4.9
s3transfer==0.9.0
six==1.16.0
sniffio==1.3.0
SQLAlchemy==2.0.23
starlette==0.33.0
tabulate==0.9.0
tomli==2.0.1
toposort==1.10
tornado==6.4
tqdm==4.66.1
typing_extensions==4.9.0
tzdata==2023.3
universal-pathlib==0.1.4
uritemplate==4.1.1
urllib3==1.26.18
uvicorn==0.24.0.post1
uvloop==0.19.0
vine==5.1.0
watchdog==3.0.0
watchfiles==0.21.0
wcwidth==0.2.12
websocket-client==1.7.0
websockets==12.0
yarl==1.9.4

alangenfeld commented 11 months ago

Thanks for following up, not much interesting in the dependency changes.
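For anyone who wants to repeat the comparison, here is a minimal sketch (not the tool anyone in this thread used) that diffs two `pip freeze` outputs. The function names are made up for illustration, and the sample strings are short excerpts of the lists posted above. Run against the full lists, it should flag, among others, boto3/botocore 1.33.12 -> 1.34.0, s3transfer 0.8.2 -> 0.9.0, several google-* packages, and grpcio-status and proto-plus being absent from the 1.5.13 image.

```python
def parse_freeze(text):
    """Parse `pip freeze` output into a {package: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a plain `name==version` pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.lower()] = version
    return pins


def diff_freeze(old_text, new_text):
    """Return (changed, removed, added) between two freeze outputs."""
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    changed = {n: (old[n], new[n]) for n in old.keys() & new.keys() if old[n] != new[n]}
    removed = {n: old[n] for n in old.keys() - new.keys()}
    added = {n: new[n] for n in new.keys() - old.keys()}
    return changed, removed, added


# Excerpts of the two freeze outputs posted above (not the full lists).
OLD_FREEZE = """\
boto3==1.33.12
grpcio==1.60.0
grpcio-status==1.60.0
s3transfer==0.8.2
"""
NEW_FREEZE = """\
boto3==1.34.0
grpcio==1.60.0
s3transfer==0.9.0
"""

changed, removed, added = diff_freeze(OLD_FREEZE, NEW_FREEZE)
for name in sorted(changed):
    print(f"changed  {name}: {changed[name][0]} -> {changed[name][1]}")
for name in sorted(removed):
    print(f"removed  {name}: {removed[name]}")
for name in sorted(added):
    print(f"added    {name}: {added[name]}")
```

Package names are lowercased before comparison so that `PyYAML` and `pyyaml` are treated as the same package.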

I spent some time with memray looking for leaks and have so far not been able to turn anything up.

Do you have anything like automated recurring queries against the webserver?

aaaaahaaaaa commented 11 months ago

Do you have anything like automated recurring queries against the webserver?

Well only the readinessProbe from your chart.

Turns out we actually still observe the same behaviour after rolling back to 1.5.12. So it's not related to the new version. I'm puzzled now. I'll try to investigate further and close the issue.

alangenfeld commented 11 months ago

I've had luck using this tool to get a memory profile of a running process https://github.com/facebookarchive/memory-analyzer and this https://github.com/kmaork/madbg for interactive poking around at the active process. I believe these both need SYS_PTRACE capabilities given on the k8s pod spec.
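If attaching an external tool is not an option, a standard-library alternative is `tracemalloc`. The sketch below is only an illustration of the snapshot-comparison technique, not what was used here: the `top_growth` helper is hypothetical, and the list allocation stands in for real webserver traffic. Note that `tracemalloc` only sees allocations made through Python's allocators, so a leak inside a native extension would most likely not show up.

```python
import tracemalloc


def top_growth(before, after, limit=10):
    """Return up to `limit` source lines whose allocated size grew between snapshots."""
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
baseline = tracemalloc.take_snapshot()

# Stand-in for real traffic: in a server, take the second snapshot
# after the process has been handling requests for a while.
retained = [bytes(1024) for _ in range(1000)]

current = tracemalloc.take_snapshot()
tracemalloc.stop()

for stat in top_growth(baseline, current, limit=5):
    print(stat)
```

In a long-running process the two snapshots would typically be taken minutes or hours apart, and the lines that keep growing across repeated comparisons are the ones worth a closer look.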

Given it's a webserver, it's also susceptible to the "type 3" leaks described here https://blog.nelhage.com/post/three-kinds-of-leaks/ (Python allocator arena fragmentation), but the very smooth gradient of your graphs makes me skeptical that's the cause without some sort of recurring large query causing the fragmentation.

jvyoralek commented 10 months ago

@aaaaahaaaaa did you find any reason why memory started growing? We have a similar issue and switching between versions didn't help yet - tried from 1.5.14 to 1.5.12.

The memory increase is quite noticeable, showing up even in daily granularity.

This issue seems to be isolated to the webserver component. Both the daemon and code servers are exhibiting stable memory usage. We are operating these as three separate containers within AWS ECS.

We have only one scheduled job active, and no sensors or auto-materialization so far. Assets are loaded from dbt.

SCR-20240119-iqbx

aaaaahaaaaa commented 10 months ago

@jvyoralek No I didn't find the source of the problem and the issue is still occurring for us as well. Unfortunately I didn't have time to investigate further. I think there's clearly something up with the workload, we're not doing anything special either aside from deploying the helm chart.

salazarm commented 10 months ago

@alangenfeld found a memory leak that could be the cause of this, I'll let him comment but here is the PR that attempts to fix it https://github.com/dagster-io/dagster/pull/19298

alangenfeld commented 10 months ago

https://github.com/dagster-io/dagster/pull/19298 is a fix for a problem that manifests as very rapid unbounded memory growth resulting in process termination. I don't believe it's related to this slower memory growth.

noam-jacobson commented 10 months ago

I appear to have a similar problem after upgrading to 1.6. I run Dagster on AWS ECS using Fargate, so I don't believe my jobs are causing it, since the code runs on a separate task. Memory usage of both the Daemon and the Dagit/webserver services is slowly creeping up. The drops in the following chart are due to restarts. Before the upgrade to 1.6 on the 11th this problem didn't exist.

image

alangenfeld commented 10 months ago

@noam-jacobson what version were you upgrading from?

noam-jacobson commented 10 months ago

@noam-jacobson what version were you upgrading from?

I was on version 1.5.10

jackwillisupside commented 10 months ago

@noam-jacobson We're having the same issue on ECS/Fargate on 1.5.7

will-regal-voice commented 10 months ago

We are also having the same issue on 1.6.0, also ECS/Fargate

gasgallo commented 10 months ago

Same here in our k8s deployment cluster. Any clue?

jackwillisupside commented 9 months ago

We think we might? have solved it on our end -- we didn't have a strict retention policy on logs set in our dagster.yml and once we set it to below our memory stopped growing:

retention:
  schedule:
    purge_after_days: 90 # sets retention policy for schedule ticks of all types
  sensor:
    purge_after_days:
      skipped: 7
      failure: 90
      success: 365

gasgallo commented 9 months ago

We think we might? have solved it on our end -- we didn't have a strict retention policy on logs set in our dagster.yml and once we set it to below our memory stopped growing:

retention:
  schedule:
    purge_after_days: 90 # sets retention policy for schedule ticks of all types
  sensor:
    purge_after_days:
      skipped: 7
      failure: 90
      success: 365

How did that impact your memory usage? Technically you'll still retain ticks for up to 365 days, thus you should not see a change in behavior in just a few days. Or did I miss something?

I've applied a similar setting on my deployment as well (way stricter than yours, for testing) and my memory is still going up, same as before.

alexknorr commented 9 months ago

Same problem here on OpenShift with nearly the same packages (dagster 1.6.5), also PostgreSQL and slim-buster images on both daemon and dagster-webserver (separate pods). Tried with Python 3.10 and 3.11, and with sqlalchemy<2.0 and >2.0; no luck so far, it crashes every 3-4 days. Currently trying with Python 3.12, dagster 1.6.6 and slim-bookworm; will know more in the next few days...

stasharrofi commented 9 months ago

EDIT: We found out that the following is actually not working. The initial indication might have just been a fluke.

~We were having this issue and I believe that we have found the root cause to be a bug in anyio which leaked processes. The bug was introduced in 4.1.0 and fixed in 4.3.0 (last week): https://github.com/agronholm/anyio/issues/669~

~Dagster has a dependency on anyio through the following chain: dagit --> dagster-webserver --> starlette --> anyio and I believe that this issue started to appear for people whenever they rebuilt their Dagster image during the time that bug was present because a newer but buggy version of anyio would have been included in their docker image.~

~So, the solution could be to either explicitly require anyio >= 4.3.0 or to wait until people rebuild their docker images and automatically get the bug-fixed version.~

jvyoralek commented 9 months ago

Has anyone had success with the solution recommended by @stasharrofi ?

We have made changes, but it appears that the memory usage is still increasing.

image

I see anyio 4.3 in log

#12 1.757 Collecting dagster==1.6.6
#12 1.810   Downloading dagster-1.6.6-py3-none-any.whl (1.4 MB)
#12 1.852      ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā” 1.4/1.4 MB 36.1 MB/s eta 0:00:00
#12 2.037 Collecting dagster-aws==0.22.6
#12 2.042   Downloading dagster_aws-0.22.6-py3-none-any.whl (109 kB)
#12 2.048      ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā”ā” 109.8/109.8 kB 32.6 MB/s eta 0:00:00
#12 2.214 Collecting dagster-postgres==0.22.6
#12 2.219   Downloading dagster_postgres-0.22.6-py3-none-any.whl (20 kB)
#12 2.259 Collecting anyio==4.3.0
#12 2.263   Downloading anyio-4.3.0-py3-none-any.whl (85 kB)

noam-jacobson commented 9 months ago

@jvyoralek It hasn't worked for me. Deployed the newest Dagster version 1.6.6 with anyio-4.3.0.

stasharrofi commented 9 months ago

@jvyoralek : No, we found out that it's not working for us either. The initial indication that it was working was probably just a fluke.

shivonchain commented 8 months ago

Same issue here with an ECS deployment, packages and versions included below

image

dagster==1.6.10
dagster-graphql==1.6.10
dagster-webserver==1.6.10
dagster-postgres==0.22.10
dagster-docker==0.22.10

jobicarter commented 7 months ago

My team experienced this issue in an OSS ECS deployment after an upgrade from 1.5.9 -> 1.6.8. It impacted the dagit/webserver and daemon services, but not independent grpc/code location services. It presented as a slow leak that would increase memory utilization over a week or so until hitting critical thresholds / crashing the service, with 1gb memory allocated to services.
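To put a number on a slow leak like this, a rough estimate of the time left before the memory limit can be computed from a few readings. This is a back-of-the-envelope sketch that assumes linear growth; the sample values are hypothetical and not taken from our metrics, and the function names are made up for illustration.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours, MiB) samples, in MiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def hours_until_limit(samples, limit_mib):
    """Hours until `limit_mib` is reached, or None if memory is not growing."""
    slope = growth_per_hour(samples)
    if slope <= 0:
        return None
    latest_mib = samples[-1][1]
    return max(0.0, (limit_mib - latest_mib) / slope)


# Hypothetical daily readings: (hours since restart, resident memory in MiB).
samples = [(0, 300.0), (24, 396.0), (48, 492.0), (72, 588.0)]
print(f"growth: {growth_per_hour(samples):.1f} MiB/hour")
print(f"hours until a 1024 MiB limit: {hours_until_limit(samples, 1024):.0f}")
```

With these made-up readings the fit gives 4 MiB per hour and about 109 hours of headroom, which is the same order of magnitude as the "week or so" pattern described above.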

We "resolved" the issue in our environments by downgrading and pinning the grpcio python package to 1.57.0.

In incremental tests we downgraded our docker image base to the image version/sha we used for our 1.5.9 deployment, reverted dagster packages from 1.6.8 back to 1.5.9, and updated python from 3.10 -> 3.11. None of these changes resolved the memory leak.

Sharing this context as it supports root cause being related to an unpinned package dependency, and not necessarily an issue with the core dagster packages. It also ruled out interaction with OS libs/OS version causing the leak.

We selected grpcio 1.57.0 because it was the version of the dep that was solved for at the time when we originally deployed 1.5.9. It's possible a more recent version would work as well.
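Since the problem seems to come from whatever version the resolver picks at image build time, it may help to verify the pin inside the running container. Below is a minimal sketch using the standard-library `importlib.metadata`; the `check_pins` helper and the `EXPECTED` mapping are hypothetical names for illustration, and 1.57.0 is simply the version that worked for us.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# Assumed pin, taken from the workaround described above.
EXPECTED = {"grpcio": "1.57.0"}

for package, (wanted, found) in check_pins(EXPECTED).items():
    print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

Running this at container start, or in CI against the built image, would catch the case where a rebuild silently pulls in a newer grpcio.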

jvyoralek commented 7 months ago

Thank you, @jobicarter, for the effective workaround. We deployed it yesterday, and although it's only been a short time, we're already seeing promising changes.

Tested with these versions:

dagster==1.7.0
dagster-webserver==1.7.0
dagster-graphql==1.7.0
dagster-aws==0.23.0
dagster-postgres==0.23.0
grpcio==1.57.0
image

csomh commented 7 months ago

I can confirm that downgrading grpcio to 1.57.0 stops the leak.

dagster==1.5.14
dagster-aws==0.21.14
dagster-azure==0.21.14
dagster-celery==0.21.14
dagster-celery-k8s==0.21.14
dagster-gcp==0.21.14
dagster-graphql==1.5.14
dagster-k8s==0.21.14
dagster-pandas==0.21.14
dagster-pipes==1.5.14
dagster-postgres==0.21.14
dagster-webserver==1.5.14
grpcio==1.57.0
grpcio-health-checking==1.57.0

We also did try to upgrade it to 1.62.1, but that didn't seem to work.

G14rb commented 7 months ago

Thanks for the solution, I think this could be related to the dagster issue, https://github.com/grpc/grpc/issues/36117

p-y-t-h-e-c commented 6 months ago

Hi all, we're having a similar issue with a Dagster Docker deployment on an Oracle VM. Unfortunately, downgrading grpcio to 1.57.0 hasn't resolved it. We're currently using the following setup for the Dagster image.

Screenshot 2024-05-14 134923

The VM seems to reach an OOM state roughly every 8 hours now.

rensoostenbachBL commented 4 months ago

We are running into the same issue on our Kubernetes cluster, having installed Dagster via the Helm chart.

Is the solution to downgrade grpcio for the dagster-webserver pod? In that case, we should build a custom Dockerfile that changes the dependencies and point to that Dockerfile in the Helm chart right?

I don't understand why Dagster hasn't pinned the grpcio version themselves to prevent this issue from happening, it seems a little strange that they are expecting users to either live with the memory leak, or manually fix the dependencies themselves.

JanEgner commented 2 months ago

Just to add my 2 cents: running dagster 1.7.16/dbt/dagster-webserver all in one k8s pod.

image

I admit that it is somewhat inconclusive since some memory increase (but also a kind of garbage collection releasing much of the extra memory at a point) was visible before the last restart while using grpcio 1.57.0. Still, overall it looks way better than with grpcio 1.60.

It seems to be a workaround for now, but with at least two drawbacks (other than using an outdated component at all):

bolinzzz commented 1 month ago

We started noticing memory leaks in certain code locations after upgrading to Dagster 1.8. Could grpcio potentially be contributing to these leaks?

We're still investigating, but I'd like to rule out this possibility.

babaMar commented 3 weeks ago

We're observing the same behavior deploying Dagster via Helm on K8s cluster. Building a custom image and downgrading grpcio seems like a step back to be honest.

auguste-elax commented 2 weeks ago

hi there 👋 I'm also experiencing what looks like memory leaks specifically on the webserver and daemon (running dagster 1.7.7 on k8s with helm). Has a solution been found apart from downgrading and pinning the grpcio version?