flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[BUG] Core dump when running kf-pytorch plugin tests on python 3.12 #5020

Open eapolinario opened 6 months ago

eapolinario commented 6 months ago

Describe the bug

As the title says, the tests for the kf-pytorch plugin crash with a core dump:

platform linux -- Python 3.12.2, pytest-8.0.2, pluggy-1.4.0
rootdir: /home/runner/work/flytekit/flytekit
configfile: pyproject.toml
plugins: mock-3.12.0, timeout-2.2.0, xdist-3.5.0, asyncio-0.23.5, cov-4.1.0, hypothesis-6.98.17
asyncio: mode=Mode.STRICT
collected 16 items

Fatal Python error: Segmentation fault

Current thread 0x00007ffabc917b80 (most recent call first):
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 253 in create_backend
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 258 in create_handler
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 238 in launch_agent
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135 in __call__
  File "/home/runner/work/flytekit/flytekit/plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py", line 393 in _execute
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/exceptions/scopes.py", line 219 in user_entry_point
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/exceptions/scopes.py", line 143 in f
  File "/home/runner/work/flytekit/flytekit/plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py", line 428 in execute
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/core/base_task.py", line 675 in dispatch_execute
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/core/base_task.py", line 388 in sandbox_execute
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/core/base_task.py", line 308 in local_execute
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/core/promise.py", line 1195 in flyte_entity_call_handler
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/core/base_task.py", line 331 in __call__
  File "/home/runner/work/flytekit/flytekit/plugins/flytekit-kf-pytorch/tests/test_elastic_task.py", line 57 in wf
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/exceptions/scopes.py", line 212 in user_entry_point
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/exceptions/scopes.py", line 143 in f
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/core/workflow.py", line 796 in execute
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/core/workflow.py", line 312 in local_execute
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/core/promise.py", line 1203 in flyte_entity_call_handler
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/flytekit/core/workflow.py", line 287 in __call__
  File "/home/runner/work/flytekit/flytekit/plugins/flytekit-kf-pytorch/tests/test_elastic_task.py", line 59 in test_end_to_end
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/python.py", line 1831 in runtest
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/runner.py", line 170 in pytest_runtest_call
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/runner.py", line 263 in <lambda>
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/runner.py", line 342 in from_call
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/runner.py", line 262 in call_runtest_hook
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/runner.py", line 223 in call_and_report
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/runner.py", line 134 in runtestprotocol
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/runner.py", line 115 in pytest_runtest_protocol
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/main.py", line 352 in pytest_runtestloop
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/main.py", line 327 in _main
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/main.py", line 273 in wrap_session
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/main.py", line 320 in pytest_cmdline_main
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/_pytest/config/__init__.py", line 198 in console_main
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/pytest/__main__.py", line 7 in <module>
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/coverage/execfile.py", line 211 in run
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/coverage/cmdline.py", line 858 in do_run
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/coverage/cmdline.py", line 681 in command_line
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/coverage/cmdline.py", line 970 in main
  File "/opt/hostedtoolcache/Python/3.12.2/x64/bin/coverage", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, google._upb._message, yaml._yaml, charset_normalizer.md, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, grpc._cython.cygrpc, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist (total: 74)
/home/runner/work/_temp/523b2bb8-cd9d-401f-8433-7901fde77654.sh: line 8:  1979 Segmentation fault      (core dumped) coverage run -m pytest tests --cov=./ --cov-report=xml --cov-append
tests/test_elastic_task.py::test_end_to_end[spawn]

Expected behavior

The tests run to completion on Python 3.12.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

dosubot[bot] commented 6 months ago

Hey @eapolinario, fancy seeing you here again! Hope you're doing well despite this tricky bug.

Based on the information you've provided, it seems like you're experiencing a segmentation fault when running the test_end_to_end function from the test_elastic_task.py file. Segmentation faults are usually caused by a program trying to read or write an illegal memory location.
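For context, the "Fatal Python error: Segmentation fault" header and the Python-level traceback in the report are in the format produced by Python's standard `faulthandler` module, which pytest enables by default. Outside pytest, the same kind of dump can be captured by enabling it manually. The sketch below is only an illustration: the helper name `enable_crash_traces` and the temporary log file are assumptions, not part of flytekit or the plugin.

```python
import faulthandler
import tempfile


def enable_crash_traces(path=None):
    """Enable faulthandler and return the open file it writes to.

    Hypothetical helper for illustration. The file must stay open for as
    long as faulthandler is enabled, so the caller keeps the handle.
    """
    if path is None:
        # Assumption: a throwaway log file is acceptable for debugging.
        handle = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False)
    else:
        handle = open(path, "w")
    # Dump the Python traceback of every thread on SIGSEGV, SIGFPE, SIGABRT, etc.
    faulthandler.enable(file=handle, all_threads=True)
    return handle


crash_log = enable_crash_traces()
print(f"fatal-signal tracebacks will be written to {crash_log.name}")
```

This shows only where the interpreter was when the signal arrived. The crash itself happens in native code below `_call_store`, so a native debugger would be needed to see the C++ frames.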

Given that the error occurs in the c10d_rendezvous_backend.py file, it's possible that the issue might be related to PyTorch's distributed package. This package is used for elastic training, which allows the training process to be resilient to failures and resource changes.

However, without more specific information about the error, it's hard to pinpoint the exact cause. It would be helpful if you could provide more details about the error message, such as the line number where the error occurs and the full stack trace.

In the meantime, you might want to check the versions of your dependencies. The test_end_to_end function requires several dependencies to run successfully, including Python 3.11, pytest-8.0.2, pluggy-1.4.0, and others. If you're using Python 3.12.2, it's possible that there might be some compatibility issues.
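One way to compare environments is to print the interpreter and package versions in both the failing and a working setup. The following is a minimal sketch using only the standard library. The function name `report_versions` is hypothetical, and the package list is an assumption taken from names that appear in the traceback.

```python
import sys
from importlib import metadata


def report_versions(packages):
    """Return a dict mapping 'python' and each package name to its version.

    Packages that are not installed map to None.
    """
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of distributions worth comparing; adjust as needed.
    for name, version in report_versions(["torch", "flytekit", "pytest", "pluggy"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this under both Python 3.11 and 3.12 would show whether the two environments resolved different `torch` builds, which is a plausible but unconfirmed source of the difference.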

You could try downgrading your Python version to 3.11 and see if the error still occurs. If it does, please provide more details about the error so we can further investigate the issue.

For more information about the test_end_to_end function and its dependencies, you can refer to the .github/workflows/end2end.yml file in the Flyte repository. This file contains the workflow for running end-to-end tests, including the test_end_to_end function.

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

#### About Dosu This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.