axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0
7.89k stars 869 forks

When running in a Python virtual environment, I get a segmentation fault (core dumped) error. #1896

Closed: KeonwooChoi closed this issue 2 months ago

KeonwooChoi commented 2 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

After creating a virtual environment with 'python -m venv axolotl' and installing axolotl as shown below, it should import and run normally.

git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl

pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'

Importing the top-level package works:

(axolotl) jovyan@cheetah-6b69737465706178-6lq5hu-75d9897cb9-42h5d:~/data$ python
Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import faulthandler; faulthandler.enable()
>>> import axolotl
>>> axolotl
<module 'axolotl' from '/home/jovyan/data/axolotl_code/axolotl/src/axolotl/__init__.py'>

Current behaviour

When I import from 'axolotl.cli', I get the following error:

>>> from axolotl.cli import train
Fatal Python error: Segmentation fault

Current thread 0x00007fe6e1513480 (most recent call first):
  File "/home/jovyan_venv/.venv/axolotl/lib/python3.11/site-packages/torch/jit/_script.py", line 1399 in script
  File "/home/jovyan/data/axolotl_code/axolotl/src/axolotl/monkeypatch/utils.py", line 15 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 940 in exec_module
  File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
  File "/home/jovyan/data/axolotl_code/axolotl/src/axolotl/monkeypatch/multipack.py", line 10 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 940 in exec_module
  File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
  File "/home/jovyan/data/axolotl_code/axolotl/src/axolotl/utils/models.py", line 42 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 940 in exec_module
  File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
  File "/home/jovyan/data/axolotl_code/axolotl/src/axolotl/common/cli.py", line 12 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 940 in exec_module
  File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
  File "/home/jovyan/data/axolotl_code/axolotl/src/axolotl/cli/__init__.py", line 29 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 940 in exec_module
  File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
  File "<stdin>", line 1 in <module>

Extension modules: zstandard.backend_c, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, yaml._yaml, sentencepiece._sentencepiece, PIL._imaging, PIL._imagingft, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json 
(total: 86)
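
The traceback above ends inside torch/jit/_script.py, reached while importing axolotl.monkeypatch.utils, so one way to narrow things down is to run each import in a fresh interpreter and see which one dies. The following is only a minimal sketch using the standard library: the helper name crashes_with_segfault is made up for illustration, and the return-code check assumes a POSIX system, where a child killed by a signal reports a negative return code.

```python
import signal
import subprocess
import sys


def crashes_with_segfault(statement: str) -> tuple[bool, str]:
    """Run `statement` in a fresh interpreter with faulthandler enabled.

    Returns (crashed, stderr). `crashed` is True only when the child
    process was terminated by SIGSEGV (POSIX convention: the return
    code is the negated signal number).
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", statement],
        capture_output=True,
        text=True,
    )
    crashed = proc.returncode == -signal.SIGSEGV
    return crashed, proc.stderr


# Hypothetical usage inside the affected environment:
#   for stmt in ("import torch", "from axolotl.cli import train"):
#       crashed, stderr = crashes_with_segfault(stmt)
#       print(stmt, "-> segfault" if crashed else "-> no segfault")
#       if crashed:
#           print(stderr)
```

Because each statement runs in its own process, a crash in one import does not take down the script doing the checking, and the faulthandler dump for the failing import ends up in the returned stderr.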

Steps to reproduce

  1. python -m axolotl.cli.train

Config yaml

No response

Possible solution

I'm using a provided Jupyter container with CUDA 12.1 and torch 2.3.1. It could be a conflict between Python modules, but I'm not sure how to resolve it. Do you have any ideas?
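
For comparing a failing environment with a working one, a small report of the interpreter and package versions can help. This is a rough sketch using only the standard library; the function name describe_environment and the default package list are assumptions chosen to match the install command in this report. It reads installed versions from package metadata without importing the packages, so it should not trigger the crash itself. Note that the session above shows Python 3.11.0rc1, a release candidate; whether that is related to the segfault is not established here, but the sketch surfaces the release level so it is easy to spot.

```python
import sys
from importlib import metadata


def describe_environment(packages=("torch", "flash-attn", "deepspeed")):
    """Collect interpreter and package details without importing the packages."""
    info = {
        # e.g. "3.11.0rc1" for a release candidate build
        "python": sys.version.split()[0],
        # one of "alpha", "beta", "candidate", "final"
        "release_level": sys.version_info.releaselevel,
        # True inside a venv/virtualenv; conda environments report False here
        "in_venv": sys.prefix != sys.base_prefix,
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment
            info["packages"][name] = None
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it once in the venv and once in a working environment gives two reports that can be compared line by line.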

Which Operating Systems are you using?

Python Version

3.11

axolotl branch-commit

main/dca1fe4

Acknowledgements

KeonwooChoi commented 2 months ago

I solved this issue by using a conda virtual environment instead.