Training error - Githubissues

Hi, I'm trying to kick off LoRA training on a fresh install, following the install guide at https://github.com/InternLM/InternLM-XComposer/blob/main/docs/install.md and I hope I'm just overlooking something simple. I've included my accelerate config below. I've seen this error reported elsewhere, and it feels like a package version issue. There is no requirements.txt that mirrors your exact training environment, so it would also help if someone could post a pip list from a working training venv. Any advice would be welcome. Thanks!

$ sh finetune_lora.sh
Traceback (most recent call last):
  File "/mnt/e/Projects/InternLM-XComposer/finetune/finetune.py", line 9, in <module>
    from accelerate.utils import DistributedType
ModuleNotFoundError: No module named 'accelerate'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 202584) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/remote/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/remote/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/remote/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/remote/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/remote/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/remote/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-21_14:59:18
  host      : MyPC.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 202584)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

$ accelerate env

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.29.3
- Platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- `accelerate` bash location: /home/remote/anaconda3/envs/intern_clean/bin/accelerate
- Python version: 3.9.19
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 78.48 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - debug: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'deepspeed_config_file': '/mnt/e/Projects/InternLM-XComposer/finetune/ds_config_zero2.json', 'zero3_init_flag': False}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

$ pip list
Package            Version
------------------ ------------
accelerate         0.29.3
aiohttp            3.9.5
aiosignal          1.3.1
annotated-types    0.6.0
async-timeout      4.0.3
attrs              23.2.0
auto_gptq          0.7.1
certifi            2022.12.7
charset-normalizer 2.1.1
cmake              3.25.0
datasets           2.19.0
deepspeed          0.14.1
dill               0.3.8
filelock           3.9.0
frozenlist         1.4.1
fsspec             2024.3.1
gekko              1.1.1
hjson              3.1.0
huggingface-hub    0.22.2
idna               3.4
Jinja2             3.1.2
lit                15.0.7
MarkupSafe         2.1.3
mpmath             1.3.0
multidict          6.0.5
multiprocess       0.70.16
networkx           3.2.1
ninja              1.11.1.1
numpy              1.24.1
packaging          24.0
pandas             2.2.2
peft               0.10.0
pillow             10.2.0
pip                24.0
psutil             5.9.8
py-cpuinfo         9.0.0
pyarrow            16.0.0
pyarrow-hotfix     0.6
pydantic           2.7.0
pydantic_core      2.18.1
pynvml             11.5.0
python-dateutil    2.9.0.post0
pytz               2024.1
PyYAML             6.0.1
regex              2024.4.16
requests           2.28.1
rouge              1.0.1
safetensors        0.4.3
sentencepiece      0.2.0
setuptools         68.2.2
six                1.16.0
sympy              1.12
tokenizers         0.19.1
torch              2.0.1+cu117
torchaudio         2.0.2+cu117
torchvision        0.15.2+cu117
tqdm               4.66.2
transformers       4.40.0
triton             2.0.0
typing_extensions  4.8.0
tzdata             2024.1
urllib3            1.26.13
wheel              0.41.2
xxhash             3.4.1
yarl               1.9.4


$ python
Python 3.9.19 (main, Mar 21 2024, 17:11:28)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
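
One thing that stands out in the output above: the traceback runs /usr/bin/python3 with torchrun from /home/remote/.local/bin (Python 3.10 paths), while `accelerate env` reports the conda env intern_clean on Python 3.9.19. That suggests the launcher may be using a different interpreter than the one where accelerate is installed, though this is an inference from the paths rather than a confirmed cause. A minimal sketch for checking this; the function name `describe_environment` is hypothetical and not part of the repository:

```python
import importlib.util
import shutil
import sys


def describe_environment(module_name="accelerate", launcher="torchrun"):
    """Report the running interpreter and whether a module is importable from it."""
    # find_spec returns None when a top-level module cannot be found
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "module_found": spec is not None,
        "module_location": getattr(spec, "origin", None),
        # First match for the launcher on PATH, or None if it is not on PATH
        "launcher_on_path": shutil.which(launcher),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it once with /usr/bin/python3 and once with the conda env's python should show whether the two interpreters disagree about accelerate, and which torchrun comes first on PATH.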

Thanks for the swift response! I got training working after some tinkering last night, with help from a "phone a friend" who iterated through it with me. I hope this helps others; please close this one out. The following has been repeatable so far on both native Linux and WSL:

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.33.2 timm==0.4.12 sentencepiece==0.1.99 gradio==4.13.0 markdown2==2.4.10 xlsxwriter==3.1.2 einops
pip install deepspeed peft
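
To confirm that the active environment actually matches the pins in the commands above, something like the following sketch could be run afterwards. It assumes Python 3.8+ for `importlib.metadata`; the `PINNED` dict is copied from the pip commands, and `check_pins` is a hypothetical helper name. Local build tags such as `+cu117` are ignored in the comparison, since torchaudio is pinned without one.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip install commands above
PINNED = {
    "torch": "1.13.1+cu117",
    "torchvision": "0.14.1+cu117",
    "torchaudio": "0.13.1",
    "transformers": "4.33.2",
    "timm": "0.4.12",
    "sentencepiece": "0.1.99",
    "gradio": "4.13.0",
    "markdown2": "2.4.10",
    "xlsxwriter": "3.1.2",
}


def check_pins(pinned, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        # Compare without the local build tag, so "0.13.1+cu117" matches "0.13.1"
        if installed is None or installed.split("+")[0] != expected.split("+")[0]:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

No output means every pinned package is installed at the expected version for the interpreter that ran the script.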

Here are my current package versions:


$ pip3 list
Package                   Version
------------------------- ------------
accelerate                0.29.3
aiofiles                  23.2.1
aiohttp                   3.9.5
aiosignal                 1.3.1
altair                    5.3.0
annotated-types           0.6.0
anyio                     4.3.0
async-timeout             4.0.3
attrs                     23.2.0
auto_gptq                 0.7.1
certifi                   2024.2.2
charset-normalizer        3.3.2
click                     8.1.7
cmake                     3.25.0
contourpy                 1.2.1
cycler                    0.12.1
datasets                  2.19.0
deepspeed                 0.14.1
dill                      0.3.8
einops                    0.7.0
exceptiongroup            1.2.1
fastapi                   0.110.2
ffmpy                     0.3.2
filelock                  3.13.4
fonttools                 4.51.0
frozenlist                1.4.1
fsspec                    2024.3.1
gekko                     1.1.1
gradio                    4.13.0
gradio_client             0.8.0
h11                       0.14.0
hjson                     3.1.0
httpcore                  1.0.5
httpx                     0.27.0
huggingface-hub           0.22.2
idna                      3.7
importlib_resources       6.4.0
Jinja2                    3.1.3
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
kiwisolver                1.4.5
lit                       15.0.7
markdown-it-py            3.0.0
markdown2                 2.4.10
MarkupSafe                2.1.5
matplotlib                3.8.4
mdurl                     0.1.2
mpmath                    1.3.0
multidict                 6.0.5
multiprocess              0.70.16
networkx                  3.2.1
ninja                     1.11.1.1
numpy                     1.26.4
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.19.3
nvidia-nvjitlink-cu12     12.4.127
nvidia-nvtx-cu12          12.1.105
orjson                    3.10.1
packaging                 24.0
pandas                    2.2.2
peft                      0.10.0
pillow                    10.3.0
pip                       24.0
psutil                    5.9.8
py-cpuinfo                9.0.0
pyarrow                   16.0.0
pyarrow-hotfix            0.6
pydantic                  2.7.0
pydantic_core             2.18.1
pydub                     0.25.1
Pygments                  2.17.2
pynvml                    11.5.0
pyparsing                 3.1.2
python-dateutil           2.9.0.post0
python-multipart          0.0.9
pytz                      2024.1
PyYAML                    6.0.1
referencing               0.34.0
regex                     2024.4.16
requests                  2.31.0
rich                      13.7.1
rouge                     1.0.1
rpds-py                   0.18.0
safetensors               0.4.3
semantic-version          2.10.0
sentencepiece             0.1.99
setuptools                68.2.2
shellingham               1.5.4
six                       1.16.0
sniffio                   1.3.1
starlette                 0.37.2
sympy                     1.12
timm                      0.4.12
tokenizers                0.13.3
tomlkit                   0.12.0
toolz                     0.12.1
torch                     1.13.1+cu117
torchaudio                0.13.1+cu117
torchvision               0.14.1+cu117
tqdm                      4.66.2
transformers              4.33.2
triton                    2.2.0
typer                     0.12.3
typing_extensions         4.11.0
tzdata                    2024.1
urllib3                   2.2.1
uvicorn                   0.29.0
websockets                11.0.3
wheel                     0.41.2
XlsxWriter                3.1.2
xxhash                    3.4.1
yarl                      1.9.4
zipp                      3.18.1
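
Since the repository had no requirements.txt to mirror, a listing like the one above can also be turned into pinned `name==version` lines. The sketch below does roughly what `pip freeze` prints, using only the standard library; `freeze_lines` is a hypothetical helper name, and this is an illustration rather than an official requirements file for the project.

```python
from importlib.metadata import distributions


def freeze_lines(dists=None):
    """Return sorted 'name==version' lines for the given distributions."""
    if dists is None:
        # Default to every distribution visible to the running interpreter
        dists = distributions()
    # A set removes duplicates when the same distribution is found twice on sys.path
    pairs = {(d.metadata["Name"], d.version) for d in dists if d.metadata["Name"]}
    return [f"{name}=={ver}" for name, ver in sorted(pairs, key=lambda p: p[0].lower())]


if __name__ == "__main__":
    print("\n".join(freeze_lines()))
```

Versions with a local tag such as `+cu117` will still need the `--extra-index-url` from the install commands when the file is installed elsewhere.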

InternLM / InternLM-XComposer

Training error #284