Closed yilunzhao closed 1 year ago
Seems like there is an error from CI. I've seen this before; check here to see if it's useful.
Hi @niansong1996, sorry for the late reply. I have resolved the CI error. It seems that I have to change the transformers version in requirements.txt to avoid the error.
That is okay. What we can do is use this branch to evaluate the new models before we decide to upgrade the transformers version in the main branch.
@yilunzhao I am getting the following error when testing LLaMA:
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:46, unhandled cuda error, NCCL version 2.10.3
The command I ran is:
python finetuning/trainer.py validate --config finetuning/training_configs/few_shot/spider.yaml --model.beam_size 1 --data.val_max_instances 1 --data.val_batch_size 1 --model.print_generation_results true --model.print_eval_every_n_batches 1 --model.init_args.transformer_model_name decapoda-research/llama-7b-hf --data.init_args.transformer_model_name decapoda-research/llama-7b-hf --trainer.devices 2
Now if I use one GPU, I will get this error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
Can you see if you can replicate those errors and figure out why they are happening?
Hi @niansong1996, I think this error is raised because the installed torch is incompatible with the CUDA version on ziva. Could you please try reinstalling torch with pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116 and see if it resolves the issue?
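As a quick way to double-check a pin like the ones above, the CUDA build tag baked into a pip requirement (the `+cu116` local version suffix) can be extracted and compared against the wheel index you point `--extra-index-url` at. This is a small hypothetical helper, not part of the repo:

```python
import re
from typing import Optional

def cuda_tag(pin: str) -> Optional[str]:
    """Extract the CUDA/ROCm build tag (e.g. 'cu116') from a pip pin
    like 'torch==1.12.1+cu116'; return None if the pin has no local tag."""
    m = re.search(r"\+((?:cu|rocm)[\w.]+)$", pin)
    return m.group(1) if m else None

# The three pins from the suggested install command; note torchaudio
# carries no local tag in the command itself (the index resolves it).
pins = [
    "torch==1.12.1+cu116",
    "torchvision==0.13.1+cu116",
    "torchaudio==0.12.1",
]
print([cuda_tag(p) for p in pins])  # → ['cu116', 'cu116', None]
```

If the tag disagrees with the CUDA toolkit on the machine (e.g. a `cu113` build on a CUDA 11.6 driver stack), errors like "no kernel image is available for execution on the device" are a common symptom.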
And this is my pip freeze:
absl-py==1.4.0
aiohttp==3.8.4
aiosignal==1.3.1
appdirs==1.4.4
astunparse==1.6.3
async-timeout==4.0.2
attrs==23.1.0
cachetools==5.3.0
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
deepspeed==0.6.7
docker-pycreds==0.4.0
docopt==0.6.2
docstring-parser==0.15
filelock==3.12.0
frozenlist==1.3.3
fsspec==2023.4.0
func-timeout==4.3.5
gitdb==4.0.10
GitPython==3.1.31
google-auth==2.17.3
google-auth-oauthlib==1.0.0
grpcio==1.54.0
hjson==3.1.0
huggingface-hub==0.14.1
idna==3.4
importlib-metadata==6.6.0
joblib==1.2.0
jsonargparse==4.15.0
Markdown==3.4.3
MarkupSafe==2.1.2
multidict==6.0.4
ninja==1.11.1
nltk==3.8.1
numpy==1.24.3
oauthlib==3.2.2
openai==0.27.5
overrides==7.3.1
packaging==23.1
pandas==2.0.1
pathtools==0.1.2
Pillow==9.5.0
pipreqs==0.4.13
protobuf==4.22.3
psutil==5.9.5
py-cpuinfo==9.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.7
pyDeprecate==0.3.2
python-dateutil==2.8.2
pytorch-lightning==1.7.4
pytz==2023.3
PyYAML==6.0
regex==2023.3.23
requests==2.29.0
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.10.1
sentencepiece==0.1.98
sentry-sdk==1.21.0
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
sqlparse==0.4.4
tensorboard==2.12.2
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
tokenizers==0.13.3
torch==1.12.1+cu116
torchaudio==0.12.1+cu116
torchmetrics==0.9.3
torchvision==0.13.1+cu116
tqdm==4.65.0
transformers @ git+https://github.com/huggingface/transformers@11fd2c773b11c3fcfe0fa25aa4b92db03c83636c
tree-sitter==0.19.0
typing_extensions==4.5.0
tzdata==2023.3
urllib3==1.26.15
wandb==0.15.0
Werkzeug==2.3.1
yarg==0.1.9
yarl==1.9.2
zipp==3.15.0
#46: re-update the implementation for llama, alpaca, santacoder