Closed yilunzhao closed 1 year ago
Seems like there is an error from CI. I've seen this before; check here to see if it's useful.
Hi @niansong1996, sorry for the late reply. I have resolved the CI error. It seems that I have to change the transformers version in requirements.txt to avoid the error.
That is okay. What we can do is use this branch to evaluate the new models before we decide to upgrade the transformers version in the main branch.
@yilunzhao I am getting the following error when testing LLaMA:
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:46, unhandled cuda error, NCCL version 2.10.3
The command I ran is:
python finetuning/trainer.py validate --config finetuning/training_configs/few_shot/spider.yaml --model.beam_size 1 --data.val_max_instances 1 --data.val_batch_size 1 --model.print_generation_results true --model.print_eval_every_n_batches 1 --model.init_args.transformer_model_name decapoda-research/llama-7b-hf --data.init_args.transformer_model_name decapoda-research/llama-7b-hf --trainer.devices 2
Now if I use one GPU, I will get this error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
Can you see if you can replicate those errors and figure out why they are happening?
Hi @niansong1996, I think this error is raised because the installed torch is incompatible with the CUDA version on ziva. Could you please try reinstalling torch with pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116 and see if it resolves the issue?
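As a quick way to double-check a pin like the ones above, the CUDA build tag baked into a pip requirement (the `+cu116` local version suffix) can be extracted and compared against the wheel index you point `--extra-index-url` at. This is a small hypothetical helper, not part of the repo:

```python
import re
from typing import Optional

def cuda_tag(pin: str) -> Optional[str]:
    """Extract the CUDA/ROCm build tag (e.g. 'cu116') from a pip pin
    like 'torch==1.12.1+cu116'; return None if the pin has no local tag."""
    m = re.search(r"\+((?:cu|rocm)[\w.]+)$", pin)
    return m.group(1) if m else None

# The three pins from the suggested install command; note torchaudio
# carries no local tag in the command itself (the index resolves it).
pins = [
    "torch==1.12.1+cu116",
    "torchvision==0.13.1+cu116",
    "torchaudio==0.12.1",
]
print([cuda_tag(p) for p in pins])  # → ['cu116', 'cu116', None]
```

If the tag disagrees with the CUDA toolkit on the machine (e.g. a `cu113` build on a CUDA 11.6 driver stack), errors like "no kernel image is available for execution on the device" are a common symptom.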
And this is my pip freeze:
absl-py==1.4.0
aiohttp==3.8.4
aiosignal==1.3.1
appdirs==1.4.4
astunparse==1.6.3
async-timeout==4.0.2
attrs==23.1.0
cachetools==5.3.0
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
deepspeed==0.6.7
docker-pycreds==0.4.0
docopt==0.6.2
docstring-parser==0.15
filelock==3.12.0
frozenlist==1.3.3
fsspec==2023.4.0
func-timeout==4.3.5
gitdb==4.0.10
GitPython==3.1.31
google-auth==2.17.3
google-auth-oauthlib==1.0.0
grpcio==1.54.0
hjson==3.1.0
huggingface-hub==0.14.1
idna==3.4
importlib-metadata==6.6.0
joblib==1.2.0
jsonargparse==4.15.0
Markdown==3.4.3
MarkupSafe==2.1.2
multidict==6.0.4
ninja==1.11.1
nltk==3.8.1
numpy==1.24.3
oauthlib==3.2.2
openai==0.27.5
overrides==7.3.1
packaging==23.1
pandas==2.0.1
pathtools==0.1.2
Pillow==9.5.0
pipreqs==0.4.13
protobuf==4.22.3
psutil==5.9.5
py-cpuinfo==9.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.7
pyDeprecate==0.3.2
python-dateutil==2.8.2
pytorch-lightning==1.7.4
pytz==2023.3
PyYAML==6.0
regex==2023.3.23
requests==2.29.0
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.10.1
sentencepiece==0.1.98
sentry-sdk==1.21.0
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
sqlparse==0.4.4
tensorboard==2.12.2
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
tokenizers==0.13.3
torch==1.12.1+cu116
torchaudio==0.12.1+cu116
torchmetrics==0.9.3
torchvision==0.13.1+cu116
tqdm==4.65.0
transformers @ git+https://github.com/huggingface/transformers@11fd2c773b11c3fcfe0fa25aa4b92db03c83636c
tree-sitter==0.19.0
typing_extensions==4.5.0
tzdata==2023.3
urllib3==1.26.15
wandb==0.15.0
Werkzeug==2.3.1
yarg==0.1.9
yarl==1.9.2
zipp==3.15.0
#46: re-update the implementation for llama, alpaca, santacoder