DRAGNLabs / 301r_retnet

Long net Integration #48

Closed by JacksonSearle 6 months ago

JacksonSearle commented 6 months ago

We updated torchscale, which now includes LongNet (as implemented by the torchscale authors). The README has updated setup instructions for the new version of torchscale. There are two new LongNet-specific arguments in the .yaml. Model types can still be swapped out by changing the model_type string in the .yaml file.
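To illustrate the model_type switch described above, here is a minimal sketch of how a string from a parsed .yaml could select a model. Everything in it is hypothetical: the function name `build_model`, the registry, the placeholder factories, and the string values "retnet", "longnet", and "transformer" are assumptions for illustration and are not taken from the repository.

```python
# Hypothetical sketch: choose a model constructor from a config string.
# The keys and factories below are illustrative placeholders, not the
# repository's actual names.

def _make_retnet(config):
    return "RetNet placeholder"


def _make_longnet(config):
    return "LongNet placeholder"


def _make_transformer(config):
    return "Transformer placeholder"


# Maps the model_type string to the function that builds that model.
MODEL_REGISTRY = {
    "retnet": _make_retnet,
    "longnet": _make_longnet,
    "transformer": _make_transformer,
}


def build_model(config, registry=MODEL_REGISTRY):
    """Look up config["model_type"] and call the matching factory."""
    model_type = config["model_type"]
    if model_type not in registry:
        known = ", ".join(sorted(registry))
        raise ValueError(f"Unknown model_type {model_type!r}; expected one of: {known}")
    return registry[model_type](config)


# A plain dict stands in for the parsed .yaml file here.
example_config = {"model_type": "longnet"}
model = build_model(example_config)
```

Failing loudly on an unknown string keeps a typo in the .yaml from silently falling back to a different model.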

nprisbrey commented 6 months ago

@JacksonSearle, when @KimballNJardine and I followed the README instructions to the letter to run the LongNet model, we ran into issues when we tried to train the model on the supercomputer. An issue with flash attention v2 and CUDA prevented things from running (this is the job output file). We tried to rule out PyTorch, CUDA, and flash-attn version mismatches by double-checking the CUDA version in the environment and on the job node, but that doesn't seem to be the problem, since both were running CUDA version 12.4. We then "loosened" the torch version in the requirements.txt file but ran into different errors when trying to install flash-attn with the newest version of torch (this is the error we received from the pip install flash-attn --no-build-isolation command). We're not sure how to recreate the environment that you ran the model in so that we can test this ourselves. Could you give more details about the steps to take and the package versions to install to recreate the environment you had success in?
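When chasing this kind of mismatch, it can help to record the same few versions on both the login environment and the job node. The sketch below is one way to do that with the standard library. The helper names `report_versions` and `torch_cuda_version` are made up for this example, and the default package list is an assumption about which packages matter here. Note that the CUDA version a torch wheel was built against (`torch.version.cuda`) can differ from what `nvcc --version` reports for the system toolkit.

```python
import importlib.metadata as md
import platform


def report_versions(packages=("torch", "flash-attn", "triton")):
    """Return {name: version or None} without importing the packages."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions


def torch_cuda_version():
    """CUDA version torch was built for; None if torch is missing or CPU-only."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda
```

Printing `report_versions()` and `torch_cuda_version()` from inside the batch job and from the interactive shell, then comparing the two outputs, shows whether both are really using the same environment.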

JacksonSearle commented 6 months ago

@nprisbrey Here's some info:

(retnet) (base) bash-4.2$ pip list
Package Version
absl-py 2.0.0
accelerate 0.26.1
aiohttp 3.8.4
aiosignal 1.3.1
async-timeout 4.0.2
attrs 23.1.0
cachetools 5.3.2
certifi 2023.11.17
chardet 5.2.0
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
contourpy 1.2.0
cycler 0.12.1
DataProperty 1.0.1
datasets 2.16.0
dill 0.3.6
einops 0.7.0
evaluate 0.4.1
fairscale 0.4.13
filelock 3.13.1
flash-attn 2.5.6
fonttools 4.47.2
frozenlist 1.3.3
fsspec 2023.6.0
google-auth 2.26.2
google-auth-oauthlib 1.2.0
grpcio 1.60.0
huggingface-hub 0.20.2
idna 3.6
Jinja2 3.1.2
joblib 1.3.2
jsonlines 4.0.0
kiwisolver 1.4.5
lightning-utilities 0.10.1
lm_eval 0.4.0
lxml 5.1.0
Markdown 3.5.2
MarkupSafe 2.1.3
matplotlib 3.8.2
mbstrdecoder 1.1.3
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
networkx 3.2.1
ninja 1.11.1.1
nltk 3.8.1
numexpr 2.9.0
numpy 1.26.2
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
packaging 23.2
pandas 2.0.2
pathvalidate 3.2.0
peft 0.8.2
Pillow 10.1.0
pip 23.3.1
portalocker 2.8.2
protobuf 4.23.4
psutil 5.9.8
pyarrow 12.0.1
pyarrow-hotfix 0.6
pyasn1 0.5.1
pyasn1-modules 0.3.0
pybind11 2.11.1
pyparsing 3.1.1
pytablewriter 1.2.0
python-dateutil 2.8.2
pytorch-lightning 2.1.3
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.6.3
requests 2.31.0
requests-oauthlib 1.3.1
responses 0.18.0
rouge-score 0.1.2
rsa 4.9
sacrebleu 2.4.0
safetensors 0.3.1
scikit-learn 1.4.0
scipy 1.12.0
setuptools 68.0.0
six 1.16.0
sqlitedict 2.1.0
sympy 1.12
tabledata 1.3.3
tabulate 0.9.0
tcolorpy 0.1.4
tensorboard 2.15.1
tensorboard-data-server 0.7.2
threadpoolctl 3.2.0
timm 0.9.12
tokenizers 0.15.0
torch 2.1.1
torchdata 0.7.1
torchinfo 1.8.0
torchmetrics 1.3.1
torchtext 0.16.1
torchvision 0.16.1
tqdm 4.66.1
tqdm-multiprocess 0.0.11
transformers 4.36.2
triton 2.1.0
typepy 1.3.2
typing_extensions 4.6.3
tzdata 2023.3
urllib3 2.1.0
Werkzeug 3.0.1
wheel 0.41.2
xxhash 3.2.0
yarl 1.9.2
zstandard 0.22.0

(retnet) (base) bash-4.2$ cuda --version
bash: cuda: command not found

(retnet) (base) bash-4.2$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

(retnet) (base) bash-4.2$ python --version
Python 3.11.5
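A listing like the one above can be turned into a quick check against another environment. The sketch below compares installed versions with a handful of pins copied from that pip list. Which packages to pin is a judgment call made for this example, and the helper names `installed_version` and `find_mismatches` are hypothetical. Local build suffixes such as `+cu121` are ignored when comparing, on the assumption that only the base version matters here.

```python
import importlib.metadata as md

# Versions copied from the pip list above; the selection is illustrative.
KNOWN_GOOD = {
    "torch": "2.1.1",
    "flash-attn": "2.5.6",
    "triton": "2.1.0",
    "torchvision": "0.16.1",
    "torchtext": "0.16.1",
    "pytorch-lightning": "2.1.3",
    "transformers": "4.36.2",
}


def installed_version(name):
    """Return the installed version string, or None if not installed."""
    try:
        return md.version(name)
    except md.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {name: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, want in expected.items():
        have = lookup(name)
        # Drop a local build suffix such as "+cu121" before comparing.
        base = have.split("+")[0] if have else None
        if base != want:
            mismatches[name] = (want, have)
    return mismatches
```

Calling `find_mismatches(KNOWN_GOOD)` in the environment under test returns an empty dict when every pinned package matches, and otherwise lists each expected version next to what was found.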

nprisbrey commented 6 months ago

@JacksonSearle does the most recent version of the code still work?

JacksonSearle commented 6 months ago

Hey, sorry, I was working on getting my .yaml up to date with the new changes in main. I had a couple of wrong file paths, some missing required arguments, etc. I'll take another look at it later this afternoon.

nprisbrey commented 6 months ago

@JacksonSearle, any update?