Increase in training speed, pip list

Here is my pip list of installed packages, which took the training time from 24 hours to 13 hours on 6 RTX 4090s. I used the newer versions of all the packages, and the speed increase is significant. Please note that I also have the AMD CPU drivers and the P2P driver for NVIDIA installed. You also have to compile flash-attn and Python from source, and install CUDA 12.4:

    pip3 list
    Package Version Editable project location
    absl-py 2.1.0
    accelerate 0.34.2
    aiofiles 22.1.0
    aiohappyeyeballs 2.4.0
    aiohttp 3.10.5
    aiosignal 1.3.1
    aiosqlite 0.20.0
    altair 5.4.1
    annotated-types 0.7.0
    anyio 4.4.0
    appdirs 1.4.4
    argon2-cffi 23.1.0
    argon2-cffi-bindings 21.2.0
    arrow 1.3.0
    asttokens 2.4.1
    async-lru 2.0.4
    async-timeout 4.0.3
    attrs 24.2.0
    av 13.0.0
    babel 2.16.0
    beartype 0.14.1
    beautifulsoup4 4.12.3
    better-abc 0.0.3
    bidict 0.23.1
    bitsandbytes 0.43.3
    black 24.1.0
    bleach 6.1.0
    cachetools 5.5.0
    certifi 2024.8.30
    cffi 1.17.1
    cfgv 3.4.0
    chardet 5.2.0
    charset-normalizer 3.3.2
    click 8.1.7
    cmake 3.30.2
    colorama 0.4.6
    comm 0.2.2
    contourpy 1.3.0
    crcmod 1.7
    cryptography 43.0.1
    cuda-python 12.4.0 /home/myles/cuda-python-12.4.0
    cycler 0.12.1
    Cython 3.0.11
    DataProperty 1.0.1
    datasets 2.16.1
    debugpy 1.8.5
    decorator 5.1.1
    decord 0.6.0
    deepspeed 0.15.2+fc22d960
    deepspeed-kernels 0.0.1.dev1698255861
    defusedxml 0.7.1
    Deprecated 1.2.14
    dill 0.3.7
    distlib 0.3.8
    distro 1.9.0
    dnspython 2.6.1
    docker-pycreds 0.4.0
    docstring_parser 0.16
    einops 0.8.0
    einops-exts 0.0.4
    entrypoints 0.4
    et-xmlfile 1.1.0
    eval_type_backport 0.2.0
    evaluate 0.4.2
    exceptiongroup 1.2.2
    executing 2.1.0
    fancy-einsum 0.0.3
    fastapi 0.112.4
    fastjsonschema 2.20.0
    ffmpy 0.4.0
    filelock 3.16.0
    flash_attn 2.6.3
    fonttools 4.53.1
    fqdn 1.5.1
    frozenlist 1.4.1
    fsspec 2023.10.0
    ftfy 6.2.3
    gitdb 4.0.11
    GitPython 3.1.43
    gradio 4.43.0
    gradio_client 1.3.0
    grpcio 1.66.1
    h11 0.14.0
    hf_transfer 0.1.8
    hjson 3.1.0
    httpcore 1.0.5
    httpx 0.27.2
    huggingface-hub 0.24.6
    identify 2.6.0
    idna 3.8
    importlib_metadata 8.4.0
    importlib_resources 6.4.4
    iniconfig 2.0.0
    ipaddress 1.0.23
    ipykernel 6.29.5
    ipython 8.27.0
    ipython-genutils 0.2.0
    ipywidgets 8.1.5
    isoduration 20.11.0
    jaxtyping 0.2.34
    jedi 0.19.1
    Jinja2 3.1.4
    jiter 0.5.0
    joblib 1.4.2
    json5 0.9.25
    jsonlines 4.0.0
    jsonpointer 3.0.0
    jsonschema 4.23.0
    jsonschema-specifications 2023.12.1
    jupyter 1.1.1
    jupyter_client 8.6.2
    jupyter-console 6.6.3
    jupyter_core 5.7.2
    jupyter-events 0.10.0
    jupyter-lsp 2.2.5
    jupyter_server 2.14.2
    jupyter_server_fileid 0.9.3
    jupyter_server_terminals 0.5.3
    jupyter_server_ydoc 0.8.0
    jupyter-ydoc 0.3.4
    jupyterlab 4.2.5
    jupyterlab_pygments 0.3.0
    jupyterlab_server 2.27.3
    jupyterlab_widgets 3.0.13
    kiwisolver 1.4.7
    latex2mathml 3.77.0
    Levenshtein 0.25.1
    linkify-it-py 2.0.3
    llava 1.7.0.dev0 /home/myles/LLaVA-NeXT
    lmms_eval 0.1.2
    lxml 5.3.0
    markdown-it-py 3.0.0
    markdown2 2.5.0
    MarkupSafe 2.1.5
    matplotlib 3.9.2
    matplotlib-inline 0.1.7
    mbstrdecoder 1.1.3
    mdit-py-plugins 0.4.1
    mdurl 0.1.2
    mistune 3.0.2
    mpmath 1.3.0
    msgpack 1.0.8
    multidict 6.0.5
    multiprocess 0.70.15
    mypy-extensions 1.0.0
    narwhals 1.6.2
    nbclassic 1.1.0
    nbclient 0.10.0
    nbconvert 7.16.4
    nbformat 5.10.4
    nest-asyncio 1.6.0
    networkx 3.3
    ninja 1.11.1.1
    nltk 3.9.1
    nodeenv 1.9.1
    notebook 7.2.2
    notebook_shim 0.2.4
    numexpr 2.10.1
    numpy 1.26.4
    nvidia-cublas-cu12 12.4.5.8
    nvidia-cuda-cupti-cu12 12.4.127
    nvidia-cuda-nvrtc-cu12 12.4.127
    nvidia-cuda-runtime-cu12 12.4.127
    nvidia-cudnn-cu12 9.1.0.70
    nvidia-cufft-cu12 11.2.1.3
    nvidia-curand-cu12 10.3.5.147
    nvidia-cusolver-cu12 11.6.1.9
    nvidia-cusparse-cu12 12.3.1.170
    nvidia-cutlass 3.5.1.0 /home/myles/cutlass
    nvidia-ml-py 12.560.30
    nvidia-nccl-cu12 2.21.5
    nvidia-nvjitlink-cu12 12.4.127
    nvidia-nvtx-cu12 12.4.127
    nvidia-pyindex 1.0.9
    open_clip_torch 2.26.1
    openai 1.44.0
    opencv-python 4.10.0.84
    opencv-python-headless 4.10.0.84
    openpyxl 3.1.5
    orjson 3.10.7
    overrides 7.7.0
    packaging 24.1
    pandas 2.2.2
    pandocfilters 1.5.1
    parso 0.8.4
    pathlib2 2.3.7.post1
    pathspec 0.12.1
    pathvalidate 3.2.1
    peft 0.12.0
    pexpect 4.9.0
    Pillow 10.1.0
    pip 24.2
    platformdirs 4.3.1
    pluggy 1.5.0
    ply 3.11
    portalocker 2.10.1
    pre-commit 3.8.0
    prometheus_client 0.20.0
    promise 2.3
    prompt_toolkit 3.0.47
    protobuf 5.28.0
    psutil 6.0.0
    ptyprocess 0.7.0
    pure_eval 0.2.3
    py 1.11.0
    py-cpuinfo 9.0.0
    py-spy 0.3.14
    pyarrow 17.0.0
    pyarrow-hotfix 0.6
    pybind11 2.13.5
    pycocoevalcap 1.2
    pycocotools 2.0.8
    pycparser 2.22
    pycryptodomex 3.20.0
    pydantic 2.9.0
    pydantic_core 2.23.2
    pydot 3.0.1
    pydub 0.25.1
    Pygments 2.18.0
    PyJWT 2.9.0
    pynvml 11.5.3
    pyOpenSSL 24.2.1
    pyparsing 3.1.4
    pyproject-api 1.7.1
    pytablewriter 1.2.0
    pytest 8.3.2
    python-consul 1.1.0
    python-dateutil 2.9.0.post0
    python-engineio 4.9.1
    python-etcd 0.4.5
    python-json-logger 2.0.7
    python-multipart 0.0.9
    python-socketio 5.11.4
    pytorch-triton 3.0.0+757b6a61e7
    pytz 2024.1
    PyYAML 6.0.2
    pyzmq 26.2.0
    qtconsole 5.6.0
    QtPy 2.4.1
    rapidfuzz 3.9.7
    referencing 0.35.1
    regex 2024.7.24
    requests 2.32.3
    responses 0.25.3
    rfc3339-validator 0.1.4
    rfc3986-validator 0.1.1
    rich 13.8.0
    rouge_score 0.1.2
    rpds-py 0.20.0
    ruff 0.6.4
    sacrebleu 2.4.3
    safetensors 0.4.5
    schedule 1.2.2
    scikit-learn 1.5.1
    scipy 1.14.1
    semantic-version 2.10.0
    Send2Trash 1.8.3
    sentencepiece 0.2.0
    sentry-sdk 2.13.0
    setproctitle 1.3.3
    setuptools 70.2.0
    shellingham 1.5.4
    shortuuid 1.0.13
    shtab 1.7.1
    simple-websocket 1.0.0
    six 1.16.0
    smmap 5.0.1
    sniffio 1.3.1
    soupsieve 2.6
    sqlitedict 2.1.0
    stack-data 0.6.3
    starlette 0.38.4
    svgwrite 1.4.3
    sympy 1.13.1
    tabledata 1.3.3
    tabulate 0.9.0
    tcolorpy 0.1.6
    tenacity 9.0.0
    terminado 0.18.1
    threadpoolctl 3.5.0
    thriftpy2 0.5.2
    tiktoken 0.7.0
    timm 1.0.9
    tinycss2 1.3.0
    tokenizers 0.19.1
    toml 0.10.2
    tomli 2.0.1
    tomlkit 0.12.0
    toolz 0.12.1
    torch 2.5.0.dev20240907+cu124
    torchaudio 2.5.0.dev20240907+cu124
    torchvision 0.20.0.dev20240907+cu124
    tornado 6.4.1
    tox 4.18.1
    tqdm 4.66.5
    tqdm-multiprocess 0.0.11
    traitlets 5.14.3
    transformer-lens 2.4.1
    transformers 4.44.2
    transformers-stream-generator 0.0.5
    treelib 1.7.0
    triton 3.0.0
    typeguard 2.13.3
    typepy 1.3.2
    typer 0.12.5
    types-python-dateutil 2.9.0.20240906
    typing_extensions 4.12.2
    tyro 0.8.10
    tzdata 2024.1
    uc-micro-py 1.0.3
    uri-template 1.3.0
    urllib3 2.2.2
    uvicorn 0.30.6
    virtualenv 20.26.4
    wandb 0.17.9
    watchdog 5.0.2
    wavedrom 2.0.3.post3
    wcwidth 0.2.13
    webcolors 24.8.0
    webencodings 0.5.1
    websocket-client 1.8.0
    websockets 12.0
    wheel 0.44.0
    widgetsnbextension 4.0.13
    wrapt 1.16.0
    wsproto 1.2.0
    xxhash 3.5.0
    y-py 0.6.2
    yarl 1.10.0
    ypy-websocket 0.8.4
    zipp 3.20.1
    zstandard 0.23.0

This is for the pretrain step.
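If you want to check whether your own environment matches the pip list above, something like the following could help. It is only a sketch: the choice of which packages to compare is my assumption (the ones that seem most likely to affect speed), and the names `EXPECTED`, `installed_version` and `compare_environment` are made up for illustration. The version strings are copied from the list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pip list in this post. Which packages matter
# most for speed is an assumption, not something measured here.
EXPECTED = {
    "torch": "2.5.0.dev20240907+cu124",
    "flash_attn": "2.6.3",
    "deepspeed": "0.15.2+fc22d960",
    "transformers": "4.44.2",
    "accelerate": "0.34.2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def compare_environment(expected):
    """Map each package name to (wanted, found, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in compare_environment(EXPECTED).items():
        status = "ok" if ok else "differs"
        print(f"{name}: wanted {wanted}, found {found} ({status})")
```

A mismatch does not necessarily mean slower training; it only tells you where your setup differs from the one described here.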
You have to edit the requirements in the LLaVA-NeXT and flash-attn sources. I may try compiling PyTorch as well to see whether that increases speed further. Even so, if training took 5 hours on 8 A100 GPUs and takes 14 hours on 6 RTX 4090s, the 4090 setup is the considerably cheaper option. The Python version is 3.12.6.

    {'loss': 2.2755, 'grad_norm': 0.39399619770693817, 'learning_rate': 0.0002123281653978702, 'epoch': 0.7}
    {'loss': 2.1398, 'grad_norm': 0.37070551410852476, 'learning_rate': 0.00021181571289009837, 'epoch': 0.7}
    {'loss': 2.2116, 'grad_norm': 0.3767641455157773, 'learning_rate': 0.00021130371328809583, 'epoch': 0.71}
    {'loss': 2.2108, 'grad_norm': 0.37521066280004145, 'learning_rate': 0.00021079216739651292, 'epoch': 0.71}
    {'loss': 2.2801, 'grad_norm': 0.41099287831954295, 'learning_rate': 0.00021028107601928664, 'epoch': 0.71}
    {'loss': 2.2858, 'grad_norm': 0.3960054936394853, 'learning_rate': 0.0002097704399596404, 'epoch': 0.71}
    {'loss': 2.2797, 'grad_norm': 0.3966764350684436, 'learning_rate': 0.00020926026002008135, 'epoch': 0.71}
    71%|████████████████████████▋ | 1826/2584 [9:42:12<4:02:01, 19.16s/it]
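To put rough numbers on the comparison, here is a small sketch that reads the progress line from the log and works out the projected wall time and GPU-hours. The GPU counts and the 5-hour A100 figure are the ones quoted in this post; the function names are made up for illustration, and no rental or purchase prices are assumed, only the ratio at which the two setups would cost the same.

```python
import re

# Progress line copied from the training log in this post.
PROGRESS = "71%|████████████████████████▋ | 1826/2584 [9:42:12<4:02:01, 19.16s/it]"


def parse_progress(line):
    """Extract (steps_done, steps_total, seconds_per_step) from a tqdm-style line."""
    match = re.search(r"(\d+)/(\d+) \[.*?([\d.]+)s/it\]", line)
    if match is None:
        raise ValueError("not a tqdm-style progress line")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def gpu_hours(num_gpus, wall_hours):
    """Total GPU-hours consumed by a run."""
    return num_gpus * wall_hours


done, total, sec_per_step = parse_progress(PROGRESS)

# Projected wall time if every step ran at the reported rate.
est_total_hours = total * sec_per_step / 3600

rtx_gpu_hours = gpu_hours(6, est_total_hours)   # 6 x RTX 4090, this post
a100_gpu_hours = gpu_hours(8, 5.0)              # 8 x A100, 5 hours as quoted

# The 4090 run is cheaper when one 4090-hour costs less than this
# fraction of one A100-hour.
break_even_ratio = a100_gpu_hours / rtx_gpu_hours

if __name__ == "__main__":
    print(f"steps: {done}/{total}, projected wall time: {est_total_hours:.2f} h")
    print(f"RTX 4090 GPU-hours: {rtx_gpu_hours:.1f}, A100 GPU-hours: {a100_gpu_hours:.1f}")
    print(f"break-even price ratio (4090 / A100): {break_even_ratio:.2f}")
```

With the figures from this log the projection comes to about 13.75 hours, so the 4090 run uses roughly twice the GPU-hours of the A100 run and is cheaper whenever a 4090-hour costs less than about 48% of an A100-hour.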

    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0 Off |                  Off |
    | 63%   67C    P2            248W /  400W |   23075MiB /  24564MiB |    100%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA GeForce RTX 4090        On  |   00000000:02:00.0  On |                  Off |
    | 42%   55C    P2            257W /  400W |   23917MiB /  24564MiB |    100%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   2  NVIDIA GeForce RTX 4090        On  |   00000000:2B:00.0 Off |                  Off |
    | 39%   60C    P2            261W /  400W |   23819MiB /  24564MiB |    100%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   3  NVIDIA GeForce RTX 4090        On  |   00000000:41:00.0 Off |                  Off |
    | 38%   53C    P2            254W /  400W |   23447MiB /  24564MiB |    100%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   4  NVIDIA GeForce RTX 4090        On  |   00000000:42:00.0 Off |                  Off |
    | 34%   55C    P2            278W /  400W |   21821MiB /  24564MiB |    100%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   5  NVIDIA GeForce RTX 4090        On  |   00000000:61:00.0 Off |                  Off |
    | 58%   60C    P2            273W /  400W |   23915MiB /  24564MiB |    100%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
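The nvidia-smi output above shows every card close to its 24 GB limit, so it may be worth watching memory headroom during a run. A possible way to do that is sketched below, using the `--query-gpu` CSV mode of `nvidia-smi`. The helper names are made up for illustration, and `SAMPLE` is not real CSV output from my machine: it is rebuilt by hand from the memory and utilization figures in the table above.

```python
import subprocess

QUERY = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    rows = []
    for line in text.strip().splitlines():
        index, used, total, util = (field.strip() for field in line.split(","))
        rows.append({
            "index": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
            "util_pct": int(util),
            "free_mib": int(total) - int(used),
        })
    return rows


def query_gpus():
    """Run nvidia-smi and return one dict per GPU (needs the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Hand-built sample using the numbers from the nvidia-smi table in this post.
SAMPLE = """\
0, 23075, 24564, 100
1, 23917, 24564, 100
2, 23819, 24564, 100
3, 23447, 24564, 100
4, 21821, 24564, 100
5, 23915, 24564, 100
"""

if __name__ == "__main__":
    for gpu in parse_gpu_csv(SAMPLE):
        print(f"GPU {gpu['index']}: {gpu['free_mib']} MiB free, {gpu['util_pct']}% util")
```

On a machine with the driver installed, `query_gpus()` can replace `parse_gpu_csv(SAMPLE)` to read live values.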

