alistairwgillespie opened 9 months ago
Are you able to test with a newer GPU? I do not remember if bnb works well with V100.
@NanoCode012 Any suggested accelerators? A100s are difficult to get ahold of on AWS. Thanks
A6000 or L4. You may also want to try some alternative providers, as AWS is quite expensive.
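The V100 concern above usually comes down to bfloat16: the common bitsandbytes 4-bit/QLoRA paths default to bf16 compute, which requires compute capability 8.0 (Ampere) or newer, while the V100 is 7.0. A minimal sketch of that check (capability values are taken from NVIDIA's published specs; the table and helper name are mine, not from axolotl):

```python
# Compute capabilities per NVIDIA's published specs; native bf16 needs sm_80+.
COMPUTE_CAPABILITY = {
    "V100": (7, 0),
    "A100": (8, 0),
    "A6000": (8, 6),
    "L4": (8, 9),
}

def supports_native_bf16(gpu: str) -> bool:
    """True if the GPU has native bfloat16 compute (Ampere, sm_80, or newer)."""
    major, _minor = COMPUTE_CAPABILITY[gpu]
    return major >= 8

for gpu, cap in COMPUTE_CAPABILITY.items():
    print(f"{gpu} (sm_{cap[0]}{cap[1]}): bf16={supports_native_bf16(gpu)}")
```

If bf16 is the culprit on a V100, setting `bf16: false` and `fp16: true` in the axolotl config is worth trying before switching hardware.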
Please check that this issue hasn't been reported before.
Expected Behavior
Hi, I'm trying the public cloud example that trains Mistral on AWS, expecting a training run to spin up and complete. Instead, I get the following CUDA error. I've modified the config to use a single spot V100. In my testing, I've tried the latest image versions and both the winglian/axolotl and winglian/axolotl-cloud image sources, which didn't help.
Current behaviour
Steps to reproduce
Steps:
Config yaml

```yaml
name: axolotl

resources:
  accelerators: V100:1
  cloud: aws  # optional
  use_spot: True

workdir: mistral

file_mounts:
  /sky-notebook:
    name: ${BUCKET}
    mode: MOUNT

setup: |
  docker pull winglian/axolotl-cloud:main-py3.10-cu118-2.1.2

run: |
  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 \
    huggingface-cli login --token ${HF_TOKEN}

  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    -v /sky-notebook:/sky-notebook \
    winglian/axolotl-cloud:main-py3.10-cu118-2.1.2 \
    accelerate launch -m axolotl.cli.train /sky_workdir/lora.yaml

envs:
  HF_TOKEN: # TODO: Replace with huggingface token
  BUCKET:
```
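For reference, a task YAML like this is launched with SkyPilot's CLI, passing the two env values in with `--env`. A hedged sketch (the cluster name, file name, and all values are placeholders, not from the original report):

```shell
# Assumes the YAML above is saved as mistral.yaml; token and bucket are placeholders.
sky launch -c axolotl-v100 mistral.yaml \
  --env HF_TOKEN=hf_your_token_here \
  --env BUCKET=your-bucket-name
```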
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements