Closed ctjlewis closed 1 year ago
cc @NanoCode012, any advice on a workaround in the meantime is appreciated. I will also PR this script once it's stable.
This is confusing because I did not have this issue running on a Colab VM; somehow it was able to make Python 3.9+ work with PyTorch.
Hey, is this lambdalabs?
The error sounds like torch was not installed correctly. Maybe try uninstalling and then reinstalling it. If that doesn't work, try Miniconda as a second option. If all else fails, you can use the Docker image.
Yes, this is on a Lambda Labs H100 instance. I tried that, but I can't get torch to uninstall. Thanks for quick response also.
ubuntu@209-20-159-38:~/axolotl$ pip uninstall torch
Found existing installation: torch 2.0.1
Uninstalling torch-2.0.1:
Would remove:
/usr/bin/torchrun
/usr/lib/python3/dist-packages/functorch
/usr/lib/python3/dist-packages/nvfuser
/usr/lib/python3/dist-packages/torch
/usr/lib/python3/dist-packages/torch-2.0.1.egg-info
/usr/lib/python3/dist-packages/torchgen
Proceed (Y/n)? y
ERROR: Exception:
Traceback (most recent call last):
File "/usr/lib/python3.9/shutil.py", line 825, in move
os.rename(src, real_dst)
PermissionError: [Errno 13] Permission denied: '/usr/bin/torchrun' -> '/tmp/pip-uninstall-nd6b1zrl/torchrun'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/cli/base_command.py", line 169, in exc_logging_wrapper
status = run_func(*args)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/commands/uninstall.py", line 105, in run
uninstall_pathset = req.uninstall(
File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/req/req_install.py", line 680, in uninstall
uninstalled_pathset.remove(auto_confirm, verbose)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/req/req_uninstall.py", line 381, in remove
moved.stash(path)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/req/req_uninstall.py", line 272, in stash
renames(path, new_path)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/utils/misc.py", line 313, in renames
shutil.move(old, new)
File "/usr/lib/python3.9/shutil.py", line 846, in move
os.unlink(src)
PermissionError: [Errno 13] Permission denied: '/usr/bin/torchrun'
ubuntu@209-20-159-38:~/axolotl$ sudo pip uninstall torch
Found existing installation: torch 2.0.1
Not uninstalling torch at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'torch'. No files were found to uninstall.
ubuntu@209-20-159-38:~/axolotl$ sudo -H pip uninstall torch
Found existing installation: torch 2.0.1
Not uninstalling torch at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'torch'. No files were found to uninstall.
Hey, seems like somehow the permissions got stuck? I would recommend just rebooting a new one to start from scratch. I would recommend Miniconda then to prevent this issue. I don't think you need to use sudo at all.
Lastly, I just wanted to give fyi, I also have issues with H100 on lambdalabs particularly with bitsandbytes and xformers in case you're using those!
I see what's happening - pytorch is being provided by apt packages:
ubuntu@209-20-159-38:~/axolotl$ dpkg -l | grep torch
ii python3-torch-cuda 2.0.1+ds-0lambda0.20.04.1 amd64 Tensors and Dynamic neural networks GPU accelerated (Python 3)
ii python3-torchvision-cuda 0.15.1-0lambda0.20.04.1 amd64 Image and video datasets and models for PyTorch (Python 3, CUDA)
We should be good to wipe these out and handle it via PyTorch install instructions, right?
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Which is already in the script. Let's see.
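To see which copy of torch Python would pick up without importing it, something like the sketch below may help. It assumes the Debian/Ubuntu layout visible in the logs above (apt packages under /usr/lib/python3/dist-packages, user-level pip installs under ~/.local); `classify_install` and its labels are hypothetical names used only for illustration.

```python
import importlib.util
from pathlib import PurePosixPath


def classify_install(location: str) -> str:
    """Guess who owns a package path (assumes a Debian/Ubuntu layout)."""
    parts = PurePosixPath(location).parts
    # apt/dpkg puts Python packages in /usr/lib/python3/dist-packages
    if location.startswith("/usr/lib/") and "dist-packages" in parts:
        return "system"
    # `pip install --user` (or pip without sudo) writes under ~/.local
    if ".local" in parts:
        return "user"
    return "other"


if __name__ == "__main__":
    # find_spec locates a top-level package without importing it
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        print("torch not found on this interpreter")
    else:
        print(spec.origin, "->", classify_install(spec.origin))
```

If this reports "system", the package is most likely owned by apt, which would explain why pip refuses to uninstall it.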
You can pass -U to upgrade or --ignore-installed so it force installs. I was planning to add that flag but didn't get to it.
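A rough sketch of how the two flags could be wired into a setup script is shown below. It only builds the command line and does not run it; `build_torch_install_cmd` is a hypothetical helper, and the package list and index URL are copied from the pip command quoted earlier in this thread.

```python
import sys
from typing import List

# Index URL taken from the install command quoted in this thread
CU118_INDEX = "https://download.pytorch.org/whl/cu118"


def build_torch_install_cmd(force: bool = True, upgrade: bool = False) -> List[str]:
    """Build (but do not run) a pip command for the CUDA 11.8 wheels."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        cmd.append("-U")  # upgrade an existing installation
    if force:
        cmd.append("--ignore-installed")  # install over whatever is already there
    cmd += ["torch", "torchvision", "torchaudio", "--index-url", CU118_INDEX]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_torch_install_cmd()))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the printed command looks right.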
We got it 😎
Will finalize this tomorrow:
https://gist.github.com/ctjlewis/7540d88f4ddb93d36e7515fb1b911833
This is my first few minutes of training loss with xformers disabled (we need PyTorch nightly for H100, can't use xformers with PyTorch nightly, have to disable xformers):
Which seems like an issue but I'll give it an hour or so.
Oh no...
This is just the regular OpenLLaMA 13B config with xformers disabled.
@NanoCode012 Hate to bump but what do you make of that train loss?
The only config change is xformers was disabled bc it was incompatible with PyTorch nightly.
Use bf16 @ctjlewis
Regarding pytorch, I don't think you need nightly. Though, if there's any new improvement in nightly, please do tell.
I had to use nightly to get rid of errors about H100 support (size sm_90).
With bf16 I ran into that __function__ error. I tried all config permutations per that issue about it:
- --mixed-precision bfloat16 in accelerate launch command (what I believed to be solution)
- --mixed-precision bfloat16 in accelerate launch command ONLY
But couldn't get rid of that error.
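On the sm_90 errors mentioned above, one way to check whether a given PyTorch build was compiled for the GPU is to compare the device's compute capability against the build's architecture list. This is a minimal sketch: `build_supports_device` is a hypothetical helper, and it only looks for an exact `sm_XY` entry, ignoring any PTX (`compute_XY`) fallback.

```python
from typing import Iterable, Tuple


def build_supports_device(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """True if the build lists native code for this compute capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in set(arch_list)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)  # (9, 0) on an H100
        archs = torch.cuda.get_arch_list()  # architectures this build targets
        print(cap, archs, build_supports_device(cap, archs))
    else:
        print("torch with CUDA is not available here")
```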
I use an older version of transformers 4.29.2. However, you'll need to modify source code to comment out 4bit.
Or maybe the Axolotl library could provide that instead.
I did get the job started with nightly and H100 support, no xformers but if I could fix the bf16 error that would've been sufficient, without manually editing packages source.
How do I properly set bf16 for the LLaMA run? Only config, only command line argument, both?
Once I get this stable I'll PR the scripts in so it's a one-line thing for H100s on Lambda.
Yes, I think the 4bit part can be improved.
Unfortunately, as far as I can tell it's mainly people on H100s who run into this issue, so there haven't been many eyes on it.
I'm not sure why xformers or more specifically triton does not work as well on H100.
You can set bf16: true within config.
When the only config changes I make are:
xformers_attention: false
bf16: true
I get that function error:
ubuntu@209-20-159-22:~/axolotl$ accelerate launch scripts/finetune.py examples/openllama-3b/config.yml
2023-06-26 18:24:35.086204: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-26 18:24:35.146711: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-26 18:24:35.921006: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `1`
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2023-06-26 18:24:40.826320: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/lib/x86_64-linux-gnu/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
INFO:root:loading tokenizer... openlm-research/open_llama_3b
Using pad_token, but it is not set yet.
INFO:root:Loading prepared dataset from disk at last_run_prepared/c6b6388039831944360f60b07eaffe22...
INFO:root:Prepared dataset loaded from disk...
INFO:root:loading model and peft_config...
WARNING:accelerate.utils.modeling:The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
INFO:root:Compiling torch model
INFO:root:Starting trainer...
Traceback (most recent call last):
File "/home/ubuntu/axolotl/scripts/finetune.py", line 352, in <module>
fire.Fire(train)
File "/home/ubuntu/.local/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/ubuntu/.local/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/ubuntu/.local/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/ubuntu/axolotl/scripts/finetune.py", line 337, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1756, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1182, in prepare
result = tuple(
File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1311, in prepare_model
torch.autocast(device_type=self.device.type, dtype=torch.bfloat16)(model.forward.__func__), model
AttributeError: 'function' object has no attribute '__func__'
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 941, in launch_command
simple_launcher(args)
File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 603, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'scripts/finetune.py', 'examples/openllama-3b/config.yml']' returned non-zero exit status 1.
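The AttributeError at the end of the first traceback is ordinary Python behaviour: a bound method has a `__func__` attribute, a plain function does not. The sketch below reproduces that with a toy class. The idea that something replaced `model.forward` with a plain function before accelerate wrapped it is an assumption, not something confirmed in this thread, and `TinyModel`, `wrapped_forward` and `unwrap_forward` are hypothetical names.

```python
class TinyModel:
    """Stand-in for a model; not a real torch.nn.Module."""

    def forward(self, x):
        return x * 2


model = TinyModel()

# A method looked up on an instance is a bound method and carries __func__
bound_has_func = hasattr(model.forward, "__func__")


def wrapped_forward(x):
    # Plain function standing in for a wrapper around the original forward
    return x * 2


# Assigning a plain function to the instance shadows the bound method
model.forward = wrapped_forward
plain_has_func = hasattr(model.forward, "__func__")


def unwrap_forward(forward):
    """Return the underlying function whether or not forward is bound."""
    return getattr(forward, "__func__", forward)


if __name__ == "__main__":
    print(bound_has_func, plain_has_func)
```

A defensive lookup like `unwrap_forward` avoids the crash in this toy case; whether that is the right fix inside accelerate is a separate question.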
@NanoCode012 I moved off Lambda Labs, now on 8x A100 80GB.
xformers off, bf16: true, but the model still does not learn. I wonder if the hyperparams are messed up?
That's weird. That could be. You're also training for quite a lot of epochs.
@NanoCode012 @winglian Would it be possible for us to check that the vanilla Falcon/OpenLLaMA configs are working? I can't identify what's going wrong, even tweaking learning rate the behavior seems unpredictable.
Just a note, your dataset might also be a factor in this. If you can run the configs successfully, then it's usually the hyperparameters or the dataset.
Any news on this? I have been trying to get Axolotl to run on Lambda Labs for hours now without success...
Using the provided script with some fixes, I got additional errors with TensorFlow and psutil and also had to force-reinstall both to get anything running.
Now basically everything fails with various errors,
e.g. with the examples:
accelerate launch scripts/finetune.py examples/openllama-3b/qlora.yml
ERROR:root:Exception raised attempting to load model, retrying with AutoModelForCausalLM
ERROR:root:/home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cquantize_blockwise_fp16_nf4
Traceback (most recent call last):
or
accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml
results in
ERROR:root:Exception raised attempting to load model, retrying with AutoModelForCausalLM
ERROR:root:/home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats
I'm not very experienced. Are these axolotl issues or issues with the downstream libraries?
Thanks!
Hey @jpdus , I think you can raise a separate issue as this one seems to be solved. The author's last few comments were just on performance.
I see a potential issue with bitsandbytes not being compiled with GPU support, so I would also recommend that you recheck that you ran all the steps, and post your config as well.
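One quick way to tell the two cases apart is to look at which shared library bitsandbytes reports loading. The sketch below only inspects the file name, using the two names that appear in the logs in this thread (`libbitsandbytes_cuda118.so` and `libbitsandbytes_cpu.so`); `is_cpu_only_bnb` is a hypothetical helper, not part of bitsandbytes.

```python
from pathlib import PurePosixPath


def is_cpu_only_bnb(lib_path: str) -> bool:
    """True if the loaded bitsandbytes binary is the CPU-only fallback."""
    return PurePosixPath(lib_path).name == "libbitsandbytes_cpu.so"


if __name__ == "__main__":
    # Path copied from the error message quoted above
    path = "/home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so"
    print(is_cpu_only_bnb(path))
```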
@ctjlewis , feel free to close this one and discuss in discord or in separate issue if your training is not improving.
Closing this as it seems to be solved. Please re-open if problem comes back.
After upgrading to Python 3.9 and setting everything up, PyTorch becomes unusable, so the training script fails:
I condensed everything into one script. If we can get it working, we could add it inside scripts/ for ease of access: