
Error running in Lambda Labs VM using instructions in docs #242

Closed. ctjlewis closed this issue 1 year ago.

ctjlewis commented 1 year ago

After upgrading to Python 3.9 and setting everything up, PyTorch becomes unusable, so the training script fails:

ubuntu@209-20-159-38:~/axolotl$ python
Python 3.9.17 (main, Jun  6 2023, 20:11:04) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/torch/__init__.py", line 443, in <module>
    raise ImportError(textwrap.dedent('''
ImportError: Failed to load PyTorch C extensions:
    It appears that PyTorch has loaded the `torch/_C` folder
    of the PyTorch repository rather than the C extensions which
    are expected in the `torch._C` namespace. This can occur when
    using the `install` workflow. e.g.
        $ python setup.py install && python -c "import torch"

    This error can generally be solved using the `develop` workflow
        $ python setup.py develop && python -c "import torch"  # This should succeed
    or by running Python from a different directory.
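
As a sanity check, the error message's own last suggestion can be tested by importing torch from a neutral directory (a minimal sketch, nothing axolotl-specific):

# importing from a neutral directory rules out CWD shadowing;
# if the import still fails here, the installed torch package itself is broken
cd /tmp
python -c "import torch; print(torch.__file__, torch.__version__)"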

I condensed everything into one script; if we can get it working, we could add it to scripts/ for ease of access:

#!/bin/bash

set -e

# Function to gracefully exit if a command fails
abort() {
    echo >&2 '
***************
*** ABORTED ***
***************
'
    echo "An error occurred. Exiting..." >&2
    exit 1
}

trap 'abort' 0 # 0 = EXIT; with set -e, any failing command exits the script and fires this trap

# Update system
sudo apt update

# Install python3.9
sudo apt install -y python3.9
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.9 1
sudo update-alternatives --config python # user must pick 3.9 if given option

# Verify python version
version=$(python -V 2>&1 | grep -Po '(?<=Python )(.+)')
if [[ -z "$version" ]]
then
    echo "Failed to detect python version"
    exit 1
fi
echo "Python version $version installed."

# Install pip
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
rm get-pip.py

# Install torch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Ensure setuptools installed
pip3 install -U setuptools

# Install Axolotl
pip3 install -e .

# Install Axolotl dependencies
pip3 install protobuf==3.20.3
pip3 install -U requests scipy
pip3 install --ignore-installed psutil
pip3 install git+https://github.com/huggingface/peft.git # not for gptq

# Set path
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

# If we've reached this point, all commands were successful
trap : 0

echo >&2 '
************
*** DONE *** 
************
'
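
To try it, save it on a fresh instance (the filename setup.sh here is arbitrary) and run:

# hypothetical filename; run once on a fresh instance
chmod +x setup.sh
./setup.sh
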
ctjlewis commented 1 year ago

cc @NanoCode012, any advice on a workaround in the meantime is appreciated; I'll PR this script as well once it's stable.

This is confusing because I did not have this issue on a Colab VM; somehow it was able to make Python 3.9+ work with PyTorch.

NanoCode012 commented 1 year ago

> cc @NanoCode012, any advice on a workaround in the meantime is appreciated; I'll PR this script as well once it's stable.
>
> This is confusing because I did not have this issue on a Colab VM; somehow it was able to make Python 3.9+ work with PyTorch.

Hey, is this Lambda Labs?

The error sounds like torch wasn't installed correctly. Maybe try uninstalling and then reinstalling it. As a second option, try Miniconda. If all else fails, you can use the Docker image.
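
For the Miniconda route, something like this should work (a sketch, untested on Lambda; the installer URL is Miniconda's standard latest-Linux link):

# install Miniconda into the home directory and create an isolated env,
# so pip never touches the apt-managed system packages
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
source "$HOME/miniconda3/bin/activate"
conda create -y -n axolotl python=3.9
conda activate axolotl
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118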

ctjlewis commented 1 year ago

Yes, this is on a Lambda Labs H100 instance. I tried that, but I can't get torch to uninstall. Thanks for the quick response as well.

ubuntu@209-20-159-38:~/axolotl$ pip uninstall torch

Found existing installation: torch 2.0.1
Uninstalling torch-2.0.1:
  Would remove:
    /usr/bin/torchrun
    /usr/lib/python3/dist-packages/functorch
    /usr/lib/python3/dist-packages/nvfuser
    /usr/lib/python3/dist-packages/torch
    /usr/lib/python3/dist-packages/torch-2.0.1.egg-info
    /usr/lib/python3/dist-packages/torchgen
Proceed (Y/n)? y
ERROR: Exception:
Traceback (most recent call last):
  File "/usr/lib/python3.9/shutil.py", line 825, in move
    os.rename(src, real_dst)
PermissionError: [Errno 13] Permission denied: '/usr/bin/torchrun' -> '/tmp/pip-uninstall-nd6b1zrl/torchrun'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/cli/base_command.py", line 169, in exc_logging_wrapper
    status = run_func(*args)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/commands/uninstall.py", line 105, in run
    uninstall_pathset = req.uninstall(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/req/req_install.py", line 680, in uninstall
    uninstalled_pathset.remove(auto_confirm, verbose)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/req/req_uninstall.py", line 381, in remove
    moved.stash(path)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/req/req_uninstall.py", line 272, in stash
    renames(path, new_path)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/pip/_internal/utils/misc.py", line 313, in renames
    shutil.move(old, new)
  File "/usr/lib/python3.9/shutil.py", line 846, in move
    os.unlink(src)
PermissionError: [Errno 13] Permission denied: '/usr/bin/torchrun'

ubuntu@209-20-159-38:~/axolotl$ sudo pip uninstall torch

Found existing installation: torch 2.0.1
Not uninstalling torch at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'torch'. No files were found to uninstall.

ubuntu@209-20-159-38:~/axolotl$ sudo -H pip uninstall torch

Found existing installation: torch 2.0.1
Not uninstalling torch at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'torch'. No files were found to uninstall.
NanoCode012 commented 1 year ago

Hey, it seems like the permissions somehow got stuck. I would recommend just booting a fresh instance to start from scratch, and then using Miniconda to prevent this issue. I don't think you need to use sudo at all.

Lastly, just an FYI: I also have issues with H100s on Lambda Labs, particularly with bitsandbytes and xformers, in case you're using those!

ctjlewis commented 1 year ago

I see what's happening: PyTorch is being provided by apt packages:

ubuntu@209-20-159-38:~/axolotl$ dpkg -l | grep torch
ii  python3-torch-cuda                     2.0.1+ds-0lambda0.20.04.1                 amd64        Tensors and Dynamic neural networks GPU accelerated (Python 3)
ii  python3-torchvision-cuda               0.15.1-0lambda0.20.04.1                   amd64        Image and video datasets and models for PyTorch (Python 3, CUDA)
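
If we want the apt copies gone rather than just shadowed, removing them through the package manager should do it (package names taken from the dpkg output above):

# remove the Lambda-provided apt builds, then reinstall from the PyTorch index
sudo apt remove -y python3-torch-cuda python3-torchvision-cuda
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
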
ctjlewis commented 1 year ago

We should be good to wipe these out and handle it via the PyTorch install instructions, right?

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Which is already in the script. Let's see.

NanoCode012 commented 1 year ago

You can pass -U to upgrade, or --ignore-installed to force the install. I was planning to add that flag but didn't get to it.
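
Concretely, the torch line in the script would become something like:

# --ignore-installed makes pip install its own copy even though apt's
# torch 2.0.1 already sits in /usr/lib/python3/dist-packages
pip3 install --ignore-installed torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118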

ctjlewis commented 1 year ago

We got it 😎

ctjlewis commented 1 year ago

Will finalize this tomorrow:

https://gist.github.com/ctjlewis/7540d88f4ddb93d36e7515fb1b911833

ctjlewis commented 1 year ago

Here are my first few minutes of training loss with xformers disabled (we need PyTorch nightly for the H100, and xformers can't be used with PyTorch nightly, so it has to be disabled):

[screenshot: training loss curve]

That seems like an issue, but I'll give it an hour or so.

Update

Oh no...

[screenshots: diverging training loss]

This is just the regular OpenLLaMA 13B config with xformers disabled.

ctjlewis commented 1 year ago

@NanoCode012 Hate to bump, but what do you make of that training loss?

The only config change is that xformers was disabled, because it was incompatible with PyTorch nightly.

NanoCode012 commented 1 year ago

Use bf16 @ctjlewis

Regarding PyTorch, I don't think you need nightly. Though if there's any new improvement in nightly, please do tell.

ctjlewis commented 1 year ago

> Use bf16 @ctjlewis
>
> Regarding PyTorch, I don't think you need nightly. Though if there's any new improvement in nightly, please do tell.

I had to use nightly to get rid of errors about H100 support (sm_90 compute capability).
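
For reference, the nightly install looks roughly like this (the index URL follows PyTorch's usual nightly pattern):

# nightly wheels include sm_90 (H100) kernels; --pre opts in to pre-release builds
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118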

With bf16 I ran into that `__func__` error. I tried all the config permutations per the issue about it:

  1. bf16: true in config AND --mixed-precision bfloat16 in accelerate launch command (what I believed to be solution)
  2. bf16: true in config ONLY
  3. --mixed-precision bfloat16 in accelerate launch command ONLY

But couldn't get rid of that error.

NanoCode012 commented 1 year ago

> I had to use nightly to get rid of errors about H100 support (sm_90 compute capability).

I use an older version of transformers, 4.29.2. However, you'll need to modify the source code to comment out the 4bit parts.

ctjlewis commented 1 year ago

> I had to use nightly to get rid of errors about H100 support (sm_90 compute capability).
>
> I use an older version of transformers, 4.29.2. However, you'll need to modify the source code to comment out the 4bit parts.

Or maybe the Axolotl library could provide that instead.

I did get the job started with nightly and H100 support (no xformers), but fixing the bf16 error would have been sufficient, without manually editing package source.

How do I properly set bf16 for the LLaMA run? Config only, command-line argument only, or both?

Once I get this stable I'll PR the scripts in so it's a one-line thing for H100s on Lambda.

NanoCode012 commented 1 year ago

Yes, I think the 4bit part can be improved.

Unfortunately, only those on H100s or similar hardware may be facing this issue, so there haven't been many eyes on it.

I'm not sure why xformers, or more specifically triton, does not work as well on the H100.

You can set bf16: true within the config.
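
For example (a sketch that assumes the example config already has bf16/fp16 keys to flip):

# enable bf16 and disable fp16 in the example config, then relaunch
sed -i 's/^bf16:.*/bf16: true/; s/^fp16:.*/fp16: false/' examples/openllama-3b/config.yml
accelerate launch scripts/finetune.py examples/openllama-3b/config.yml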

ctjlewis commented 1 year ago

When the only config change I make is bf16: true, I get that `__func__` error:

ubuntu@209-20-159-22:~/axolotl$ accelerate launch scripts/finetune.py examples/openllama-3b/config.yml 
2023-06-26 18:24:35.086204: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-26 18:24:35.146711: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-26 18:24:35.921006: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `1`
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2023-06-26 18:24:40.826320: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/lib/x86_64-linux-gnu/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
INFO:root:loading tokenizer... openlm-research/open_llama_3b
Using pad_token, but it is not set yet.
INFO:root:Loading prepared dataset from disk at last_run_prepared/c6b6388039831944360f60b07eaffe22...
INFO:root:Prepared dataset loaded from disk...
INFO:root:loading model and peft_config...
WARNING:accelerate.utils.modeling:The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
INFO:root:Compiling torch model
INFO:root:Starting trainer...
Traceback (most recent call last):
  File "/home/ubuntu/axolotl/scripts/finetune.py", line 352, in <module>
    fire.Fire(train)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ubuntu/axolotl/scripts/finetune.py", line 337, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1756, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1182, in prepare
    result = tuple(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1311, in prepare_model
    torch.autocast(device_type=self.device.type, dtype=torch.bfloat16)(model.forward.__func__), model
AttributeError: 'function' object has no attribute '__func__'
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 941, in launch_command
    simple_launcher(args)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 603, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'scripts/finetune.py', 'examples/openllama-3b/config.yml']' returned non-zero exit status 1.
ctjlewis commented 1 year ago

@NanoCode012 I moved off Lambda Labs; I'm now on 8x A100 80GB.

xformers is off and bf16: true, but the model still does not learn. I wonder if the hyperparameters are messed up?

[screenshot: flat training loss]
NanoCode012 commented 1 year ago

That's weird. That could be it. You're also training for quite a lot of epochs.

ctjlewis commented 1 year ago

@NanoCode012 @winglian Would it be possible for us to check that the vanilla Falcon/OpenLLaMA configs are working? I can't identify what's going wrong; even when tweaking the learning rate, the behavior seems unpredictable.
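
A smoke test on an untouched example config (the same command used elsewhere in this thread) would isolate it:

# unmodified example config; if loss doesn't move here either,
# the problem is in the environment rather than my config or dataset
accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml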

NanoCode012 commented 1 year ago

Just a note, your dataset might also be a factor in this. If you can run the configs successfully, then it's usually the hyperparameters or the dataset.

jphme commented 1 year ago

Any news on this? I've been trying to get Axolotl running on Lambda Labs for hours now without success...

Using the provided script with some fixes, I got additional errors with TensorFlow and psutil, and had to force-reinstall both to get anything running.

Now basically everything fails with various errors.

e.g. with the examples: accelerate launch scripts/finetune.py examples/openllama-3b/qlora.yml

ERROR:root:Exception raised attempting to load model, retrying with AutoModelForCausalLM
ERROR:root:/home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cquantize_blockwise_fp16_nf4
Traceback (most recent call last):

or

accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml

results in

ERROR:root:Exception raised attempting to load model, retrying with AutoModelForCausalLM
ERROR:root:/home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats

I'm not very experienced: are these axolotl issues, or issues in the downstream libraries?

Thanks!

NanoCode012 commented 1 year ago

Hey @jpdus, I think you can raise a separate issue, as this one seems to be solved. The author's last few comments were just about performance.

I see a potential issue with bitsandbytes not being compiled with GPU support, so I would also recommend rechecking that you ran all the steps, and posting your config as well.
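
A quick check, using the diagnostic command bitsandbytes itself suggests:

# a CPU-only .so with undefined CUDA symbols means the wrong binary was loaded;
# these print bitsandbytes' CUDA setup report and torch's view of the GPU
python -m bitsandbytes
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"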

@ctjlewis, feel free to close this one and discuss on Discord or in a separate issue if your training is not improving.

NanoCode012 commented 1 year ago

Closing this as it seems to be solved. Please re-open if the problem comes back.