radhacr opened 5 months ago
Inference should really be done with a single process using python -m axolotl.cli.inference ... instead of accelerate.
Same thing happens even without accelerate.
$ python -m axolotl.cli.inference examples/gemma/qlora.yml --lora_dir="axolotl-gemma-aplaca-qlora/"
[2024-04-10 13:45:56,207] [INFO] [datasets.<module>:58] [PID:2747] PyTorch version 2.2.2 available.
[2024-04-10 13:45:58,395] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[axolotl ASCII art banner]
[2024-04-10 13:46:00,709] [DEBUG] [axolotl.normalize_config:79] [PID:2747] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-04-10 13:46:01,031] [INFO] [axolotl.normalize_config:182] [PID:2747] [RANK:0] GPU memory usage baseline: 0.000GB (+0.849GB misc)
[2024-04-10 13:46:01,033] [INFO] [axolotl.common.cli.load_model_and_tokenizer:50] [PID:2747] [RANK:0] loading tokenizer... mhenrichsen/gemma-7b
[2024-04-10 13:46:02,413] [DEBUG] [axolotl.load_tokenizer:252] [PID:2747] [RANK:0] EOS: 1 / <eos>
[2024-04-10 13:46:02,413] [DEBUG] [axolotl.load_tokenizer:253] [PID:2747] [RANK:0] BOS: 2 / <bos>
[2024-04-10 13:46:02,413] [DEBUG] [axolotl.load_tokenizer:254] [PID:2747] [RANK:0] PAD: 0 / <pad>
[2024-04-10 13:46:02,413] [DEBUG] [axolotl.load_tokenizer:255] [PID:2747] [RANK:0] UNK: 3 / <unk>
[2024-04-10 13:46:02,413] [INFO] [axolotl.load_tokenizer:266] [PID:2747] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-04-10 13:46:02,413] [INFO] [axolotl.common.cli.load_model_and_tokenizer:52] [PID:2747] [RANK:0] loading model and (optionally) peft_config...
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`. If you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
[2024-04-10 13:46:06,189] [INFO] [accelerate.utils.modeling.get_balanced_memory:965] [PID:2747] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:37<00:00, 24.41s/it]
[2024-04-10 13:47:44,972] [INFO] [axolotl.load_model:654] [PID:2747] [RANK:0] GPU memory usage after model load: 5.216GB (+0.028GB cache, +1.984GB misc)
[2024-04-10 13:47:44,982] [INFO] [axolotl.load_model:700] [PID:2747] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2024-04-10 13:47:44,985] [INFO] [axolotl.load_model:709] [PID:2747] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-04-10 13:47:44,988] [INFO] [axolotl.load_lora:853] [PID:2747] [RANK:0] found linear modules: ['gate_proj', 'q_proj', 'k_proj', 'up_proj', 'down_proj', 'v_proj', 'o_proj']
trainable params: 100,007,936 || all params: 8,637,688,832 || trainable%: 1.1578089688702506
[2024-04-10 13:47:46,191] [INFO] [axolotl.load_model:754] [PID:2747] [RANK:0] GPU memory usage after adapters: 5.588GB (+2.720GB cache, +1.984GB misc)
================================================================================
Give me an instruction (Ctrl + D to submit):
Give me some health tips.
========================================
<bos>Give me some health
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/radhachitta/axolotl/src/axolotl/cli/inference.py", line 36, in <module>
fire.Fire(do_cli)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/radhachitta/axolotl/src/axolotl/cli/inference.py", line 32, in do_cli
do_inference(cfg=parsed_cfg, cli_args=parsed_cli_args)
File "/home/radhachitta/axolotl/src/axolotl/cli/__init__.py", line 206, in do_inference
generated = model.generate(
File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 1190, in generate
outputs = self.base_model.generate(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1577, in generate
result = self._sample(
File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2733, in _sample
outputs = self(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 1098, in forward
outputs = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 923, in forward
layer_outputs = decoder_layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 643, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 341, in forward
value_states = self.v_proj(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/peft/tuners/lora/bnb.py", line 458, in forward
result = result.clone()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
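As an aside, the transformers warning earlier in this log (approximate vs. exact GeLU) can be silenced by pinning Gemma's activation explicitly. A minimal sketch, assuming the mhenrichsen/gemma-7b checkpoint from this run; this is unrelated to the CUDA fault itself:

from transformers import AutoConfig, AutoModelForCausalLM

# Pin Gemma's activation up front so the auto-switch (and its warning) does not fire.
# "hidden_activation" is the field transformers PR #29402 added to replace "hidden_act".
config = AutoConfig.from_pretrained("mhenrichsen/gemma-7b")
config.hidden_activation = "gelu_pytorch_tanh"  # approximate GeLU (the new default)
# config.hidden_activation = "gelu"             # legacy exact GeLU, if you need it

model = AutoModelForCausalLM.from_pretrained("mhenrichsen/gemma-7b", config=config)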
I get the same issue with Qwen.
Same issue with Codestral, right in the middle of an epoch:
{'loss': 0.6265, 'grad_norm': 8.632710456848145, 'learning_rate': 0.00012, 'epoch': 0.48}
25%|███████████ | 6/24 [06:05<18:30, 61.71s/it]
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/torch/random.py", line 165, in fork_rng
yield
File "/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 271, in backward
outputs = ctx.run_function(*detached_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/work/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 611, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/work/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 131, in flashattn_forward
query_states = self.q_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/peft/tuners/lora/bnb.py", line 217, in forward
result = self.base_layer(x, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 797, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 556, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 321, in forward
CA, CAt, SCA, SCAt, coo_tensorA = F.double_quant(A.to(torch.float16), threshold=state.threshold)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/bitsandbytes/functional.py", line 2535, in double_quant
nnz = nnz_row_ptr[-1].item()
^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/jovyan/work/axolotl/src/axolotl/cli/train.py", line 70, in <module>
fire.Fire(do_cli)
File "/opt/conda/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/work/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
return do_train(parsed_cfg, parsed_cli_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/work/axolotl/src/axolotl/cli/train.py", line 66, in do_train
return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/work/axolotl/src/axolotl/train.py", line 170, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 3250, in training_step
self.accelerator.backward(loss)
File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2125, in backward
loss.backward(**kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/opt/conda/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/lib/python3.11/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 257, in backward
with torch.random.fork_rng(
File "/opt/conda/lib/python3.11/contextlib.py", line 158, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/conda/lib/python3.11/site-packages/torch/random.py", line 169, in fork_rng
device_mod.set_rng_state(device_rng_state, device)
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 75, in set_rng_state
_lazy_call(cb)
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 229, in _lazy_call
callable()
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 73, in cb
default_generator.set_state(new_state_copy)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
If you add CUDA_LAUNCH_BLOCKING=1 to your env vars, can you report the updated stack trace?
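For reference, a minimal sketch of that setup; the key point is that the variable must be in the environment before torch initializes CUDA, so export it in the shell (CUDA_LAUNCH_BLOCKING=1 python -m axolotl.cli.train config.yml) or set it at the very top of the entrypoint:

import os

# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the traceback
# points at the kernel that actually faulted instead of a later, unrelated call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before any CUDA work

import torch  # imported after the env var so the setting takes effect

assert torch.cuda.is_available()
# ...run the failing training or inference step here; the crash site is now exact.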
Last update: I don't know what happened, but it is working now! Thanks @winglian, it might be the env var you suggested plus disabling sample_packing/pad_to_sequence_len!
@winglian thanks for your reply; it seems to crash right away with this setting. Please let me know how else I can assist.
FYI, I'm experimenting on an A100 80GB, doing 8-bit LoRA with rsLoRA enabled. I'm also modifying the default tokenizer to suit ChatML.
Update: it seems to run after setting
sample_packing: false
eval_sample_packing: false
pad_to_sequence_len: false
I don't know how well it will run, though.
Also, I have some extra data I can't use due to an error. It is ChatML-formatted, so I don't know what is wrong:
[2024-06-07 02:56:22,334] [INFO] [axolotl.callbacks.on_train_begin:785] [PID:2737] [RANK:0] The Axolotl config has been saved to the WandB run under files.
0%| | 0/50 [00:00<?, ?it/s][2024-06-07 02:56:22,336] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2737] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 641630
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py:316: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/jovyan/work/axolotl/src/axolotl/cli/train.py", line 70, in <module>
fire.Fire(do_cli)
File "/opt/conda/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/work/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
return do_train(parsed_cfg, parsed_cli_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/work/axolotl/src/axolotl/cli/train.py", line 66, in do_train
return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/work/axolotl/src/axolotl/train.py", line 170, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 3250, in training_step
self.accelerator.backward(loss)
File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2125, in backward
loss.backward(**kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/opt/conda/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/lib/python3.11/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/opt/conda/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/lib/python3.11/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 321, in backward
_flash_attn_varlen_backward(
File "/opt/conda/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 181, in _flash_attn_varlen_backward
dq, dk, dv, softmax_d, = flash_attn_cuda.varlen_bwd(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
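Since the ChatML-formatted extra data mentioned above also fails, a quick structural check of the JSONL may help narrow things down. A hedged sketch: the filename comes from the config below, and the "conversations"/"from"/"value" keys are the conventional sharegpt layout, which is an assumption about this data:

import json

def check_sharegpt_jsonl(path: str) -> None:
    # Flag rows that are not valid JSON or do not look like sharegpt conversations.
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                print(f"line {lineno}: invalid JSON ({exc})")
                continue
            turns = row.get("conversations")
            if not isinstance(turns, list) or not turns:
                print(f"line {lineno}: missing or empty 'conversations' list")
                continue
            for i, turn in enumerate(turns):
                if not isinstance(turn, dict) or "from" not in turn or "value" not in turn:
                    print(f"line {lineno}, turn {i}: expected 'from' and 'value' keys")

check_sharegpt_jsonl("long_sys_msg_all_data_v000.jsonl")  # local file named in the config below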
I was able to train the model with the env var @winglian suggested:
CUDA_LAUNCH_BLOCKING=1
and the following config:
base_model: mistralai/Codestral-22B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true
load_in_8bit: true
load_in_4bit: false
strict: false
datasets:
- path: long_sys_msg_all_data_v000.jsonl
conversation: chatml
type: sharegpt
test_datasets:
- path: 000refined_neo_dataset_v2eval.jsonl
split: train
conversation: chatml
type: sharegpt
chat_template: chatml
adapter: lora
peft_use_rslora: true
lora_r: 64
lora_alpha: 32
lora_dropout: 0
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
dataset_prepared_path:
val_set_size: 0
output_dir: Colibri22bOut
sequence_len: 4096
sample_packing: false
eval_sample_packing: false
pad_to_sequence_len: false
save_safetensors: true
wandb_project: Colibri22b
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
save_total_limit: 1
gradient_accumulation_steps: 6
micro_batch_size: 1
eval_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: True
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:
save_strategy: "no"
warmup_steps: 10
evals_per_epoch: 2
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
special_tokens:
bos_token: "<s>"
eos_token: "<|im_end|>"
unk_token: "<unk>"
lora_modules_to_save:
- embed_tokens
- lm_head
tokens:
- "<|im_start|>"
Eval loss looks delicious! This model is hella smart!! [eval loss chart]
And the train loss: [train loss chart]
The training dataset is only 352 refined examples.
Hmm... when I use the model it is not good; I will try to figure out why next time.
Expected Behavior
Training with the Gemma QLoRA config in the examples runs fine, but inference does not produce the expected response; the generate function runs into an error instead.
Current behaviour
Inference runs into a CUDA error (an illegal memory access).
Possible solution
No response
Python Version
3.10
axolotl branch-commit
main/4d6490b