Kernel Version Compatibility Issue During Sentence Transformers Fine-Tuning

okamiRvS commented 4 months ago

Issue:

Description

I've been following the official tutorial to finetune an embedding model using Sentence Transformers v3. While setting up the training as described, I encountered a critical warning related to the kernel version that may affect the training process.

Error Details

When initiating the SentenceTransformerTrainer, the process detects an incompatible kernel version which is below the recommended minimum, potentially causing the training to hang:

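# Code snippet that generates the warning
# (model, args, train_dataset, eval_dataset, loss and dev_evaluator are defined earlier, as in the tutorial)
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()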

# Console output
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

[ 4/1563 00:02 < 28:14, 0.92 it/s, Epoch 0.00/1]
Step Training Loss Validation Loss
RuntimeError Traceback (most recent call last)
Cell In[8], line 10
1 # 7. Create a trainer & train
2 trainer = SentenceTransformerTrainer(
3 model=model,
4 args=args,
(...)
8 evaluator=dev_evaluator,
9 )
---> 10 trainer.train()

File ~/miniconda3/envs/finetuning_3_x_sentence_transformers/lib/python3.10/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1883 hf_hub_utils.enable_progress_bars()
1884 else:
-> 1885 return inner_training_loop(
1886 args=args,
1887 resume_from_checkpoint=resume_from_checkpoint,
1888 trial=trial,
1889 ignore_keys_for_eval=ignore_keys_for_eval,
1890 )

File ~/miniconda3/envs/finetuning_3_x_sentence_transformers/lib/python3.10/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2213 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
2215 with self.accelerator.accumulate(model):
...
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
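
The traceback suggests passing CUDA_LAUNCH_BLOCKING=1 so that the failing kernel is reported at the call that actually launched it. A minimal sketch of doing this from inside a notebook, assuming the variable is set before torch is imported (the helper name `cuda_launch_blocking_enabled` is hypothetical and only there for illustration):

```python
import os

# Set this before the first CUDA call; the safest place is before importing
# torch, so it is placed at the very top of the notebook or script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def cuda_launch_blocking_enabled() -> bool:
    """Return True when synchronous CUDA kernel launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


print("CUDA_LAUNCH_BLOCKING enabled:", cuda_launch_blocking_enabled())

# Only after this point: import torch, build the trainer, call trainer.train().
```

With synchronous launches the run is slower, so this is probably only worth enabling while reproducing the error.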

Environment Details

Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)
Kernel Version: Linux 4.18.0-513.24.1.el8_9.x86_64
Python Version: Python 3.10
Dependencies:
- torch (nightly build)
- torchvision
- torchaudio
- sentence-transformers==3.0.1
- datasets==2.20.0
- accelerate==0.31.0
- jupyterlab
- ipywidgets

I set up my environment with the following commands:

conda create -n=finetuning_3_x_sentence_transformers python=3.10 -y
conda activate finetuning_3_x_sentence_transformers
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
pip install sentence-transformers==3.0.1 datasets==2.20.0 accelerate==0.31.0 jupyterlab ipywidgets
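
To confirm that the resulting environment really contains the pinned versions, something like the following could be run inside it. This is only a sketch using the standard library; the helper names `installed_version` and `compare_pins` are made up for this example, and torch is reported separately because the nightly build has no fixed pin.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins copied from the pip command above.
EXPECTED = {
    "sentence-transformers": "3.0.1",
    "datasets": "2.20.0",
    "accelerate": "0.31.0",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_pins(expected: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str], bool]]:
    """Map each package to (expected version, installed version, whether they match)."""
    result = {}
    for name, want in expected.items():
        have = installed_version(name)
        result[name] = (want, have, have == want)
    return result


for name, (want, have, ok) in compare_pins(EXPECTED).items():
    print(f"{name}: expected {want}, installed {have}, match={ok}")

# The nightly torch build has no pin, so just show what is installed.
print("torch:", installed_version("torch"))
```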

Questions

Is it mandatory to update the Linux kernel to version 5.5.0 or higher to avoid potential training issues, or are there any recommended workarounds that can ensure compatibility with the existing kernel version?
Could the project documentation be updated to include specific kernel version requirements or suggestions for settings to handle such compatibility issues?
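
For context, the warning text describes 5.5.0 as a "recommended minimum", so it reads as advisory rather than as a hard requirement. The message appears to come from the accelerate dependency rather than from Sentence Transformers itself; that is an assumption based on its wording and is not confirmed in this thread. A minimal sketch of the kind of comparison involved, with hypothetical helper names and not the library's actual code:

```python
import platform
import re

MIN_KERNEL = (5, 5, 0)  # the value quoted in the warning text


def parse_kernel_release(release: str):
    """Extract (major, minor, patch) from a string like '4.18.0-513.24.1.el8_9.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def kernel_below_minimum(release: str, minimum=MIN_KERNEL) -> bool:
    """Return True when the release parses and is older than the minimum."""
    version = parse_kernel_release(release)
    return version is not None and version < minimum


# The kernel from the environment details in this issue: 4.18.0 < 5.5.0.
print(kernel_below_minimum("4.18.0-513.24.1.el8_9.x86_64"))

# Check the machine this runs on (only meaningful on Linux).
if platform.system() == "Linux":
    print(platform.release(), kernel_below_minimum(platform.release()))
```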

Additional Information on CUDA Compatibility

Based on the NVIDIA CUDA Installation Guide for Linux 12.4, the "Native Linux Distribution Support in CUDA 12.4" table explicitly lists support for Red Hat Enterprise Linux 8.y (where y ≤ 9) with kernel version 4.18.0-513. This appears to create a strong inconsistency because, while NVIDIA supports this kernel version for CUDA 12.4, the Sentence Transformers training process recommends a kernel upgrade to at least version 5.5.0 to prevent potential issues.

This discrepancy is concerning as it suggests a possible incompatibility between the recommended setups for CUDA and the Sentence Transformers library. It's crucial to clarify whether the kernel upgrade recommendation can be reconciled with NVIDIA’s supported configurations, or if there are additional settings or modifications recommended for users in similar environments.

Could the documentation or the error messaging in the Sentence Transformers library be adjusted to address this potential conflict, providing clear guidance for users operating under NVIDIA’s supported kernels?

Thank you for any advice or updates you can provide to help address this kernel version issue!

tomaarsen commented 4 months ago

Hello!

Judging by the PyTorch Getting Started page, CUDA 12.4 support still seems to be premature and not fully ready.

Personally, I am also on Torch compiled with CUDA 12.1. My recommendation would be to install that instead, and see if you have more luck there. That saves you from having to upgrade your Linux kernel (which Sentence Transformers would rather not make mandatory).
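
To see which CUDA version the installed Torch was compiled with, a quick check along these lines should work. It is a rough sketch: the helper names are hypothetical, and the install command in the comment is an assumption adapted from the nightly command in the original post rather than something stated in this thread.

```python
# Assumed install command for a CUDA 12.1 build (adapted from the nightly one above):
#   pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121


def torch_cuda_build():
    """Return the CUDA version torch was compiled with, or None if torch is missing or CPU-only."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda  # a string such as "12.1", or None for CPU-only builds


def matches_recommended(cuda_version, recommended: str = "12.1") -> bool:
    """Return True when the reported CUDA version starts with the recommended one."""
    return cuda_version is not None and cuda_version.startswith(recommended)


build = torch_cuda_build()
print("torch CUDA build:", build, "| matches 12.1:", matches_recommended(build))
```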

Tom Aarsen

okamiRvS commented 1 month ago

I no longer get the error. All I did was reinstall the same environment more recently. Thanks for the support.

UKPLab / sentence-transformers