benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License

RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) #725

Open · vivekpandian08 opened this issue 4 weeks ago

vivekpandian08 commented 4 weeks ago

Description:

I encountered a RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) while running the implicit.gpu.als model on a large dataset. The error may be related to memory handling or CUDA library compatibility issues.

System Information:

Dataset size:

- Number of users: 50 million
- Number of items: 360,000

GPU: NVIDIA A100 (40 GB)
Memory usage: approximately 13,943 MiB / 40,960 MiB
CUDA version: 12.4

Library versions:

- implicit: 0.7.2 (latest)
- torch: 2.5.1

Issue Details: When running the model, the following error occurs:

RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196)
  --> model.fit(weighted_matrix)
  --> self.solver.least_squares(Cui, X, _YtY, Y, self.cg_steps)

This error happens consistently on my large dataset. The GPU has sufficient available memory (about 13,943 MiB is used out of 40,960 MiB). I have attempted the following troubleshooting steps:

Restarted the kernel to clear any lingering memory states.
Checked that CUDA version 12.4 is compatible with the library requirements.
Verified no conflicting paths for CUDA libraries in LD_LIBRARY_PATH.
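A quick back-of-envelope check (my own arithmetic, assuming float32 factors and the 56 factors mentioned later in this thread) confirms that the factor matrices alone fit comfortably in 40 GB, which is consistent with the ~13,943 MiB reported and suggests the failure is not simple memory exhaustion:

```python
# Rough memory estimate for the ALS factor matrices at the sizes
# reported in this issue (float32 = 4 bytes per value).
n_users = 50_000_000
n_items = 360_000
factors = 56           # value used later in this thread
bytes_per_float = 4

user_factors_gib = n_users * factors * bytes_per_float / 2**30
item_factors_gib = n_items * factors * bytes_per_float / 2**30
print(f"user factors: {user_factors_gib:.1f} GiB")    # ~10.4 GiB
print(f"item factors: {item_factors_gib:.3f} GiB")    # ~0.075 GiB
```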

Steps to Reproduce:

Set up a dataset with 50 million users and 360,000 items.
Run implicit.gpu.als on this dataset.
Monitor GPU memory usage and error occurrence.

Expected Behavior: The model should train successfully on the A100 GPU without running into a CUDA error.

Actual Behavior: The CUDA error interrupts training, and the model cannot proceed further.

Additional Notes: This issue may relate to handling large datasets or to CUDA 12.4 compatibility with implicit.gpu. Any insights on possible fixes or workarounds would be greatly appreciated!

jmorlock commented 2 weeks ago

Hi @vivekpandian08, this sounds very much like a problem I had. You can find the description and the fix here:

https://github.com/benfred/implicit/pull/722

You can either apply the fix from that PR (for example by building implicit from my fork) or keep the number of factors small enough that the integer overflow does not occur.

I hope that helps. Cheers Jan

vivekpandian08 commented 2 weeks ago

Hi @jmorlock ,

Thank you for sharing! I appreciate the reference to #722 and the options for handling this issue.

Currently, I’m using 56 latent factors for my model. Could you provide any insights or recommendations on how to determine the optimal number of factors based on the size of the interaction matrix? Any specific guidelines or resources you’d suggest for tuning this parameter effectively?

Thanks again for the help!

jmorlock commented 2 weeks ago

Hi @vivekpandian08 ,

From a theoretical point of view, you should select the parameter set where your model performance is optimal. Here it is best to decide on a metric (like precision@k), do a train-test split (see for example https://benfred.github.io/implicit/api/evaluation.html#implicit.evaluation.train_test_split) and try out different parameter combinations: either manually, by using a grid search (if model fitting does not take too long), or by using a more sophisticated tool like Optuna.
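The manual grid search described above might be structured like the sketch below. The `evaluate` function is a placeholder of my own; in practice it would fit the model and score it with a metric such as `implicit.evaluation.precision_at_k` on the held-out split:

```python
# Sketch of a manual grid search over ALS hyperparameters.
# `evaluate` is a stand-in: replace its body with model fitting on the
# train split plus a ranking metric (e.g. precision@k) on the test split.
from itertools import product

def evaluate(factors, regularization):
    # Placeholder score for demonstration purposes only.
    return -abs(factors - 64) - regularization

grid = {
    "factors": [16, 32, 64],
    "regularization": [0.01, 0.1],
}

best_score, best_params = float("-inf"), None
for factors, reg in product(grid["factors"], grid["regularization"]):
    score = evaluate(factors, reg)
    if score > best_score:
        best_score, best_params = score, (factors, reg)

print(best_params)  # (64, 0.01)
```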

From a practical point of view, if you are stuck with the current version of implicit containing the bug I described, you must select the number of factors so that the integer overflow does not occur.

You can still do a hyperparameter search as explained above but now with this maximum as the upper boundary for the number of factors. In case the optimal value is below that number you are lucky.
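The exact overflow site is not quoted in this thread, so the following is purely an illustration under an assumption: if the overflowing 32-bit index grows like `n_users * factors`, the upper boundary for this dataset can be estimated as below (see PR #722 for the actual fix):

```python
# Hypothetical bound on `factors`, ASSUMING the overflowing 32-bit
# index in implicit/gpu/als.cu scales with n_users * factors.
INT32_MAX = 2**31 - 1
n_users = 50_000_000

max_factors = INT32_MAX // n_users
print(max_factors)  # 42

# Under this assumption, the 56 factors used in this thread overflow:
print(n_users * 56 > INT32_MAX)  # True
```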

I hope that helps. Cheers Jan

vivekpandian08 commented 1 week ago

Hi @jmorlock ,

Thank you for the detailed explanation!

I was already aware of the theoretical approach to finding the optimal number of latent factors, but your practical method for avoiding integer overflow is really helpful. Setting an upper boundary by calculating based on the matrix size makes perfect sense, especially with the current constraints in the implicit version.

Thanks again !

vivekpandian08 commented 1 week ago

Hi @jmorlock ,

I’m trying to clone the repository https://github.com/jmorlock/implicit and build it locally, but I’m encountering an error at this step:

[13/34] Generating CXX source implicit/cpu/_als.cxx

I’ve already uninstalled the existing implicit library from my environment to avoid conflicts. Could you provide any guidance on how to resolve this issue? Are there any specific dependencies or configurations I might be missing?

Thanks in advance for your help!

jmorlock commented 5 days ago

Hi @vivekpandian08, sorry for the late reply.

I am not sure whether I can help you with this error. But I can tell you what I did in order to build implicit.

  1. git clone https://github.com/jmorlock/implicit.git
  2. In implicit/gpu/CMakeLists.txt I added the statement set(CMAKE_CUDA_COMPILER /usr/local/cuda-11.4/bin/nvcc) in line 13 before enable_language(CUDA). The path to nvcc may be different on your machine. I guess that this could also be done by setting environment variables in your console.
  3. I created a virtual environment specifically for building implicit:
    python3.9 -m venv ~/venv/implicit
    source ~/venv/implicit/bin/activate
    pip install --upgrade pip
    pip install -r requirements-dev.txt
    pip install cmake ninja
  4. I built implicit using python setup.py bdist_wheel.

In case of success you will find a .whl file in the dist directory, which you can install with pip in the environment where you want to use implicit. Either uninstall the old version beforehand or pass --force-reinstall.
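The environment-variable alternative mentioned in step 2, together with building and installing the wheel, might look like this (the nvcc path is an example and will differ per machine):

```shell
# Alternative to editing CMakeLists.txt: CMake picks up the CUDA
# compiler from the CUDACXX environment variable.
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc

# Build the wheel, then install it in the target environment.
python setup.py bdist_wheel
pip install --force-reinstall dist/implicit-*.whl
```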

I hope this helps.