vivekpandian08 opened this issue 4 weeks ago
Hi @vivekpandian08, this sounds very much like a problem I had. You can find the description and the fix here:
https://github.com/benfred/implicit/pull/722
You can either:
I hope that helps. Cheers Jan
Hi @jmorlock ,
Thank you for sharing! I appreciate the reference to #722 and the options for handling this issue.
Currently, I’m using 56 latent factors for my model. Could you provide any insights or recommendations on how to determine the optimal number of factors based on the size of the interaction matrix? Any specific guidelines or resources you’d suggest for tuning this parameter effectively?
Thanks again for the help!
Hi @vivekpandian08 ,
From a theoretical point of view, you should select the parameter set for which your model performance is optimal. It is best to decide on a metric (such as precision@k), do a train-test split (see for example https://benfred.github.io/implicit/api/evaluation.html#implicit.evaluation.train_test_split) and try out different parameter combinations: either manually, with a grid search (if model fitting does not take too long), or with a more sophisticated tool like Optuna.
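A minimal sketch of what such a manual grid search could look like. Note that `evaluate` here is a placeholder of my own: in practice it would fit an `implicit.als.AlternatingLeastSquares` model on the train split and return, say, `implicit.evaluation.precision_at_k` on the test split.

```python
# Sketch of a manual grid search over ALS hyperparameters.
# evaluate() is a dummy stand-in: in a real run it would fit an
# AlternatingLeastSquares model on the train split and return a
# held-out metric such as precision@k.
def evaluate(factors, regularization):
    # placeholder score for illustration only
    return -abs(factors - 64) - regularization

grid = [(f, r) for f in (16, 32, 64, 128) for r in (0.01, 0.1)]
best_factors, best_reg = max(grid, key=lambda p: evaluate(*p))
print(best_factors, best_reg)  # -> 64 0.01
```

With a real `evaluate`, the same loop picks the best-performing combination on your held-out data.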
From a practical point of view, if you are stuck with the current version of implicit featuring the bug I described, you must select the number of factors such that the integer overflow does not occur:
You can still do a hyperparameter search as explained above, but now with this maximum as the upper bound for the number of factors. If the optimal value lies below that number, you are lucky.
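To illustrate what such an upper bound could look like: the sketch below assumes, based on my reading of the discussion, that the overflow happens once `max(n_users, n_items) * factors` exceeds the 32-bit signed integer maximum. The exact condition in implicit may differ, so treat this as a rough estimate rather than the library's actual rule.

```python
# Hypothetical upper bound on the number of latent factors, assuming the
# integer overflow occurs when max(n_users, n_items) * factors exceeds
# the 32-bit signed integer maximum (2**31 - 1). The real condition in
# implicit may differ; this is a sketch only.
INT32_MAX = 2**31 - 1

def max_factors(n_users, n_items):
    return INT32_MAX // max(n_users, n_items)

# With a matrix of 50 million users and 360,000 items:
print(max_factors(50_000_000, 360_000))  # -> 42
```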
I hope that helps. Cheers Jan
Hi @jmorlock ,
Thank you for the detailed explanation!
I was already aware of the theoretical approach to finding the optimal number of latent factors, but your practical method for avoiding the integer overflow is really helpful. Setting an upper bound calculated from the matrix size makes perfect sense, especially with the constraints in the current implicit version.
Thanks again !
Hi @jmorlock ,
I’m trying to clone the repository https://github.com/jmorlock/implicit and build it locally, but I’m encountering an error at this step:
```
[13/34] Generating CXX source implicit/cpu/_als.cxx
```
I’ve already uninstalled the existing implicit library from my environment to avoid conflicts. Could you provide any guidance on how to resolve this issue? Are there any specific dependencies or configurations I might be missing?
Thanks in advance for your help!
Hi @vivekpandian08, sorry for the late reply.
I am not sure whether I can help you with this error. But I can tell you what I did in order to build implicit.
```
git clone https://github.com/jmorlock/implicit.git
```

In `implicit/gpu/CMakeLists.txt` I added the statement

```
set(CMAKE_CUDA_COMPILER /usr/local/cuda-11.4/bin/nvcc)
```

in line 13, before `enable_language(CUDA)`. The path to `nvcc` may be different on your machine. I guess that this could also be done by setting environment variables in your console.

```
python3.9 -m venv ~/venv/implicit
source ~/venv/implicit/bin/activate
pip install --upgrade pip
pip install -r requirements-dev.txt
pip install cmake ninja
python setup.py bdist_wheel
```

In case of success you will find a `.whl` file in the `dist` directory, which you can install using pip in the environment where you want to use implicit. Either uninstall the old version beforehand or specify `--force-reinstall`.
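As an aside on the environment-variable route: CMake generally honors the `CUDACXX` variable when picking the CUDA compiler, so something like the fragment below may avoid editing `CMakeLists.txt`. I have not verified this with implicit's build specifically, so treat it as a sketch.

```shell
# Point CMake at the desired nvcc via environment variable
# (path may differ on your machine); set this before running the build.
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc
```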
I hope this helps.
Description:
I encountered a RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) while running the implicit.gpu.als model on a large dataset. The error may be related to memory handling or CUDA library compatibility issues.
System Information:
- Dataset size: 50 million users, 360,000 items
- GPU: NVIDIA A100 (40 GB)
- Memory usage: approximately 13,943 MiB / 40,960 MiB
- CUDA version: 12.4
- Library versions: implicit 0.7.2 (latest), torch 2.5.1
Issue Details: When running the model, the following error occurs:
```
RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196)
  --> model.fit(weighted_matrix)
  --> self.solver.least_squares(Cui, X, _YtY, Y, self.cg_steps)
```
This error happens consistently on my large dataset. The GPU has sufficient available memory (about 13,943 MiB is used out of 40,960 MiB). I have attempted the following troubleshooting steps:
Steps to Reproduce:
Expected Behavior: The model should train successfully on the A100 GPU without running into a CUDA error.
Actual Behavior: The CUDA error interrupts training, and the model cannot proceed further.
Additional Notes: This issue may relate to handling large datasets or to CUDA 12.4 compatibility with implicit.gpu. Any insights on possible fixes or workarounds would be greatly appreciated!