Hi, I suspect the torch module was not compiled with support for A100 GPUs. Could you try to either run the task on Yggdrasil, or add the parameter --exclude=gpu[020,022] when requesting resources, to avoid the A100 GPUs? If this solves the problem, we'll have to see whether we can compile this module manually.
Edit: you can check the GPU types of the compute nodes here: https://doc.eresearch.unige.ch/hpc/hpc_clusters#compute_nodes
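As a side note, here is a minimal diagnostic sketch (not from the thread; it only assumes torch's standard CUDA introspection helpers) that confirms this kind of mismatch by comparing the GPU's compute capability with the architectures the installed torch binary was built for:

import torch

# Architectures this torch build ships kernels for, e.g. ['sm_60', 'sm_70'].
built_for = torch.cuda.get_arch_list()
# Compute capability of the first visible GPU; an A100 reports (8, 0).
major, minor = torch.cuda.get_device_capability(0)
arch = f"sm_{major}{minor}"
print(f"{torch.cuda.get_device_name(0)} is {arch}; build covers {built_for}")
if arch not in built_for:
    print("This torch build has no compiled kernels for this GPU.")

If the reported architecture (sm_80 for an A100) is missing from the list, the module would indeed need a rebuild with that architecture enabled.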
Should I add this parameter to the submission script, then? It should start with:
#!/usr/bin/env bash
# Run on the shared GPU partition with one GPU for at most one hour.
#SBATCH --partition=shared-gpu
#SBATCH --time=01:00:00
#SBATCH --gpus=1
# Write the log to kraken-<jobid>.out; --mem=0 requests all of the
# node's memory.
#SBATCH --output=kraken-%j.out
#SBATCH --mem=0
# Skip the A100 nodes the torch module was not built for.
#SBATCH --exclude=gpu[020,022]
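Below the directives, a pre-flight check (my own sketch, assuming the script then launches a Python training entry point with torch available) can make the job fail fast if it still lands on an unsupported GPU:

import os
import torch

# Hypothetical sanity check for the top of the training script: show which
# node Slurm assigned, then force one tiny kernel launch so an unsupported
# GPU architecture fails immediately instead of mid-training.
print("nodes:", os.environ.get("SLURM_JOB_NODELIST", "<not in a Slurm job>"))
print("gpu  :", torch.cuda.get_device_name(0))
torch.ones(1, device="cuda").sum()  # raises "no kernel image ..." on a bad build
print("CUDA kernel launch OK")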
Yes, or use Yggdrasil, since torch seems to work on all of Yggdrasil's GPUs.
Thank you very much, everything works on Yggdrasil.
I just trained my first model 🥳!
I had this error when I was trying to train my model: