SineZHAN / deepALplus

This is a toolbox for Deep Active Learning, an extension from previous work https://github.com/ej0cl6/deep-active-learning (DeepAL toolbox).
MIT License

Stuck at net to device #4

Open LuGeNat opened 1 year ago

LuGeNat commented 1 year ago

Dear Repository owners,

I would like to use your deepALplus to run experiments with Deep Active Learning. However, I always get stuck at line 23 in nets.py. It takes ages to execute, although it should normally take milliseconds: `self.clf = self.net(dim = dim, pretrained = self.params['pretrained'], num_classes = self.params['num_class']).to(self.device)`. The script fails after ~20 min with `RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED`. Do you have any recommendations? Executing the code in https://github.com/ej0cl6/deep-active-learning works fine for me. Could this be because of the cuDNN and CUDA versions? Are there certain versions one has to use?

I installed using conda. I used cudnn/8.0_v7.0 and cuda/11.0.2 as well as cudnn/11.7_v8.6 and cuda/11.7.0, and had the same behavior with both.

Thank you.

SineZHAN commented 1 year ago

Possible reasons:

  1. cuDNN is not installed.
  2. The PyTorch and CUDA versions do not match. Specifically, the CUDA version in the environment differs from the CUDA version PyTorch was compiled against.
  3. The graphics card is incompatible with the installed CUDA and cuDNN versions. For example, the 2080 requires at least CUDA 9.2 to run well.
  4. The system memory is insufficient and the DataLoader processes too much data at a time.
  5. The video memory is insufficient (OOM). When a program calls cuDNN with insufficient video memory, it sometimes reports a cuDNN error instead of an OOM error.

If the cuDNN error is reported as soon as the code starts running, the cause is most likely one of the first three reasons. If the error occurs after the code has been running for a while, it is most likely 4 or 5.
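
To narrow down the version-related causes, a quick check like the one below may help. This is only a rough sketch and not part of deepALplus: the helper names `capability_in_arch_list` and `collect_env_report` are made up for illustration, and the architecture check is a simple exact match that can report a false negative when a build relies on PTX or same-major binary compatibility. It prints the CUDA version PyTorch was built with, whether cuDNN is visible, and whether the GPU's compute capability appears among the architectures the PyTorch build was compiled for.

```python
def capability_in_arch_list(capability, arch_list):
    """Return True if a (major, minor) compute capability has an exact sm_XY entry.

    Hypothetical helper: an exact match is a simplification, so a False
    result is a hint to investigate, not proof of incompatibility.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def collect_env_report():
    """Gather version and device information as a dict of plain values."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}

    report = {
        "torch": torch.__version__,
        # CUDA version PyTorch was compiled against (None for CPU-only builds).
        "cuda_built_with": torch.version.cuda,
        "cudnn_available": torch.backends.cudnn.is_available(),
        # cuDNN version as an integer, or None if cuDNN is not available.
        "cudnn_version": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        report["device"] = torch.cuda.get_device_name(0)
        report["capability"] = capability
        report["arch_list"] = arch_list
        report["arch_supported"] = capability_in_arch_list(capability, arch_list)
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

If `cuda_built_with` differs from the CUDA toolkit loaded in the environment, or `arch_supported` comes out as False, reasons 2 and 3 are worth looking at first.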