AlfredXiangWu / LightCNN

A Light CNN for Deep Face Representation with Noisy Labels, TIFS 2018
https://arxiv.org/abs/1511.02683
MIT License
1.01k stars 166 forks

THCudaCheck FAIL #5

Closed Bananajia closed 7 years ago

Bananajia commented 7 years ago

Hi @AlfredXiangWu ,

My training data file looks like this:

/home/jonanza/datasets/casia/CASIA144/3599667/037.png 0
/home/jonanza/datasets/casia/CASIA144/1466221/082.png 1
/home/jonanza/datasets/casia/CASIA__144/1466221/044.png 1
...

During training it broke here:

Epoch: [0][3500/3568] Time 0.108 (0.112) Data 0.000 (0.000) Loss 8.8549 (9.2667) Prec@1 0.000 (0.000) Prec@5 0.000 (0.008)

/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
(the same assertion is repeated for threads [16,0,0] through [31,0,0])
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THC/generic/THCStorage.c line=32 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 279, in <module>
    main()
  File "train.py", line 136, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "train.py", line 175, in train
    print(loss.data[0], input.size(0))
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THC/generic/THCStorage.c:32

I searched for the problem, and people said it means an index is out of range, but I have no idea how to fix it Orz

AlfredXiangWu commented 7 years ago

It seems to be a device-side assert error. This issue may help you: https://github.com/pytorch/pytorch/issues/1010

Maybe there is an error in your label settings. For example, the number of classes is 10575, but you pass --num_classes=10574.
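One way to check this before training is to scan the list file and compare every label against the class count. The snippet below is only a sketch: `check_labels` and the sample paths are hypothetical names made up for illustration (they are not part of train.py), and it assumes each line has the form "<image path> <integer label>" as in the list above.

```python
def check_labels(list_lines, num_classes):
    """Return (min_label, max_label, bad_entries) for an image list.

    Assumes each non-empty line looks like "<image path> <integer label>".
    """
    labels = []
    bad_entries = []
    for line_no, line in enumerate(list_lines, start=1):
        line = line.strip()
        if not line:
            continue
        # Split from the right so that paths containing spaces still parse.
        path, label_text = line.rsplit(None, 1)
        label = int(label_text)
        labels.append(label)
        # Same condition as the CUDA assertion: 0 <= t < n_classes.
        if not 0 <= label < num_classes:
            bad_entries.append((line_no, path, label))
    if not labels:
        raise ValueError("no labelled entries found")
    return min(labels), max(labels), bad_entries


# Hypothetical sample list with labels 0..2, checked against num_classes=2.
sample = [
    "/data/faces/0001/001.png 0",
    "/data/faces/0002/001.png 1",
    "/data/faces/0003/001.png 2",
]
lowest, highest, bad = check_labels(sample, num_classes=2)
print(lowest, highest, bad)
```

If `bad` is not empty, either the labels or the --num_classes value needs to change. With labels starting at 0, the class count has to be at least the largest label plus one.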

Bananajia commented 7 years ago

Thanks for your help~ At first I used labels from 1 to 10575, and it failed for the same reason. Then I changed the labels to run from 0 to 10574, but I forgot that --num_classes was still set to 10574. After I changed it to --num_classes=10575, it worked~~ Thank you~!
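For anyone hitting the same thing, the relabelling step can be sketched roughly as below. `remap_labels` is a hypothetical helper written for illustration, not something from this repository. It maps whatever integer labels the list uses onto 0..N-1 and returns N, which is the value to pass as --num_classes.

```python
def remap_labels(raw_labels):
    """Map arbitrary integer labels onto 0..N-1.

    Returns (new_labels, num_classes).
    """
    # Sorting the distinct labels keeps the mapping deterministic across runs.
    mapping = {old: new for new, old in enumerate(sorted(set(raw_labels)))}
    return [mapping[label] for label in raw_labels], len(mapping)


# Labels that start at 1 (and skip some values) become contiguous from 0.
raw = [1, 2, 2, 5]
new_labels, num_classes = remap_labels(raw)
print(new_labels, num_classes)  # [0, 1, 1, 2] 3
```

Deriving the class count from the data this way avoids the off-by-one between the largest label and the number of classes.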