SaraBabakN / MFCL-NeurIPS23


RuntimeError: CUDA error: device-side assert triggered on tinyimagenet #3

Closed smart0eddie closed 7 months ago

smart0eddie commented 7 months ago

Hi, I got a CUDA error on tinyimagenet at:

```python
dw_KD = self.dw_k[-1 * torch.ones(len(kd_index),).long()].to(self.device)
```

```
pytorch\aten\src\ATen\native\cuda\IndexKernel.cu:92: block: [0,0,0], thread: [3,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
```
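(For anyone else debugging this kind of error: a device-side assert is raised asynchronously, so the Python line it points at is not necessarily the real culprit. A minimal sketch of two common ways to localize it — rerunning the same indexing on CPU, where it raises an ordinary `IndexError`, or forcing synchronous kernel launches; the sizes 132/140 below are from this run, and the script name in the comment is a placeholder:)

```python
import torch

# Out-of-bounds advanced indexing: on CUDA this fires the device-side
# assert in IndexKernel.cu; on CPU the same operation raises a regular
# IndexError that names the offending index.
mappings = torch.ones(132)        # sized by the number of samples (the bug)
labels = torch.tensor([5, 139])   # but class ids go up to 139

try:
    dw = mappings[labels]         # index 139 >= size 132 -> out of bounds
except IndexError as e:
    print("caught:", e)

# On a CUDA run, forcing synchronous launches makes the traceback point
# at the real failing line:
#   CUDA_LAUNCH_BLOCKING=1 python <your_script>.py
```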

It's pretty weird that this happens in the middle of training.

The following is the training log:

```
round 9, accuracy: 14.199999809265137
round 19, accuracy: 21.0
round 29, accuracy: 27.600000381469727
round 39, accuracy: 32.099998474121094
round 49, accuracy: 35.099998474121094
round 59, accuracy: 38.5
round 69, accuracy: 42.70000076293945
round 79, accuracy: 43.099998474121094
round 89, accuracy: 43.70000076293945
round 99, accuracy: 47.20000076293945
round 109, accuracy: 19.649999618530273
round 119, accuracy: 24.350000381469727
round 129, accuracy: 26.049999237060547
round 139, accuracy: 28.0
round 149, accuracy: 28.399999618530273
round 159, accuracy: 30.049999237060547
round 169, accuracy: 30.649999618530273
round 179, accuracy: 31.149999618530273
round 189, accuracy: 31.200000762939453
round 199, accuracy: 31.600000381469727
total_accuracy_1: [tensor(24.8000), tensor(38.4000)]
round 209, accuracy: 17.83333396911621
round 219, accuracy: 20.200000762939453
round 229, accuracy: 22.5
round 239, accuracy: 23.133333206176758
round 249, accuracy: 24.066667556762695
round 259, accuracy: 24.83333396911621
round 269, accuracy: 24.733333587646484
round 279, accuracy: 24.933332443237305
round 289, accuracy: 25.433332443237305
round 299, accuracy: 25.133333206176758
total_accuracy_2: [tensor(25.4000), tensor(7.9000), tensor(42.1000)]
round 309, accuracy: 17.700000762939453
round 319, accuracy: 18.399999618530273
round 329, accuracy: 19.174999237060547
round 339, accuracy: 19.850000381469727
round 349, accuracy: 19.899999618530273
round 359, accuracy: 20.950000762939453
round 369, accuracy: 21.174999237060547
round 379, accuracy: 21.0
round 389, accuracy: 21.475000381469727
round 399, accuracy: 21.725000381469727
total_accuracy_3: [tensor(22.3000), tensor(6.6000), tensor(24.3000), tensor(33.7000)]
round 409, accuracy: 16.020000457763672
round 419, accuracy: 16.979999542236328
round 429, accuracy: 17.81999969482422
round 439, accuracy: 18.280000686645508
round 449, accuracy: 18.479999542236328
round 459, accuracy: 19.15999984741211
round 469, accuracy: 19.280000686645508
round 479, accuracy: 19.1200008392334
round 489, accuracy: 19.0
round 499, accuracy: 19.600000381469727
total_accuracy_4: [tensor(20.7000), tensor(5.5000), tensor(15.6000), tensor(13.3000), tensor(42.9000)]
round 509, accuracy: 15.933333396911621
round 519, accuracy: 16.983333587646484
round 529, accuracy: 17.266666412353516
round 539, accuracy: 17.733333587646484
round 549, accuracy: 17.433332443237305
round 559, accuracy: 18.183332443237305
round 569, accuracy: 17.71666717529297
round 579, accuracy: 17.983333587646484
round 589, accuracy: 17.83333396911621
round 599, accuracy: 18.116666793823242
total_accuracy_5: [tensor(20.7000), tensor(5.), tensor(13.8000), tensor(7.2000), tensor(19.3000), tensor(42.7000)]
```

smart0eddie commented 7 months ago

I think I found the problem.

t=6 epoch=0 i=3

Line 50 in MFCL.py:

```python
mappings = torch.ones(y_com.size(), dtype=torch.float32, device='cuda')
dw_cls = mappings[y_com.long()]
```

Here `y_com` has only 132 elements while there are 140 classes. `mappings` is per class, not per sample, so it should not be initialized based on the number of samples.

This should fix the bug:

```python
mappings = torch.ones(self.valid_dim, dtype=torch.float32, device='cuda')
```

(Actually, this can be removed entirely, since it contains all ones.)
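(A minimal sketch of the failure mode, contrasting the sample-count sizing with the class-count sizing; the names and the sizes 132/140 follow the snippets and numbers above:)

```python
import torch

n_samples, n_classes = 132, 140
y_com = torch.randint(0, n_classes, (n_samples,))  # labels can reach 139

# Buggy: mappings sized by the number of samples...
mappings_bug = torch.ones(y_com.size(), dtype=torch.float32)
# ...so any label >= n_samples indexes out of bounds:
# dw_cls = mappings_bug[y_com.long()]  # fails once a label >= 132 appears

# Fixed: one weight per class, sized by the number of valid classes.
valid_dim = n_classes
mappings_ok = torch.ones(valid_dim, dtype=torch.float32)
dw_cls = mappings_ok[y_com.long()]     # always in bounds
assert dw_cls.shape == y_com.shape
```

This also explains why the crash only shows up mid-training: it needs a batch containing a label whose index exceeds the batch's sample count, which only becomes possible after enough classes have accumulated across tasks.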

SaraBabakN commented 7 months ago

Thank you so much for pointing this out. I will check and update the code if required. About the mapping, you are right; however, some other algorithms use this parameter to give more weight to the current classes. Since we are not doing that here, I will try to safely remove it from the code.

SaraBabakN commented 7 months ago

The bug should be fixed now.