I haven't run into this problem before.
You can first set the train_mode to "cnn" to get the pre-trained CNN backbone,
then jointly train the CNN-RNN model.
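
For illustration, a minimal, self-contained sketch of this two-stage schedule (this is not the repository's code; the modules, losses, and hyperparameters below are placeholders, only the "pre-train the CNN, then train CNN and RNN jointly" idea is taken from the advice above):

```python
# Illustrative sketch of the two-stage schedule; all modules/losses are dummies.
import torch
import torch.nn as nn

cnn = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
rnn = nn.GRU(input_size=16, hidden_size=16, batch_first=True)

frames = torch.randn(4, 8, 3, 64, 32)           # (batch, seq_len, C, H, W) dummy clips
b, t = frames.shape[:2]

# Stage 1 (train_mode == "cnn"): optimize the CNN backbone alone on per-frame features.
opt_cnn = torch.optim.SGD(cnn.parameters(), lr=0.01)
feats = cnn(frames.flatten(0, 1))               # (b*t, 16)
loss = feats.pow(2).mean()                      # stand-in for the real identification loss
opt_cnn.zero_grad(); loss.backward(); opt_cnn.step()

# Stage 2: joint CNN-RNN training, starting from the pre-trained backbone.
opt_joint = torch.optim.SGD(list(cnn.parameters()) + list(rnn.parameters()), lr=0.001)
feats = cnn(frames.flatten(0, 1)).view(b, t, -1)
seq_out, _ = rnn(feats)
loss = seq_out.mean()                           # stand-in for the joint loss
opt_joint.zero_grad(); loss.backward(); opt_joint.step()
```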
thanks,
@dapengchen123 I debugged the code today, and I found that this problem may happen because the value of `loss_ver` is nan in trainer.py line 139, which makes the next line `loss = loss_id*self.rate + 100*loss_ver` nan too.

The reason is the variable `mask` in pairloss.py line 23 being [0,0,0,0]. For example, in line 21 the variable `tar_gallery` = [0,0] and in line 22 the variable `tar_probe` = [[82],[48]], so `mask = tar_probe.expand(N_probe, N_gallery).eq(tar_gallery.expand(N_probe, N_gallery))` will be [0,0,0,0]. Then in line 36, `weights = weights / torch.sum(weights) / 10` = 0 / 0 / 10 = nan. Once we call `self.BCE()`, it outputs loss = nan.

I tried to solve this problem with the following code, and the code can run. I am confused: when you run the code, does `loss_ver` ever become nan? I find it sometimes is not nan but a small value, e.g. 6.8398e-02.
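
For reference, a minimal standalone sketch of the NaN path described above (this is not the repo's pairloss.py; the final BCE call is a stand-in, but the mask/weights construction follows the lines quoted in the comment):

```python
# Reproduce the NaN mechanism: a mask with no positive pairs makes the weight
# normalization divide by zero, and the weighted BCE then returns nan.
import torch
import torch.nn.functional as F

tar_gallery = torch.tensor([0, 0])               # labels as in the comment above
tar_probe = torch.tensor([[82], [48]])
N_probe, N_gallery = 2, 2

mask = tar_probe.expand(N_probe, N_gallery).eq(tar_gallery.expand(N_probe, N_gallery))
print(mask)                                      # all False -> [0,0,0,0] after flattening

weights = mask.float().view(-1)
weights = weights / torch.sum(weights) / 10      # 0 / 0 / 10 -> nan
print(weights)                                   # tensor([nan, nan, nan, nan])

scores = torch.rand(4)                           # stand-in verification scores in (0, 1)
loss_ver = F.binary_cross_entropy(scores, mask.float().view(-1), weight=weights)
print(loss_ver)                                  # tensor(nan)
```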
@AsuradaYuci Have you solved this problem? If not, I will check the code.
It ran well previously.
@AsuradaYuci It is strange that the `mask` can be [0,0,0,0]. I use RandomPairSampler for the dataloader, so N_gallery and N_probe should be equal, and the `mask` should be a square matrix with diagonal elements equal to 1.
BTW, `tar_gallery` and `tar_probe` should be equal.
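
A small sketch of what this expected mask looks like, assuming the same expand/eq construction quoted earlier (the person IDs below are made up): when `tar_probe` and `tar_gallery` hold the same IDs, probe i matches gallery i and the diagonal is 1.

```python
import torch

tar = torch.tensor([82, 48, 127, 3])             # hypothetical person IDs, one pair each
tar_probe = tar.view(-1, 1)                      # column vector of probe IDs
tar_gallery = tar.view(1, -1)                    # row vector of gallery IDs
N = tar.numel()

mask = tar_probe.expand(N, N).eq(tar_gallery.expand(N, N))
print(mask.int())
# tensor([[1, 0, 0, 0],
#         [0, 1, 0, 0],
#         [0, 0, 1, 0],
#         [0, 0, 0, 1]])
```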
@dapengchen123 I found a problem in RandomPairSampler: in line 62, `i` is a tensor (for example `i = tensor(2021)`), and in line 65, `pid_i = self.index_pid[i]` makes `pid_i` always 0; even after `i` has changed to `tensor(1250)`, `pid_i` is still 0. I added a new line before line 65 (`i = int(i)`, so that `i = 2021`), and now `pid_i = 82`, which I think is the correct value. Now I get the `mask` [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1].
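
One plausible mechanism for this symptom, sketched below under the assumption that `index_pid` behaves like a defaultdict keyed by plain ints (the real sampler may differ in detail): a tensor key hashes by object identity, so the lookup never hits the stored int key and falls back to the default value 0, while converting to a Python int first finds the real entry.

```python
import torch
from collections import defaultdict

index_pid = defaultdict(int, {2021: 82, 1250: 57})   # hypothetical index -> person ID map

i = torch.tensor(2021)
print(index_pid[i])        # 0 -- tensor keys hash by identity, so the lookup misses
print(index_pid[int(i)])   # 82 -- converting to int first finds the stored entry
```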
@dapengchen123 Hello, I'm also working on this project, but I'm quite new to CV. Can you explain the use of the mask in more detail? And why should tar_gallery and tar_probe be equal? I don't quite understand.
`CUDA error after cudaEventDestroy in future dtor: device-side assert triggered`. More information in the pictures.
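
A device-side assert like this typically comes from an out-of-range index on the GPU (e.g. a label or pid larger than the classifier's output size). A minimal sketch of one common way to localize it, assuming a standard PyTorch training script:

```python
# Sketch only: force synchronous CUDA launches so the device-side assert is raised
# at the exact Python line that triggered it, instead of at a later CUDA call.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must be set before the first CUDA call

import torch  # import torch (and the model/training code) only after setting the flag
# ... run the training as usual; the failing operation now reports synchronously
```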
_Originally posted by @AsuradaYuci in https://github.com/dapengchen123/video_reid/issues/1#issuecomment-441043962_