KovenYu / MAR

Pytorch code for our CVPR'19 (oral) work: Unsupervised person re-identification by soft multilabel learning
https://kovenyu.com/publication/2019-cvpr-mar/

invalid value #16

Closed sft110 closed 5 years ago

sft110 commented 5 years ago

/home/usr/MAR/src/utils.py:162: RuntimeWarning: invalid value encountered in greater
  is_positive = p_agree[similar_idx] > self.threshold.item()

As you suggested in the previous issue, I reduced the batch size and lr, but I am still getting an error. How do I deal with it? I am using 2 GPUs with 12 GB each.

Iter: [900/2481] Freq 213.2 loss_total nan loss_ml nan loss_st nan loss_target nan loss_source nan [2019-06-17 11:23:52]

After the first epoch I get nan every time, with batchsize=60 and lr=0.0002.

When I try to run on 2 RTX GPUs with 24 GB each, I get this error:

Traceback (most recent call last):
  File "src/main.py", line 46, in <module>
    main()
  File "src/main.py", line 35, in main
    meters_trn = trainer.train_epoch(source_loader, target_loader, epoch)
  File "/home/saif/MAR/src/trainers.py", line 123, in train_epoch
    multilabels = F.softmax(featurestarget.mm(agents.detach().t()*self.args.scala_ce), dim=1)
RuntimeError: set_storage_offset is not allowed on Tensor created from .data or .detach()

I was having some problems with PyTorch and CUDA, so I installed the nightly build.

KovenYu commented 5 years ago

Sorry, I don't have time to address this now. I will come back to it after the ICCV rebuttal.

sft110 commented 5 years ago

Okay, I'll be waiting. I have tried different lr values but still get the same nan after the first epoch.

sft110 commented 5 years ago

It now runs on the RTX GPUs after setting cudnn.benchmark = False.

KovenYu commented 5 years ago

@saiftumrani Hi, I tried batchsize=60 with lr=2e-5 on two 1080 Ti GPUs, but did not observe your problem after a few epochs:

Iter: [3067/4135] Freq 151.2 loss_source 0.040 loss_st 0.598 loss_ml 7262.415 loss_target 0.443 loss_total 9.878 [2019-06-30 13:05:43]

What are the values of p_agree[similar_idx] and self.threshold.item() when this warning appears?

/home/usr/MAR/src/utils.py:162: RuntimeWarning: invalid value encountered in greater
  is_positive = p_agree[similar_idx] > self.threshold.item()
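One way to answer the question above is to summarize both operands just before the comparison on line 162. The sketch below is only an illustration: `summarize_values` is a hypothetical helper, not part of the MAR code, and it assumes the values arrive as a flat sequence of numbers (a NumPy array would probably need `.ravel().tolist()` first). It uses only the standard library.

```python
import math


def summarize_values(values, threshold):
    """Report NaN/inf counts and the finite range of `values` next to `threshold`.

    Hypothetical debugging helper; `values` is assumed to be a flat
    sequence of numbers and `threshold` a single number.
    """
    vals = [float(v) for v in values]
    thr = float(threshold)
    finite = [v for v in vals if math.isfinite(v)]
    return {
        "n": len(vals),
        "n_nan": sum(math.isnan(v) for v in vals),
        "n_inf": sum(math.isinf(v) for v in vals),
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
        "threshold": thr,
        "threshold_is_finite": math.isfinite(thr),
        # Only finite entries can be compared meaningfully.
        "n_above_threshold": sum(v > thr for v in finite),
    }


# Toy input standing in for p_agree[similar_idx] and self.threshold.item().
report = summarize_values([0.2, float("nan"), 0.9, float("inf")], threshold=0.5)
print(report)
```

A non-zero `n_nan`, or a threshold that is not finite, would point to NaN values already present before the comparison, which is what "invalid value encountered in greater" usually indicates.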

sft110 commented 5 years ago

That is done. Please help me with this one:

initializing centres/threshold ...
not found data/ml_Market.dat. computing ml...
saving computed ml to data/ml_VeRi.dat
Traceback (most recent call last):
  File "src/main.py", line 46, in <module>
    main()
  File "src/main.py", line 35, in main
    meters_trn = trainer.train_epoch(source_loader, target_loader, epoch)
  File "/home/saif/MAR1/src/trainers.py", line 95, in train_epoch
    self.init_losses(target_loader)
  File "/home/saif/MAR1/src/trainers.py", line 188, in init_losses
    torch.save((multilabels, views, pairwise_agreements), self.args.ml_path)
  File "/home/saif/venv/lib/python3.5/site-packages/torch/serialization.py", line 219, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/saif/venv/lib/python3.5/site-packages/torch/serialization.py", line 144, in _with_file_like
    return body(f)
  File "/home/saif/venv/lib/python3.5/site-packages/torch/serialization.py", line 219, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/saif/venv/lib/python3.5/site-packages/torch/serialization.py", line 292, in _save
    pickler.dump(obj)
OverflowError: cannot serialize a string larger than 4GiB

KovenYu commented 5 years ago

@saiftumrani torch.save uses pickle at its core. From your error message it seems that the pickle protocol in use is too old (pickle protocol 4, introduced in Python 3.4, removes this 4 GB limit). So please make sure a recent pickle protocol is used and that your Python/PyTorch versions are correct (I use Python 3.6 and PyTorch 1.0.0).

sft110 commented 5 years ago

I am using PyTorch 1.0.0 and pickle protocol 4.0 and still face the same problem.

KovenYu commented 5 years ago

@saiftumrani How about using pickle's own dump function instead of torch.save? Can plain pickle save your large file?
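A minimal sketch of that suggestion, using only the standard library: it writes an object with `pickle.dump` and an explicit `protocol=4`, then reads it back. The helper names, the file name, and the small toy payload are assumptions for illustration. In the real case the object would be the `(multilabels, views, pairwise_agreements)` tuple from the traceback, and whether plain pickle handles it well was not confirmed in this thread.

```python
import os
import pickle
import tempfile


def save_with_protocol4(obj, path):
    """Write `obj` to `path` with pickle protocol 4 (available since Python 3.4)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=4)


def load_pickle(path):
    """Read back an object written by save_with_protocol4."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy payload standing in for (multilabels, views, pairwise_agreements).
payload = ([[0.1, 0.9], [0.7, 0.3]], [0, 1], [[1.0, 0.2], [0.2, 1.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "ml_demo.dat")  # hypothetical file name
    save_with_protocol4(payload, demo_path)
    restored = load_pickle(demo_path)

print(restored == payload)
```

The traceback shows torch.save forwarding a `pickle_protocol` argument to its internal `_save`, so passing `pickle_protocol=4` to torch.save may be another option. That is untested here.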

sft110 commented 5 years ago

Thank you. I am facing another problem while training:

MAR/src/utils.py:162: RuntimeWarning: invalid value encountered in greater
  is_positive = p_agree[similar_idx] > self.threshold.item()

KovenYu commented 5 years ago

What do you mean? I thought you had addressed this issue, as you commented on Jul. 18.

sft110 commented 5 years ago

Please check your email. I have described the details there.

queenie88 commented 4 years ago

@saiftumrani How did you solve this problem?

RuntimeWarning: invalid value encountered in greater
  is_positive = p_agree[similar_idx] > self.threshold.item()

Hoping for your reply.