Closed tiamo405 closed 1 year ago
hey, I am facing the same issue when running the getting-started code, did you figure out the problem?
update: just removed MODEL.DEVICE "cuda:0" from the command line and ran this:
python3 tools/train_net.py --config-file ./configs/Market1501/bagtricks_R50.yml
Hi. I was facing the same issue today on the Market1501 dataset, and in my case I think it had to do with self.optimizer.zero_grad() (found in _engine/trainloop.py). After changing this to self.optimizer.zero_grad(set_to_none=False), I started getting a different error, related to my GPU.
The problem with my GPU was that it could not allocate enough memory. I managed to fix this by adding a parameter to the autocast in the same file, as follows: with autocast(dtype=torch.float16):. Moreover, I also played around with IMS_PER_BATCH in Base-bagtricks.yml; I had to reduce the batch size in both the SOLVER and TEST fields. I also added torch.cuda.empty_cache() right after self.grad_scaler.update(), but I don't think it is necessary. Emptying the cache ensured training worked, but it also increased training time.
I am quite new to this, so I am not sure this was the best approach, but I hope it helps.
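The combination of changes described above can be sketched as a single training step. This is a minimal stand-in, not fast-reid's actual train loop: the model, optimizer, and data are placeholders, and the device/dtype switching is added so it also runs on a CPU-only machine (fast-reid's code assumes CUDA).

```python
import torch
from torch import nn

# Placeholder model/optimizer/data standing in for fast-reid's trainer state.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
# float16 autocast is CUDA-only; bfloat16 is the CPU fallback used here.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

inputs = torch.randn(4, 8, device=device)
targets = torch.randint(0, 2, (4,), device=device)

optimizer.zero_grad(set_to_none=False)   # keep zeroed tensors instead of None
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
if device == "cuda":
    torch.cuda.empty_cache()  # optional: frees cached blocks, but slows training
```

Note the trade-off mentioned above: empty_cache() only releases cached allocator blocks back to the driver, so it reduces peak memory pressure at the cost of re-allocating them on the next step.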
My workaround was to downgrade the PyTorch version:
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
then the program works normally.
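After reinstalling, a quick sanity check confirms which build is actually active (the printed values depend on your install, so none are asserted here beyond basic shape):

```python
import torch

# Verify the downgrade took effect before re-running training.
print("torch:", torch.__version__)          # expect something like 1.11.0+cu113
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
```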
In my case, adding the set_to_none parameter to optimizer.zero_grad fixed the problem: self.optimizer.zero_grad(set_to_none=False).
Hi everyone, I also encountered this issue in PyTorch 2.0, and after debugging the code I found a simple solution that does not require downgrading PyTorch. Just set 'continuous=False' in line 380 of defaults.py:
return build_optimizer(cfg, model)
# convert to
return build_optimizer(cfg, model, continuous=False)
This avoids converting the network's params to ContiguousParams. I debugged the code and found that when using ContiguousParams, the calculated gradient is None after loss.backward(), leading to training errors.
After setting 'continuous=False', the network trains smoothly and may even achieve higher precision : )
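The None-gradient failure described here can be reproduced with a toy stand-in for ContiguousParams (this is NOT fast-reid's actual class, just a minimal illustration of the aliasing it relies on): the optimizer steps on a shared flat gradient buffer, and anything that detaches .grad from that buffer, such as zero_grad(set_to_none=True), leaves the buffer stale.

```python
import torch
from torch import nn

# Toy stand-in: one parameter whose .grad aliases a shared flat buffer,
# the way a contiguous-params scheme arranges gradient storage.
p = nn.Parameter(torch.randn(3))
grad_buffer = torch.zeros(3)   # the optimizer would read/step from this buffer
p.grad = grad_buffer           # alias .grad to the shared storage

(p * 2.0).sum().backward()     # autograd accumulates 2.0 into the buffer
buffer_after_first = grad_buffer.clone()

p.grad = None                  # what zero_grad(set_to_none=True) does in 2.0
(p * 2.0).sum().backward()     # a fresh .grad tensor is allocated instead
buffer_after_second = grad_buffer.clone()
```

After the second backward pass the shared buffer never receives the new gradients, so an optimizer reading from it sees stale (effectively missing) values, matching the broken-training symptom above.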
@huiyiygy Thanks for this! Minor typo: it should be contiguous=False, not continuous. Otherwise this seems to work. While you can also train with AMP off, I get pretty terrible loss curves that way; with this set, loss immediately decreases as expected.
file: Base-SBS.yml
error:
command run:
CUDA_VISIBLE_DEVICES=0 python tools/train_net.py \
    --config-file ./configs/Market1501/sbs_R50.yml \
    MODEL.DEVICE "cuda:0"
can someone help me? thanks