Hi Zhu,
It has been a while since I worked on this code.
As far as I remember:

1. Create a new Dataset module (e.g., `imagenet`) inside `simsiam/data`, following the CIFAR module (see the sketch below).
2. Import the new dataset class inside `__init__`, e.g., `from data.imagenet import ImageNet`.
3. Provide the `set` argument to indicate your new Dataset (e.g., `ImageNet`).
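A rough sketch of what such a module could look like, assuming your images are arranged in one folder per class; the class name and constructor signature here are illustrative guesses and should be adapted to match the CIFAR module in `simsiam/data`:

```python
# data/imagenet.py -- hypothetical example, not the repo's actual module.
from torchvision import datasets


class ImageNet(datasets.ImageFolder):
    """Custom dataset selectable via the `set` argument.

    Assumes a layout of root/train/<class>/<image>.jpg and
    root/val/<class>/<image>.jpg, as expected by torchvision's ImageFolder.
    """

    def __init__(self, root, train=True, transform=None):
        split = "train" if train else "val"
        super().__init__(root=f"{root}/{split}", transform=transform)
```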
I hope this helps
Thank you for your reply. I came across another problem when I ran pretrain_main.py; the following errors are shown:
```
Train Running basic DDP example on rank 0.
Process Process-2:
Traceback (most recent call last):
  File "D:\public_program\miniconda\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "D:\public_program\miniconda\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "F:\Zhujilin\SimSiam\pretrain_main.py", line 66, in train_ddp
    setup(rank, cfg.world_size, start_port)
  File "F:\Zhujilin\SimSiam\pretrain_main.py", line 28, in setup
    dist.init_process_group(
  File "C:\Users\Zhujilin\.conda\envs\simsiam\lib\site-packages\torch\distributed\distributed_c10d.py", line 503, in init_process_group
    _update_default_pg(_new_process_group_helper(
  File "C:\Users\Zhujilin\.conda\envs\simsiam\lib\site-packages\torch\distributed\distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError
```
I have tried looking up solutions on the Internet, but I still cannot find an appropriate one. Could you please give me some advice on how to deal with this error? I would really appreciate it.
Seems like a DDP initialization error. Try to run the code on a single GPU first and see if the error persists. You can do so by calling `train_ddp` directly instead of `spawn_train`. Make sure to pass the right parameters to `train_ddp`.
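For illustration, here is a minimal self-contained sketch of that debugging pattern; the real `train_ddp` and `spawn_train` in `pretrain_main.py` take more arguments, so this only shows the shape of the change, not the repo's actual code:

```python
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def setup(rank, world_size, port=29500):
    # Initialize the default process group. gloo also works on CPU/Windows;
    # nccl is the usual choice on Linux with CUDA.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("gloo", rank=rank, world_size=world_size)


def train_ddp(rank, world_size):
    setup(rank, world_size)
    print(f"Running on rank {rank}")
    # ... build the model and run the training loop here ...
    dist.destroy_process_group()


def spawn_train(world_size):
    # Multi-process path: one worker process per GPU.
    mp.spawn(train_ddp, args=(world_size,), nprocs=world_size)


if __name__ == "__main__":
    # Debugging path: call train_ddp directly with a single process
    # instead of going through spawn_train.
    train_ddp(rank=0, world_size=1)
```

Once the single-process run works, switch back to `spawn_train` for multi-GPU training.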
Thank you, I have resolved this issue.
However, I have another problem. When I reached epoch 799, an error occurred. I fixed the code and used the 799-epoch.state file to resume training, but the training accuracy dropped from 89% to 50%. This is different from what I expected; I thought training would continue from the previous accuracy.
So why did this happen, and how should I handle this situation? I'm a beginner at coding, so my questions might be bothersome, and I apologize for that. Thank you very much for your answers.
The training information is as follows:
I am not sure whether you are referring to the pre-training (pretrain_main.py) or the fine-tuning (classifier_main.py) stage.
If you are referring to the pre-training stage, you should set either `resume` or `pretrained`.

I don't fully remember the difference between `resume` and `pretrained`, but at least one difference is that `pretrained` will assume you are training from scratch, i.e., start_epoch=0. In contrast, `resume` will set start_epoch correctly to resume from where the failure happened. In that case, the lr scheduler should also resume with the correct learning rate, i.e., the one at the crash point and not the initial lr.
If I were you, I would make sure the resume code is executed correctly and both the start_epoch and the lr reflect the state where failure happened.
A similar logic applies to the fine-tuning (classifier_main.py) stage. Basically, make sure your code executes this line. Also double-check both the start_epoch and the lr.
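For reference, a correct resume path typically restores the model, the optimizer, the lr scheduler, and the epoch counter together. A minimal sketch (the checkpoint keys below are illustrative assumptions, not necessarily the format this repo saves):

```python
import torch


def resume_from_checkpoint(path, model, optimizer, scheduler):
    # Hypothetical checkpoint layout; adjust the keys to whatever
    # pretrain_main.py / classifier_main.py actually save.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])          # network weights
    optimizer.load_state_dict(ckpt["optimizer"])  # momentum/Adam buffers
    scheduler.load_state_dict(ckpt["scheduler"])  # restores the current lr
    start_epoch = ckpt["epoch"] + 1               # continue, don't restart at 0
    return start_epoch
```

If only the weights are restored (as with `pretrained`), start_epoch goes back to 0 and the scheduler restarts at the initial lr, which can easily explain a large drop in accuracy right after resuming.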
I hope this helps
This helps a lot, thank you
Hello, your code uses the CIFAR-10 dataset, but the cifar-10-python files have a specialized format. I want to use my own pictures to train; how should I prepare them? Thank you very much.