ahmdtaha / simsiam

PyTorch implementation of Exploring Simple Siamese Representation Learning

How to prepare my own dataset? #8

Closed · zhujilin1995 closed 1 year ago

zhujilin1995 commented 1 year ago

Hello, in your code the CIFAR-10 dataset is used, but the CIFAR-10 python files are in a specialized format. I want to train on my own images; how should I prepare them? Thank you very much.

ahmdtaha commented 1 year ago

Hi Zhu, it has been a while since I worked on this code. As far as I remember, you need to:

  1. Create a new Dataset module (e.g., imagenet) inside simsiam/data, following the CIFAR module (a sketch is below).
  2. Import the new dataset class inside `__init__`, e.g., `from data.imagenet import ImageNet`.
  3. Provide the `set` argument to indicate your new Dataset (e.g., ImageNet).

I hope this helps
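For reference, here is a minimal sketch of what such a module (e.g., simsiam/data/imagenet.py) could look like. The constructor signature and the two-view `__getitem__` are assumptions on my part, so align them with the existing CIFAR module in your checkout:

```python
import os

from PIL import Image
from torch.utils.data import Dataset


class ImageNet(Dataset):
    """Toy image-folder dataset; mirror data/cifar.py's actual interface."""

    def __init__(self, root, transform=None):
        self.transform = transform
        # Collect image paths; adjust the extension filter to your data.
        self.paths = [
            os.path.join(root, f)
            for f in sorted(os.listdir(root))
            if f.lower().endswith((".jpg", ".jpeg", ".png"))
        ]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        # SimSiam pre-training consumes two augmented views of the same image.
        if self.transform is not None:
            return self.transform(img), self.transform(img)
        return img, img
```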

zhujilin1995 commented 1 year ago

Thank you for your reply. I came across another problem when I ran pretrain_main.py; the following error was shown:

```
Train Running basic DDP example on rank 0.
Process Process-2:
Traceback (most recent call last):
  File "D:\public_program\miniconda\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "D:\public_program\miniconda\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "F:\Zhujilin\SimSiam\pretrain_main.py", line 66, in train_ddp
    setup(rank, cfg.world_size, start_port)
  File "F:\Zhujilin\SimSiam\pretrain_main.py", line 28, in setup
    dist.init_process_group(
  File "C:\Users\Zhujilin\.conda\envs\simsiam\lib\site-packages\torch\distributed\distributed_c10d.py", line 503, in init_process_group
    _update_default_pg(_new_process_group_helper(
  File "C:\Users\Zhujilin\.conda\envs\simsiam\lib\site-packages\torch\distributed\distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError
```

I have tried looking up solutions on the Internet, but I still cannot find an appropriate one. Could you please give me some advice on how to deal with this error? I would really appreciate it.

ahmdtaha commented 1 year ago

Seems like a DDP initialization error. Try running the code on a single GPU first and see if the error persists (a standalone way to check the DDP setup is sketched below the list). You can do so either by:

  1. Setting `world_size = 1`, or
  2. Skipping the DDP setup altogether and going directly to `train_ddp`. Basically, call `train_ddp` instead of `spawn_train`, and make sure to pass the right parameters to `train_ddp`.
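For reference, a quick standalone check (generic PyTorch, not this repo's code) tells you whether a gloo process group can be created at all on your machine. On Windows, torch.distributed is gloo-only and, as far as I know, requires a reasonably recent PyTorch build:

```python
import os
import torch.distributed as dist

# Point the default "env://" rendezvous at the local machine.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# Try to build a single-process gloo group, which is essentially what
# the repo's setup() does per the traceback above.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print("process group initialized:", dist.is_initialized())
dist.destroy_process_group()
```

If this minimal snippet also raises a RuntimeError, the problem is in the PyTorch build or the OS environment rather than in the repo's code.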
zhujilin1995 commented 1 year ago

Thank you, I have settled this issue.

However, I have another problem. When I finished epoch 799, an error occurred. I fixed the code and used the 799-epoch.state checkpoint to retrain, but the training accuracy dropped from 89% to 50%. This is different from what I expected; I thought training would continue from the previous accuracy.

So, why did this happen, and how should I handle it? I'm a beginner at coding, so my questions might be bothersome; I apologize for that, and thank you very much for your answers.

The training information is shown in the attached screenshot.

ahmdtaha commented 1 year ago

I am not sure whether you are referring to the pre-training (pretrain_main.py) or the fine-tuning (classifier_main.py) stage.

If you are referring to the pre-training stage, you should set either `resume` or `pretrained`.

I don't fully remember the difference between `resume` and `pretrained`, but at least one difference is that `pretrained` assumes you are training from scratch, i.e., `start_epoch=0`. In contrast, `resume` sets `start_epoch` correctly to continue from where the failure happened. In that case, the lr scheduler should also resume with the correct learning rate, i.e., the one at the crash point, not the initial lr.

If I were you, I would make sure the resume code executes correctly and that both `start_epoch` and the lr reflect the state at which the failure happened.
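For reference, here is a minimal, generic sketch of what a correct resume path usually looks like in PyTorch. This is not the repo's exact code; the checkpoint keys and the stand-in model are assumptions, so match them to what pretrain_main.py actually writes into the *.state file:

```python
import torch
from torch import nn, optim

model = nn.Linear(8, 8)  # stand-in for the actual SimSiam network
optimizer = optim.SGD(model.parameters(), lr=0.05)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=800)

# --- what saving a checkpoint at epoch 799 might look like ---
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "epoch": 799}, "799-epoch.state")

# --- the resume path: restore *all* the state, not just the weights ---
ckpt = torch.load("799-epoch.state", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])  # keeps the lr schedule in sync
start_epoch = ckpt["epoch"] + 1               # resume, don't restart at 0

# Sanity check: the lr should match the crash point, not the initial lr.
print(start_epoch, optimizer.param_groups[0]["lr"])
```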

A similar logic applies to the fine-tuning (classifier_main.py) stage. Basically, make sure your code executes this line. Also double-check both `start_epoch` and the lr.

I hope this helps

zhujilin1995 commented 1 year ago

This helps a lot, thank you