Wadha-Almattar commented 1 year ago

Hi there, I am trying to run the code as explained in the README file. I am using an RTX 3090 Ti to run the code on a single GPU, but I'm getting an error that I haven't been able to figure out for days. I reduced the number of workers, epochs and warmup epochs just to get the code working. I also completed all the preprocessing steps mentioned in the README, and they work fine. I did face some incompatibility issues between the torch, torchvision and CUDA versions, but I solved them. Now I'm struggling with this issue.

Your help is appreciated

I run this: python main.py --num-workers 5 --arch ViT-S-p16 --batch-size 512 --epochs 10 --warmup-epochs 2 --data-index /home/guser1/SSiT/data_index/pretraining_dataset.pkl --save-path /home/guser1/SSiT/checkpoints

=================================
arch: ViT-S-p16
data_index: ./data_index/pretraining_dataset.pkl
save_path: /home/guser1/SSiT/checkpoints
record_path: None
pretrained: False
device: cuda
seed: -1
resume: False
distributed: False
backend: nccl
nodes: 1
n_gpus: None
addr: 127.0.0.1
port: 28888
rank: 0
input_size: 224
start_epoch: 0
epochs: 10
warmup_epochs: 2
mask_ratio: 0.25
disable_progress: False
ss: 10
ss_decay: False
cl: 1
saliency_threshold: 0.5
batch_size: 512
optimizer: ADAMW
moco_m: 0.99
temperature: 0.2
learning_rate: 0.001
momentum: 0.9
weight_decay: 0.1
num_workers: 5
save_interval: 20
pool_mode: max
dataset_ratio: 1.0

=============== Single GPU mode

=============================== Number of training samples: 999

0it [00:00, ?it/s]
Traceback (most recent call last):
  File "main.py", line 140, in <module>
    main()
  File "main.py", line 88, in main
    worker(0, n_gpus, args)
  File "main.py", line 121, in worker
    train(
  File "/home/guser1/SSiT/train.py", line 36, in train
    for step, train_data in progress:
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/guser1/SSiT/data.py", line 64, in __getitem__
    img_stu, img_tea, mask_stu, mask_tea = self.transform(img, mask)
  File "/home/guser1/SSiT/data.py", line 149, in __call__
    img_stu, mask_stu = self.rotation_with_mask(self.rotation_stu, img_stu, mask_stu, self.p_rotation_stu)
  File "/home/guser1/SSiT/data.py", line 179, in rotation_with_mask
    img = F.rotate(img, angle, tf.resample, tf.expand, tf.center, tf.fill)
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'RandomRotation' object has no attribute 'resample'
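
A note for readers hitting the same AttributeError: newer torchvision releases replaced the `resample` argument of `RandomRotation` with `interpolation`, so `tf.resample` no longer exists on the transform. Below is a minimal sketch of a version-tolerant lookup, not the repository's actual fix. The helper name `rotation_interpolation` is hypothetical (it is not part of SSiT or torchvision), and the two stand-in objects only mimic the relevant attributes so that the snippet runs without torchvision installed.

```python
from types import SimpleNamespace

_MISSING = object()  # sentinel, so that a stored value of None is still returned


def rotation_interpolation(tf):
    """Return the interpolation setting of a rotation transform.

    Checks the newer attribute name ('interpolation') first and falls back
    to the legacy one ('resample'). Hypothetical helper for illustration.
    """
    for name in ("interpolation", "resample"):
        value = getattr(tf, name, _MISSING)
        if value is not _MISSING:
            return value
    raise AttributeError(
        f"{type(tf).__name__!r} object has neither 'interpolation' nor 'resample'"
    )


# Stand-ins for a transform created by a newer and an older torchvision release.
new_style = SimpleNamespace(interpolation="nearest", expand=False, center=None, fill=0)
old_style = SimpleNamespace(resample="bilinear", expand=False, center=None, fill=0)

print(rotation_interpolation(new_style))
print(rotation_interpolation(old_style))
```

Under these assumptions, the failing call in `rotation_with_mask` could pass `rotation_interpolation(tf)` where it currently passes `tf.resample`. Pinning torchvision to the version the README was written against would be the other obvious route.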

Wadha-Almattar commented 1 year ago

One thing to add: I'm using a small dataset just to ensure that the code is working fine.

Wadha-Almattar commented 1 year ago

After encountering several issues when starting training and trying to debug the code, I believe that running the code on a single GPU is not straightforward. Your guidance on this matter would be highly appreciated.

The error message "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group" typically occurs in distributed computing scenarios when using PyTorch's distributed data parallel (DDP) or other distributed training functionalities.

YijinHuang commented 1 year ago

Thank you for your interest in our work and for pointing out this issue. The error is caused by invoking a function intended for distributed training while running in single GPU mode. I have made the necessary updates to the code, and it should now work. If you encounter any further difficulties or have any other concerns, please feel free to contact me.
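
The actual change to the repository is not shown in this thread. As a rough illustration of the general pattern, distributed calls can be guarded so that they fall back to single-process defaults when no process group has been initialized. The sketch below uses `torch.distributed.is_available()` and `torch.distributed.is_initialized()`; the wrapper names `is_distributed`, `get_world_size` and `get_rank` are hypothetical helpers and are not taken from SSiT.

```python
try:
    import torch.distributed as dist
except ImportError:  # torch not installed: treat as a non-distributed run
    dist = None


def is_distributed():
    """True only if torch.distributed is usable and a process group exists."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def get_world_size():
    """Number of participating processes; 1 in single GPU mode."""
    return dist.get_world_size() if is_distributed() else 1


def get_rank():
    """Rank of the current process; 0 in single GPU mode."""
    return dist.get_rank() if is_distributed() else 0


# With no call to init_process_group, the helpers return the
# single-process defaults instead of raising RuntimeError.
print(is_distributed(), get_world_size(), get_rank())
```

Collective operations such as all-reduce or gather would be wrapped in the same `is_distributed()` check, so single GPU runs skip them entirely.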

YijinHuang / SSiT

Single GPU settings #4
