Wadha-Almattar closed this issue 1 year ago
One more thing: I'm using a small dataset just to ensure that the code works.
After encountering several issues when starting training and trying to debug the code, I believe that running the code on a single GPU is not straightforward. Your guidance would be highly appreciated.
The error message "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group" typically occurs when a torch.distributed function is called before a process group exists, for example when code written for PyTorch's DistributedDataParallel (DDP) runs in a single process.
Thank you for your interest in our work and for pointing out this issue. The error is caused by a distributed-computing function being invoked while running in single GPU mode. I have made the necessary updates to the code, and it should now work. If you encounter any further difficulties or have any other concerns, please feel free to contact me.
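For readers hitting the same error, here is a minimal sketch of the kind of guard that avoids it. It is not the repository's actual fix: the function names `distributed_ready` and `average_across_processes` are hypothetical. It assumes the failing call is a collective such as `all_reduce`. Only `torch.distributed.is_available`, `is_initialized`, `all_reduce`, and `get_world_size` are real PyTorch calls.

```python
def distributed_ready():
    """Return True only if torch.distributed is usable and a process group exists."""
    try:
        # Imported lazily so the single-process path can be exercised on its own.
        import torch.distributed as dist
    except ImportError:
        return False
    # is_available() must be checked first: builds without distributed
    # support do not expose the rest of the API.
    return dist.is_available() and dist.is_initialized()


def average_across_processes(tensor):
    """Average a tensor over all ranks; return it unchanged in single GPU mode."""
    if not distributed_ready():
        # No process group was initialized, so there is nothing to reduce.
        return tensor
    import torch.distributed as dist

    tensor = tensor.clone()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor / dist.get_world_size()
```

With a guard like this, the same training loop can call `average_across_processes(loss)` whether or not `init_process_group` was ever called.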
Hi there, I am trying to run the code as explained in the README file. I am using an RTX 3090 Ti to run the code on a single GPU, but I'm getting an error that I haven't been able to figure out for days. I reduced the number of workers, epochs and warmup epochs just to get the code working. I also completed all the preprocessing steps mentioned in the README, and they work fine. I did face some incompatibility issues between the torch, torchvision and CUDA versions, but I solved them. Now I'm struggling with this issue.
Your help is appreciated.
I ran this:
python main.py --num-workers 5 --arch ViT-S-p16 --batch-size 512 --epochs 10 --warmup-epochs 2 --data-index /home/guser1/SSiT/data_index/pretraining_dataset.pkl --save-path /home/guser1/SSiT/checkpoints
=================================
arch: ViT-S-p16
data_index: ./data_index/pretraining_dataset.pkl
save_path: /home/guser1/SSiT/checkpoints
record_path: None
pretrained: False
device: cuda
seed: -1
resume: False
distributed: False
backend: nccl
nodes: 1
n_gpus: None
addr: 127.0.0.1
port: 28888
rank: 0
input_size: 224
start_epoch: 0
epochs: 10
warmup_epochs: 2
mask_ratio: 0.25
disable_progress: False
ss: 10
ss_decay: False
cl: 1
saliency_threshold: 0.5
batch_size: 512
optimizer: ADAMW
moco_m: 0.99
temperature: 0.2
learning_rate: 0.001
momentum: 0.9
weight_decay: 0.1
num_workers: 5
save_interval: 20
pool_mode: max
dataset_ratio: 1.0
=================================
Single GPU mode
=================================
Number of training samples: 999
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "main.py", line 140, in <module>
    main()
  File "main.py", line 88, in main
    worker(0, n_gpus, args)
  File "main.py", line 121, in worker
    train(
  File "/home/guser1/SSiT/train.py", line 36, in train
    for step, train_data in progress:
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/guser1/SSiT/data.py", line 64, in __getitem__
    img_stu, img_tea, mask_stu, mask_tea = self.transform(img, mask)
  File "/home/guser1/SSiT/data.py", line 149, in __call__
    img_stu, mask_stu = self.rotation_with_mask(self.rotation_stu, img_stu, mask_stu, self.p_rotation_stu)
  File "/home/guser1/SSiT/data.py", line 179, in rotation_with_mask
    img = F.rotate(img, angle, tf.resample, tf.expand, tf.center, tf.fill)
  File "/home/guser1/anaconda3/envs/ssit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'RandomRotation' object has no attribute 'resample'
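The last line of the traceback suggests a torchvision version mismatch: newer torchvision releases store the setting on `RandomRotation` as `interpolation`, while older ones called it `resample`. Below is a minimal sketch of a version-tolerant lookup. It is not the code in data.py, and the helper names `get_interpolation` and `rotate_like` are hypothetical. It assumes the positional order of `F.rotate` (image, angle, interpolation, expand, center, fill) shown in the traceback.

```python
def get_interpolation(tf):
    """Read the interpolation setting from a RandomRotation-like transform."""
    # Newer torchvision releases name the attribute `interpolation`;
    # older releases used `resample`.
    if hasattr(tf, "interpolation"):
        return tf.interpolation
    return tf.resample


def rotate_like(tf, img, angle):
    """Rotate img by angle using the settings stored on the transform tf."""
    # Imported lazily so get_interpolation can be used without torchvision.
    from torchvision.transforms import functional as F

    return F.rotate(img, angle, get_interpolation(tf), tf.expand, tf.center, tf.fill)
```

In `rotation_with_mask`, replacing the direct `tf.resample` access with a lookup like `get_interpolation(tf)` should let the same code run on both older and newer torchvision versions. Pinning torchvision to the version the README targets is the alternative.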