@ghalib2021 Hi Ahmed, can you please send me what your eval_resnet_8gpu_transfer_in1k_fulltune.yaml looks like?
@iseessel I am also facing this issue when running the SimCLR example; my main objective is to do unsupervised training with your library. I have shared a link to my Colab notebook below. I used exactly the same settings as above and successfully installed all the libraries, but I am still facing the issue even though I followed the tutorial exactly. I hope you can help me with this; I would be really thankful. I have also pasted the link to the YAML file I downloaded: https://dl.fbaipublicfiles.com/vissl/tutorials/configs/quick_1gpu_resnet50_simclr.yaml
https://colab.research.google.com/drive/1Dayb-GkoidpxiogWbOKW8uqHyN6NpPsY?usp=sharing
Traceback (most recent call last):
  File "run_distributed_engines.py", line 194, in <module>
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/content/run_distributed_engines.py", line 166, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "/content/run_distributed_engines.py", line 159, in process_main
    hook_generator=hook_generator,
  File "/usr/local/lib/python3.7/dist-packages/vissl/engines/train.py", line 84, in train_main
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
It seems the root cause of both problems is that the training is not reading the configs properly.
@aiapps2020 your problem is that the config is trying to run the training on 8 GPUs when only 1 is available.
@ghalib2021 your problem is that the config is not reading any of the transforms (in particular, the image is not being converted to a tensor), so the dataloader errors out.
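For the single-GPU Colab case, a minimal sketch of the kind of override this points to (reusing the DISTRIBUTED flags that already appear in the full command further down; the config name here is the tutorial default and may differ from yours):

!python run_distributed_engines.py \
    config=eval_resnet_8gpu_transfer_in1k_fulltune \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1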
@iseessel I double-checked the config file; the number-of-GPUs parameter is one. I also followed the exact tutorial, even downloaded the exact same wheel file, and made sure there is no version conflict. I have attached the YAML file I downloaded, given its path, and shared the Google Colab notebook. If you can help me verify where I went wrong, I will be thankful to you. https://dl.fbaipublicfiles.com/vissl/tutorials/configs/quick_1gpu_resnet50_simclr.yaml
Hi there @aiapps2020 and @ghalib2021. The issue is that Hydra 1.1 is installed, which has a breaking change that causes issues with reading your configs.
Can you downgrade your hydra version to 1.0.6?
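For example, in a Colab cell (assuming Hydra was installed as the hydra-core pip package, which is the package VISSL pulls in):

!pip install hydra-core==1.0.6

You may then need to restart the Colab runtime so the downgraded version is actually picked up.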
@iseessel Thank you very much, let me try.
Thank you very much @iseessel, it already works :)
Instructions To Reproduce the 🐛 Bug:
1. What changes you made (git diff) or what code you wrote:

At first, thanks to the author for this work. I tried to run the VISSL tutorial "Benchmark Full-Finetuning on ImageNet-1K" on Google Colab. The tutorial uses the old PyTorch version 1.5.2+cudnn; however, the current PyTorch version on Google Colab is different. Secondly, the new VISSL library version 0.5.1 requires PyTorch greater than or equal to 1.6.0. So I used the following commands to configure the environment on Google Colab:

PyTorch installation:
!pip install torch==1.8.1+cu101 torchvision -f https://download.pytorch.org/whl/torch_stable.html

OpenCV:
!pip install opencv-python

VISSL:
!pip install vissl

Apex:
!git clone https://github.com/NVIDIA/apex
!git checkout 4a1aa97e31ca87514e17c3cd3bbc03f4204579d0
!python setup.py install --cuda_ext

2. What exact command you run:

After following the steps in the VISSL tutorial "Benchmark Full-Finetuning on ImageNet-1K" and successfully registering dummy_data, I ran this command for fine-tuning:

!python run_distributed_engines.py \
    hydra.verbose=true \
    config=eval_resnet_8gpu_transfer_in1k_fulltune \
    config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
    config.DATA.TRAIN.LABEL_SOURCES=[disk_folder] \
    config.DATA.TRAIN.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2 \
    config.DATA.TEST.DATA_SOURCES=[disk_folder] \
    config.DATA.TEST.LABEL_SOURCES=[disk_folder] \
    config.DATA.TEST.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TEST.BATCHSIZE_PER_REPLICA=2 \
    config.OPTIMIZER.num_epochs=2 \
    config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001] \
    config.OPTIMIZER.param_schedulers.lr.milestones=[1] \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
    config.CHECKPOINT.DIR="./checkpoints" \
    config.MODEL.WEIGHTS_INIT.PARAMS_FILE="/content/resnet50-19c8e357.pth" \
    config.MODEL.WEIGHTS_INIT.APPEND_PREFIX="trunk._feature_blocks." \
    config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME=""
INFO 2021-08-11 05:20:16,063 trainer_main.py: 167: Loss is: CrossEntropyLoss()
INFO 2021-08-11 05:20:16,063 trainer_main.py: 168: Starting training....
INFO 2021-08-11 05:20:16,064 __init__.py: 72: Distributed Sampler config:
{'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 10, 'total_size': 10, 'shuffle': True, 'seed': 0}
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
Traceback (most recent call last):
  File "run_distributed_engines.py", line 194, in <module>
hydra_main(overrides=overrides)
File "run_distributed_engines.py", line 179, in hydra_main
hook_generator=default_hook_generator,
File "run_distributed_engines.py", line 123, in launch_distributed
hook_generator=hook_generator,
File "run_distributed_engines.py", line 166, in _distributed_worker
process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
File "run_distributed_engines.py", line 159, in process_main
hook_generator=hook_generator,
File "/usr/local/lib/python3.7/dist-packages/vissl/engines/train.py", line 102, in train_main
trainer.train()
File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/trainer_main.py", line 171, in train
self._advance_phase(task) # advances task.phase_idx
File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/trainer_main.py", line 286, in _advance_phase
phase_type, epoch=task.phase_idx, compute_start_iter=compute_start_iter
File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/train_task.py", line 501, in recreate_data_iterator
self.data_iterator = iter(self.dataloaders[phase_type])
File "/usr/local/lib/python3.7/dist-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 40, in iter
self.preload()
File "/usr/local/lib/python3.7/dist-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 46, in preload
self.cache_next = next(self._iter)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 517, in next
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
data.reraise()
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 73, in
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 83, in default_collate
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 83, in
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 85, in default_collate
raise TypeError(default_collate_err_msg_format.format(elem_type))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'PIL.Image.Image'>
Please help me in this regard; I will be thankful to you.