facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Bug in Benchmark Full-Finetuning on ImageNet-1K #396

Closed: ghalib2021 closed this issue 3 years ago

ghalib2021 commented 3 years ago

Instructions To Reproduce the 🐛 Bug:

1. What changes you made (git diff) or what code you wrote:

First of all, thanks to the authors for this work. I tried to run the VISSL tutorial "Benchmark Full-Finetuning on ImageNet-1K" on Google Colab. The tutorial uses the old PyTorch version 1.5.2+cudnn, but the current PyTorch version on Google Colab is different. Also, the new VISSL release 0.5.1 requires PyTorch >= 1.6.0. So I used the following commands to configure the environment on Google Colab:

```
# PyTorch installation
!pip install torch==1.8.1+cu101 torchvision -f https://download.pytorch.org/whl/torch_stable.html
# OpenCV
!pip install opencv-python
# VISSL
!pip install vissl
# Apex
!git clone https://github.com/NVIDIA/apex
!git checkout 4a1aa97e31ca87514e17c3cd3bbc03f4204579d0
!python setup.py install --cuda_ext
```

2. What exact command you ran:

After following the steps in the VISSL "Benchmark Full-Finetuning on ImageNet-1K" tutorial and successfully registering dummy_data, I ran this command for fine-tuning:

```
!python run_distributed_engines.py \
    hydra.verbose=true \
    config=eval_resnet_8gpu_transfer_in1k_fulltune \
    config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
    config.DATA.TRAIN.LABEL_SOURCES=[disk_folder] \
    config.DATA.TRAIN.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2 \
    config.DATA.TEST.DATA_SOURCES=[disk_folder] \
    config.DATA.TEST.LABEL_SOURCES=[disk_folder] \
    config.DATA.TEST.DATASET_NAMES=[dummy_data_folder] \
    config.DATA.TEST.BATCHSIZE_PER_REPLICA=2 \
    config.OPTIMIZER.num_epochs=2 \
    config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001] \
    config.OPTIMIZER.param_schedulers.lr.milestones=[1] \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
    config.CHECKPOINT.DIR="./checkpoints" \
    config.MODEL.WEIGHTS_INIT.PARAMS_FILE="/content/resnet50-19c8e357.pth" \
    config.MODEL.WEIGHTS_INIT.APPEND_PREFIX="trunk._feature_blocks." \
    config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME=""
```

3. What you observed (including full logs):

This is the log I am getting. Please help me with this; I am stuck. I have tried different versions, but they all give me the same issue.

```
INFO 2021-08-11 05:20:16,063 trainer_main.py: 167: Loss is: CrossEntropyLoss()
INFO 2021-08-11 05:20:16,063 trainer_main.py: 168: Starting training....
INFO 2021-08-11 05:20:16,064 __init__.py: 72: Distributed Sampler config:
{'num_replicas': 1, 'rank': 0, 'epoch': 0, 'num_samples': 10, 'total_size': 10, 'shuffle': True, 'seed': 0}
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
Traceback (most recent call last):
  File "run_distributed_engines.py", line 194, in <module>
    hydra_main(overrides=overrides)
  File "run_distributed_engines.py", line 179, in hydra_main
    hook_generator=default_hook_generator,
  File "run_distributed_engines.py", line 123, in launch_distributed
    hook_generator=hook_generator,
  File "run_distributed_engines.py", line 166, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "run_distributed_engines.py", line 159, in process_main
    hook_generator=hook_generator,
  File "/usr/local/lib/python3.7/dist-packages/vissl/engines/train.py", line 102, in train_main
    trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/trainer_main.py", line 171, in train
    self._advance_phase(task)  # advances task.phase_idx
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/trainer_main.py", line 286, in _advance_phase
    phase_type, epoch=task.phase_idx, compute_start_iter=compute_start_iter
  File "/usr/local/lib/python3.7/dist-packages/vissl/trainer/train_task.py", line 501, in recreate_data_iterator
    self.data_iterator = iter(self.dataloaders[phase_type])
  File "/usr/local/lib/python3.7/dist-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 40, in __iter__
    self.preload()
  File "/usr/local/lib/python3.7/dist-packages/classy_vision/dataset/dataloader_async_gpu_wrapper.py", line 46, in preload
    self.cache_next = next(self._iter)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.

Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 83, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 83, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 85, in default_collate
    raise TypeError(default_collate_err_msg_format.format(elem_type))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'PIL.Image.Image'>
```
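For reference, this failure is reproducible outside VISSL: PyTorch's `default_collate` only knows how to batch tensors, numpy arrays, numbers, dicts and lists, so a dataset that returns raw PIL images (i.e. no `ToTensor`-style transform was applied) fails in exactly this way. A minimal sketch of the failure and the fix, assuming only `torch`, `torchvision` and Pillow are installed (the dataset and shapes below are made up for illustration):

```python
# Minimal sketch (illustrative only, not VISSL code): default_collate cannot
# batch raw PIL images, which is what happens when no ToTensor-style transform
# is applied to the samples.
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from PIL import Image


class DummyDataset(Dataset):
    """Mimics a sample dict of the form {'data': [image], 'label': [0]}."""

    def __init__(self, transform=None):
        self.transform = transform

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        img = Image.new("RGB", (32, 32))
        if self.transform is not None:
            img = self.transform(img)
        return {"data": [img], "label": [0]}


# Without a transform: raises
# TypeError: default_collate: batch must contain tensors, numpy arrays, numbers,
# dicts or lists; found <class 'PIL.Image.Image'>
try:
    next(iter(DataLoader(DummyDataset(), batch_size=2)))
except TypeError as err:
    print("collate failed:", err)

# With ToTensor applied (as the benchmark config is supposed to do), batching works.
batch = next(iter(DataLoader(DummyDataset(transforms.ToTensor()), batch_size=2)))
print(batch["data"][0].shape)  # torch.Size([2, 3, 32, 32])
```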

4. Please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset:

I have followed the "Benchmark Full-Finetuning on ImageNet-1K" example.

Please help me with this; I would be thankful to you.

iseessel commented 3 years ago

@ghalib2021 Hi Ahmed, can you please send me what your eval_resnet_8gpu_transfer_in1k_fulltune.yaml looks like?

aiapps2020 commented 3 years ago

@iseessel I am also facing this issue with the SimCLR example; my main objective is to do unsupervised training with your library. I have shared the link to my Colab notebook below. I used the exact same setup as above and successfully installed all the libraries, but I still hit the issue when following the tutorial exactly. I hope you can help me with this; I would be really thankful to you. Here is the link to the yaml file I downloaded: https://dl.fbaipublicfiles.com/vissl/tutorials/configs/quick_1gpu_resnet50_simclr.yaml

https://colab.research.google.com/drive/1Dayb-GkoidpxiogWbOKW8uqHyN6NpPsY?usp=sharing

```
Traceback (most recent call last):
  File "run_distributed_engines.py", line 194, in <module>
    hydra_main(overrides=overrides)
  File "run_distributed_engines.py", line 179, in hydra_main
    hook_generator=default_hook_generator,
  File "run_distributed_engines.py", line 112, in launch_distributed
    daemon=False,
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/content/run_distributed_engines.py", line 166, in _distributed_worker
    process_main(cfg, dist_run_id, local_rank=local_rank, node_id=node_id)
  File "/content/run_distributed_engines.py", line 159, in process_main
    hook_generator=hook_generator,
  File "/usr/local/lib/python3.7/dist-packages/vissl/engines/train.py", line 84, in train_main
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
```
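For reference, "invalid device ordinal" means a process asked for a GPU index that does not exist on the machine, e.g. worker rank 1 calling `torch.cuda.set_device(1)` on a single-GPU Colab runtime. A minimal sketch, assuming a CUDA machine with fewer GPUs than the requested rank:

```python
# Minimal sketch (illustrative only): set_device fails with "invalid device
# ordinal" whenever the requested rank is >= torch.cuda.device_count(), which
# is what happens if more worker processes are spawned per node than there are
# GPUs (e.g. an 8-GPU config on a 1-GPU Colab runtime).
import torch

print("visible GPUs:", torch.cuda.device_count())

local_rank = 1  # hypothetical second worker on a 1-GPU machine
try:
    torch.cuda.set_device(local_rank)
except RuntimeError as err:
    print("set_device failed:", err)  # CUDA error: invalid device ordinal
```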

iseessel commented 3 years ago

It seems the root cause of both problems is that the training is not reading the configs properly.

@aiapps2020 your problem is that the config is trying to run the training on 8 GPUs when only 1 is available.

@ghalib2021 your problem is that the config is not applying any of the transforms (in particular, the image is not being converted to a tensor), so the DataLoader errors out.
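A quick way to sanity-check what the trainer will actually see is to load the benchmark yaml directly with OmegaConf and print the relevant sections; the file path below is an assumption and depends on where the configs were downloaded:

```python
# Hypothetical sanity check (the path is an assumption, adjust to your setup):
# print the distributed settings and the train transforms to confirm that
# NUM_PROC_PER_NODE matches the available GPUs and that a ToTensor-style
# transform is present.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/config/eval_resnet_8gpu_transfer_in1k_fulltune.yaml")
print(OmegaConf.to_yaml(cfg.config.DISTRIBUTED))
print(OmegaConf.to_yaml(cfg.config.DATA.TRAIN.TRANSFORMS))
```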

aiapps2020 commented 3 years ago

@iseessel I double-checked the config file and the number-of-GPUs parameter is one. I also followed the exact tutorial, downloaded the exact same whl file, and made sure there is no version conflict. I have attached the yaml file I downloaded and given its path, along with the Google Colab notebook. If you could help me verify where I went wrong, I would be thankful to you. https://dl.fbaipublicfiles.com/vissl/tutorials/configs/quick_1gpu_resnet50_simclr.yaml

iseessel commented 3 years ago

Hi there @aiapps2020 and @ghalib2021. The issue is that hydra 1.1 is installed, which has a breaking change that prevents your configs from being read correctly.

Can you downgrade your hydra version to 1.0.6?
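A quick way to verify that the downgrade took effect (the pip package is `hydra-core`; this snippet is a suggestion, not from the original comment):

```python
# After `pip install hydra-core==1.0.6` (and restarting the Colab runtime),
# confirm that the downgraded version is the one actually being imported.
import hydra

print(hydra.__version__)  # expected: 1.0.6
```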

ghalib2021 commented 3 years ago

@iseessel Thank you very much, let me try.

ghalib2021 commented 3 years ago

Thank you very much @iseessel, it already works :)