Closed VicaYang closed 2 years ago
I moved run_distributed_engines.py out of the tools folder and it works now, but I am still wondering whether I can place dataset_catalog.json in some location where the prebuilt VISSL can load it anyway when the other attempts fail.
Hi @VicaYang,
First of all, thanks a lot for using VISSL and reporting this!
I tried your example and got the same issue you had (actually, I am not sure it is exactly the same case, as I am not sure I understand where dataset_catalog.json is in the filesystem in the test case you reported; in my case, I created a configs/config/dataset_catalog.json file right next to the tools folder).
Here is a way to deal with this issue temporarily while I debug it further:
You can use the environment variable VISSL_DATASET_CATALOG_PATH to point to your own dataset catalog (the one holding the paths to imagenet1k in your case) like so:
VISSL_DATASET_CATALOG_PATH=configs/config/dataset_catalog.json python tools/run_distributed_engines.py \
config=benchmark/linear_image_classification/imagenet1k/eval_resnet_8gpu_transfer_in1k_linear \
config.CHECKPOINT.DIR="..." \
config.MODEL.WEIGHTS_INIT.PARAMS_FILE="..."
In my case, this solved the issue. Could you please try it and report what you got?
Thank you, Quentin
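For anyone who prefers to launch from a script, the workaround above can be sketched in Python. This is a minimal sketch, not part of VISSL: build_launch_env is a hypothetical helper, and it assumes VISSL reads VISSL_DATASET_CATALOG_PATH as a plain file path, so that an absolute path is accepted just like the relative one used in the command above.

```python
import json
import os
from pathlib import Path


def build_launch_env(catalog_path, base_env=None):
    """Return a copy of the environment with VISSL_DATASET_CATALOG_PATH set.

    Hypothetical helper (not a VISSL API). The path is made absolute so the
    launch no longer depends on the current working directory.
    """
    path = Path(catalog_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"dataset catalog not found: {path}")
    # Parse the file once so a malformed catalog fails here, before training starts.
    with path.open() as f:
        json.load(f)
    env = dict(os.environ if base_env is None else base_env)
    env["VISSL_DATASET_CATALOG_PATH"] = str(path)
    return env


# Example launch, mirroring the shell command above:
#   import subprocess, sys
#   env = build_launch_env("configs/config/dataset_catalog.json")
#   subprocess.run(
#       [sys.executable, "tools/run_distributed_engines.py",
#        "config=benchmark/linear_image_classification/imagenet1k/"
#        "eval_resnet_8gpu_transfer_in1k_linear"],
#       env=env, check=True,
#   )
```

Checking that the file exists and parses before launching surfaces a wrong path immediately instead of after the 8 workers have started.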
Thanks @QuentinDuval for your help.
I paste the tree result below to help others who meet similar issues.
.
├── configs
│   ├── config
│   │   ├── benchmark
│   │   ├── dataset_catalog.json
│   │   ├── debugging
│   │   ├── extract_cluster
│   │   ├── feature_extraction
│   │   ├── __init__.py
│   │   ├── model_zoo
│   │   ├── pretrain
│   │   └── test
│   ├── __init__.py
│   └── __pycache__
│       └── __init__.cpython-38.pyc
├── run_distributed_engines.py
└── tools
    ├── cluster_assignments_to_dataset.py
    ├── cluster_features_and_label.py
    ├── __init__.py
    ├── instance_retrieval_test.py
    ├── launch_benchmark_suite_scheduler_slurm.py
    ├── nearest_neighbor_test.py
    ├── object_detection_benchmark.py
    ├── perf_measurement
    │   ├── benchmark_data.py
    │   ├── benchmark_transforms.py
    │   ├── __init__.py
    │   └── README.md
    ├── run_distributed_engines.py
    ├── train_svm_low_shot.py
    └── train_svm.py
Under this folder structure, running python tools/run_distributed_engines.py cannot load the datasets correctly, while python run_distributed_engines.py and VISSL_DATASET_CATALOG_PATH=configs/config/dataset_catalog.json python tools/run_distributed_engines.py can. I believe that using VISSL_DATASET_CATALOG_PATH is the best choice.
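When the datasets still fail to load, it can help to confirm that the catalog being picked up actually points at data on disk. The sketch below is a hypothetical checker, not a VISSL function, and it assumes the catalog maps each dataset name to splits, each split holding a list of paths, for example {"imagenet1k_folder": {"train": [...], "val": [...]}}. Adjust it if your catalog is laid out differently.

```python
import json
from pathlib import Path


def find_missing_paths(catalog):
    """Return (dataset, split, path) triples whose path does not exist on disk.

    Assumed layout (an assumption, see above): {dataset: {split: [path, ...]}}.
    """
    missing = []
    for dataset, splits in catalog.items():
        if not isinstance(splits, dict):
            continue  # skip entries that do not follow the assumed layout
        for split, paths in splits.items():
            if isinstance(paths, str):
                paths = [paths]
            for p in paths:
                # Unfilled placeholder entries are reported as missing as well.
                if not Path(p).exists():
                    missing.append((dataset, split, p))
    return missing


# Example:
#   with open("configs/config/dataset_catalog.json") as f:
#       for dataset, split, path in find_missing_paths(json.load(f)):
#           print(f"{dataset}/{split}: {path} does not exist")
```

An empty result for the dataset you are using suggests the catalog contents are fine and the problem is which catalog file gets loaded.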
If you do not know the root cause of the problem, and wish someone to help you, please post according to this template:
Instructions To Reproduce the Issue:
Check https://stackoverflow.com/help/minimal-reproducible-example for how to ask good questions. Simplify the steps to reproduce the issue using suggestions from the above link, and provide them below:
Changes (git diff): I copied tools/run_distributed_engines.py and configs into the folder and modified configs/config/dataset_catalog.json
CPU info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping: 1
CPU MHz: 1202.994
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
INFO 2022-05-12 09:03:02,206 trainer_main.py: 112: Using Distributed init method: tcp://localhost:35561, world_size: 8, rank: 0
INFO 2022-05-12 09:03:02,264 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 2
INFO 2022-05-12 09:03:02,339 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 6
INFO 2022-05-12 09:03:02,450 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 3
INFO 2022-05-12 09:03:02,580 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 7
INFO 2022-05-12 09:03:02,590 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 5
INFO 2022-05-12 09:03:02,596 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 1
INFO 2022-05-12 09:03:02,604 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 4
INFO 2022-05-12 09:03:02,612 distributed_c10d.py: 187: Added key: store_based_barrier_key:1 to store for rank: 0
INFO 2022-05-12 09:03:02,612 trainer_main.py: 130: | initialized host test as rank 0 (0)
INFO 2022-05-12 09:03:02,615 trainer_main.py: 130: | initialized host test as rank 4 (4)
INFO 2022-05-12 09:03:02,615 trainer_main.py: 130: | initialized host test as rank 3 (3)
INFO 2022-05-12 09:03:02,617 trainer_main.py: 130: | initialized host test as rank 2 (2)
INFO 2022-05-12 09:03:02,617 trainer_main.py: 130: | initialized host test as rank 6 (6)
INFO 2022-05-12 09:03:02,617 trainer_main.py: 130: | initialized host test as rank 1 (1)
INFO 2022-05-12 09:03:02,621 trainer_main.py: 130: | initialized host test as rank 5 (5)
INFO 2022-05-12 09:03:02,622 trainer_main.py: 130: | initialized host test as rank 7 (7)
INFO 2022-05-12 09:03:11,236 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2022-05-12 09:03:11,237 train_task.py: 455: Building model....
INFO 2022-05-12 09:03:11,237 feature_extractor.py: 27: Creating Feature extractor trunk...
INFO 2022-05-12 09:03:11,237 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2022-05-12 09:03:11,237 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2022-05-12 09:03:11,240 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2022-05-12 09:03:11,241 train_task.py: 455: Building model....
INFO 2022-05-12 09:03:11,241 feature_extractor.py: 27: Creating Feature extractor trunk...
INFO 2022-05-12 09:03:11,241 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2022-05-12 09:03:11,241 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2022-05-12 09:03:11,245 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2022-05-12 09:03:11,245 train_task.py: 455: Building model....
INFO 2022-05-12 09:03:11,246 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2022-05-12 09:03:11,246 feature_extractor.py: 27: Creating Feature extractor trunk...
INFO 2022-05-12 09:03:11,246 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2022-05-12 09:03:11,246 train_task.py: 455: Building model....
INFO 2022-05-12 09:03:11,246 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2022-05-12 09:03:11,247 feature_extractor.py: 27: Creating Feature extractor trunk...
INFO 2022-05-12 09:03:11,247 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2022-05-12 09:03:11,247 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2022-05-12 09:03:11,248 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2022-05-12 09:03:11,249 train_task.py: 455: Building model....
INFO 2022-05-12 09:03:11,249 feature_extractor.py: 27: Creating Feature extractor trunk...
INFO 2022-05-12 09:03:11,249 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2022-05-12 09:03:11,249 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2022-05-12 09:03:11,250 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2022-05-12 09:03:11,250 train_task.py: 455: Building model....
INFO 2022-05-12 09:03:11,249 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2022-05-12 09:03:11,250 feature_extractor.py: 27: Creating Feature extractor trunk...
INFO 2022-05-12 09:03:11,250 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2022-05-12 09:03:11,250 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2022-05-12 09:03:11,250 train_task.py: 181: Not using Automatic Mixed Precision
INFO 2022-05-12 09:03:11,251 train_task.py: 455: Building model....
INFO 2022-05-12 09:03:11,252 train_task.py: 455: Building model....
INFO 2022-05-12 09:03:11,252 feature_extractor.py: 27: Creating Feature extractor trunk...
INFO 2022-05-12 09:03:11,252 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2022-05-12 09:03:11,253 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2022-05-12 09:03:11,253 feature_extractor.py: 27: Creating Feature extractor trunk...
INFO 2022-05-12 09:03:11,253 resnext.py: 64: ResNeXT trunk, supports activation checkpointing. Deactivated
INFO 2022-05-12 09:03:11,254 resnext.py: 87: Building model: ResNeXt50-1x64d-w1-BatchNorm2d
INFO 2022-05-12 09:03:11,920 feature_extractor.py: 50: Freezing model trunk...
INFO 2022-05-12 09:03:11,922 feature_extractor.py: 50: Freezing model trunk...
INFO 2022-05-12 09:03:11,923 feature_extractor.py: 50: Freezing model trunk...
INFO 2022-05-12 09:03:11,925 feature_extractor.py: 50: Freezing model trunk...
INFO 2022-05-12 09:03:11,925 feature_extractor.py: 50: Freezing model trunk...
INFO 2022-05-12 09:03:11,955 feature_extractor.py: 50: Freezing model trunk...
INFO 2022-05-12 09:03:11,956 feature_extractor.py: 50: Freezing model trunk...
INFO 2022-05-12 09:03:11,978 feature_extractor.py: 50: Freezing model trunk...
INFO 2022-05-12 09:03:12,368 model_helpers.py: 177: Using SyncBN group size: 8
INFO 2022-05-12 09:03:12,368 model_helpers.py: 181: Converting BN layers to Apex SyncBN
INFO 2022-05-12 09:03:12,369 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 4
INFO 2022-05-12 09:03:12,389 model_helpers.py: 177: Using SyncBN group size: 8
INFO 2022-05-12 09:03:12,389 model_helpers.py: 181: Converting BN layers to Apex SyncBN
INFO 2022-05-12 09:03:12,390 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 2
INFO 2022-05-12 09:03:12,401 model_helpers.py: 177: Using SyncBN group size: 8
INFO 2022-05-12 09:03:12,401 model_helpers.py: 181: Converting BN layers to Apex SyncBN
INFO 2022-05-12 09:03:12,401 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 3
INFO 2022-05-12 09:03:12,412 model_helpers.py: 177: Using SyncBN group size: 8
INFO 2022-05-12 09:03:12,413 model_helpers.py: 181: Converting BN layers to Apex SyncBN
INFO 2022-05-12 09:03:12,413 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 6
INFO 2022-05-12 09:03:12,415 model_helpers.py: 177: Using SyncBN group size: 8
INFO 2022-05-12 09:03:12,415 model_helpers.py: 181: Converting BN layers to Apex SyncBN
INFO 2022-05-12 09:03:12,416 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 1
INFO 2022-05-12 09:03:12,425 model_helpers.py: 177: Using SyncBN group size: 8
INFO 2022-05-12 09:03:12,426 model_helpers.py: 181: Converting BN layers to Apex SyncBN
INFO 2022-05-12 09:03:12,426 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 0
INFO 2022-05-12 09:03:12,434 model_helpers.py: 177: Using SyncBN group size: 8
INFO 2022-05-12 09:03:12,434 model_helpers.py: 181: Converting BN layers to Apex SyncBN
INFO 2022-05-12 09:03:12,434 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 5
INFO 2022-05-12 09:03:12,442 model_helpers.py: 177: Using SyncBN group size: 8
INFO 2022-05-12 09:03:12,442 model_helpers.py: 181: Converting BN layers to Apex SyncBN
INFO 2022-05-12 09:03:12,442 distributed_c10d.py: 187: Added key: store_based_barrier_key:2 to store for rank: 7
INFO 2022-05-12 09:03:12,454 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk...
INFO 2022-05-12 09:03:12,454 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk...
INFO 2022-05-12 09:03:12,454 base_ssl_model.py: 195: Freezing model trunk...
INFO 2022-05-12 09:03:12,454 base_ssl_model.py: 195: Freezing model trunk...
INFO 2022-05-12 09:03:12,455 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,455 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,455 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,455 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,458 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk...
INFO 2022-05-12 09:03:12,459 base_ssl_model.py: 195: Freezing model trunk...
INFO 2022-05-12 09:03:12,459 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk...
INFO 2022-05-12 09:03:12,459 base_ssl_model.py: 195: Freezing model trunk...
INFO 2022-05-12 09:03:12,459 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk...
INFO 2022-05-12 09:03:12,459 base_ssl_model.py: 195: Freezing model trunk...
INFO 2022-05-12 09:03:12,459 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,460 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,460 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,460 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,460 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,460 util.py: 276: Attempting to load checkpoint from ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,465 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk...
INFO 2022-05-12 09:03:12,465 base_ssl_model.py: 195: Freezing model trunk...
INFO 2022-05-12 09:03:12,465 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk...
INFO 2022-05-12 09:03:12,465 base_ssl_model.py: 195: Freezing model trunk...
INFO 2022-05-12 09:03:12,466 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,466 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,466 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,467 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,472 train_task.py: 472: config.MODEL.FEATURE_EVAL_SETTINGS.FREEZE_TRUNK_ONLY=True, will freeze trunk...
INFO 2022-05-12 09:03:12,472 base_ssl_model.py: 195: Freezing model trunk...
INFO 2022-05-12 09:03:12,473 train_task.py: 429: Initializing model from: ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,473 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,689 util.py: 281: Loaded checkpoint from ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:12,689 util.py: 240: Broadcasting checkpoint loaded from ../weights/resnet50-19c8e357.pth
INFO 2022-05-12 09:03:16,755 train_task.py: 435: Checkpoint loaded: ../weights/resnet50-19c8e357.pth...
INFO 2022-05-12 09:03:16,760 checkpoint.py: 885: Loaded: trunk.base_model._feature_blocks.conv1.weight of shape: torch.Size([64, 3, 7, 7]) from checkpoint