wasilone11 closed this 8 months ago
I get the same error when setting it to 4 GPUs.
Detailed log:
  File "/home2/wasilone11/miniconda3/envs/gze/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home2/wasilone11/miniconda3/envs/gze/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home2/wasilone11/GLC/slowfast/utils/multiprocessing.py", line 60, in run
    ret = func(cfg)
          ^^^^^^^^^
  File "/ssd_scratch/cvit/wasi/GLC/tools/train_gaze_net.py", line 384, in train
    start_epoch = cu.load_train_checkpoint(cfg, model, optimizer, scaler if cfg.TRAIN.MIXED_PRECISION else None)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home2/wasilone11/GLC/slowfast/utils/checkpoint.py", line 523, in load_train_checkpoint
    checkpoint_epoch = load_checkpoint(
                       ^^^^^^^^^^^^^^^^
  File "/home2/wasilone11/GLC/slowfast/utils/checkpoint.py", line 292, in load_checkpoint
    checkpoint = torch.load(f, map_location="cpu")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home2/wasilone11/miniconda3/envs/gze/lib/python3.11/site-packages/torch/serialization.py", line 993, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home2/wasilone11/miniconda3/envs/gze/lib/python3.11/site-packages/torch/serialization.py", line 447, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
I am using this to train:
CUDA_VISIBLE_DEVICES=0,1,2,3 python tools/run_net.py --init_method tcp://localhost:9877 --cfg configs/Egtea/MVIT_B_16x4_CONV.yaml TRAIN.BATCH_SIZE 16 TEST.BATCH_SIZE 128 NUM_GPUS 2 TRAIN.CHECKPOINT_FILE_PATH /home2/wasilone11/GLC/K400_MVIT_B_16x4_CONV.pyth OUTPUT_DIR checkpoints/GLC DATA.PATH_PREFIX data/train_gaze_official.csv/ DATA.PATH_TO_DATA_DIR ssd_scratch/cvit/wasi/egtea/
Sorry, I was busy with a deadline over the past month. Is it solved?
While training using:
CUDA_VISIBLE_DEVICES=0,1,2,3 python tools/run_net.py --init_method tcp://localhost:9877 --cfg configs/Egtea/MVIT_B_16x4_CONV.yaml TRAIN.BATCH_SIZE 16 TEST.BATCH_SIZE 128 NUM_GPUS 2 TRAIN.CHECKPOINT_FILE_PATH /home2/wasilone11/GLC/MViT_Egtea_ckpt.pyth OUTPUT_DIR checkpoints/GLC DATA.PATH_PREFIX /ssd_scratch/cvit/wasi/egtea
I get the error:
                       ^^^^^^^^^^^^^^^^
  File "/home2/wasilone11/GLC/slowfast/utils/checkpoint.py", line 292, in load_checkpoint
    checkpoint = torch.load(f, map_location="cpu")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home2/wasilone11/miniconda3/envs/gze/lib/python3.11/site-packages/torch/serialization.py", line 993, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home2/wasilone11/miniconda3/envs/gze/lib/python3.11/site-packages/torch/serialization.py", line 447, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
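For anyone hitting the same message: "failed finding central directory" is raised while torch.load is reading the checkpoint as a zip archive, which often points to a truncated or incomplete .pyth file (for example an interrupted download) rather than to the GPU settings. That diagnosis is an assumption here, since the thread does not confirm the cause. A minimal sketch for checking the file without importing torch, where inspect_checkpoint is a hypothetical helper name and not part of GLC or SlowFast:

```python
import os
import sys
import zipfile


def inspect_checkpoint(path):
    """Collect basic facts about a checkpoint file (hypothetical helper)."""
    info = {
        "exists": os.path.isfile(path),
        "size_bytes": None,
        "starts_like_zip": False,
        "has_central_directory": False,
    }
    if not info["exists"]:
        return info
    info["size_bytes"] = os.path.getsize(path)
    # A zip archive begins with the local file header signature.
    with open(path, "rb") as f:
        info["starts_like_zip"] = f.read(4) == b"PK\x03\x04"
    # is_zipfile looks for the end-of-central-directory record near the end
    # of the file, so a truncated archive is reported as False.
    info["has_central_directory"] = zipfile.is_zipfile(path)
    return info


if __name__ == "__main__":
    # Usage: python inspect_ckpt.py /path/to/checkpoint.pyth
    for ckpt_path in sys.argv[1:]:
        print(ckpt_path, inspect_checkpoint(ckpt_path))
```

If starts_like_zip is True but has_central_directory is False, the file was probably cut off, and comparing size_bytes with the size listed at the download source and re-downloading is a reasonable next step.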