Closed wjj-w closed 4 months ago
Hi, sorry for my late reply.
Are you using the PyTorch version recommended in the repo?
I have not encountered this error before. It looks like some of the variables have an unexpected input type when they are gathered across GPUs. Could you print the type and contents of those variables? Then I can take a closer look at what is going on.
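A minimal sketch of how the types could be printed, assuming the values passed to `du.all_gather` are nested lists or tuples of tensors and `None`. `describe` is a hypothetical helper, not a function from the repo, and it detects tensors by duck typing (`shape` and `dtype` attributes) so it runs without importing PyTorch:

```python
def describe(obj, depth=0):
    """Return an indented, human-readable summary of a nested structure.

    Hypothetical debugging helper; not part of the repo.
    """
    pad = "  " * depth
    if obj is None:
        return f"{pad}None"
    if isinstance(obj, (list, tuple)):
        # Report the container type and length, then recurse into each item.
        lines = [f"{pad}{type(obj).__name__} (len={len(obj)})"]
        lines += [describe(item, depth + 1) for item in obj]
        return "\n".join(lines)
    if hasattr(obj, "shape") and hasattr(obj, "dtype"):
        # Tensor-like object (e.g. torch.Tensor), detected by duck typing.
        device = getattr(obj, "device", "n/a")
        return (
            f"{pad}{type(obj).__name__} shape={tuple(obj.shape)} "
            f"dtype={obj.dtype} device={device}"
        )
    # Anything else: show its type and repr.
    return f"{pad}{type(obj).__name__}: {obj!r}"
```

Calling something like `print(describe(preds))` just before the gather step would show whether each entry is a tensor, a container, or `None`.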
Thanks.
Thank you for your reply. I printed the ones_like input tensor. It seems to be a tuple rather than a tensor. The version of pytorch I use is 1.11.0, which is the same as your repo.
Individual Tensor Type: <class 'tuple'>
Individual Tensor Content: ([tensor([[23.7616, 23.2275, 23.0269, ..., 20.6832, 22.3174, 23.6219], [20.2759, 26.1319, 19.4015, ..., 17.1857, 19.1118, 27.7969], [22.3471, 26.9985, 24.6766, ..., 21.7580, 19.7570, 26.3889], ..., [27.5078, 20.9853, 28.0410, ..., 29.1786, 25.9105, 26.3851], [21.5458, 18.9113, 21.4838, ..., 17.2257, 21.6529, 23.6969], [19.2916, 17.5876, 17.6534, ..., 15.9270, 14.3989, 18.1825]], device='cuda:0'), None], [None, None])
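The printed value is a tuple of two lists, with one tensor in the first slot and `None` everywhere else, whereas `torch.ones_like` expects a tensor. One possible workaround, sketched under the assumption that the first tensor in the structure holds the predictions (nothing in this thread confirms that), is to pull that tensor out before gathering. `first_tensor` is a hypothetical helper, not part of the repo, and it uses duck typing so it runs without PyTorch installed:

```python
def first_tensor(obj):
    """Return the first tensor-like leaf in a nested list/tuple, or None.

    Hypothetical helper; assumes the first tensor found is the one to gather.
    """
    if obj is None:
        return None
    if isinstance(obj, (list, tuple)):
        # Depth-first search through the container, left to right.
        for item in obj:
            found = first_tensor(item)
            if found is not None:
                return found
        return None
    # Duck-typed tensor check (e.g. torch.Tensor has both attributes).
    if hasattr(obj, "shape") and hasattr(obj, "dtype"):
        return obj
    return None
```

If the assumption holds, something like `preds = first_tensor(preds)` before the gather call would hand `torch.ones_like` a tensor instead of a tuple. It would not explain why the model returns a tuple in this environment.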
Hi, I do not know how this error appears. Could you tell me which script you are running?
I'm running train_clip_B2N_ucf.sh. My settings are as follows.
ROOT=/home/wjj/projects/froster/froster
CKPT=/home/wjj/projects/froster/ckpt_output
B2N_ucf_file=B2N_ucf101
TRAIN_FILE=train.csv
VAL_FILE=val.csv
TEST_FILE=test.csv
cd $ROOT
TORCH_DISTRIBUTED_DEBUG=INFO python -W ignore -u tools/run_net.py \
  --cfg configs/Kinetics/TemporalCLIP_vitb16_8x16_STAdapter_UCF101.yaml \
  --opts DATA.PATH_TO_DATA_DIR $ROOT/zs_label_db/$B2N_ucf_file \
  DATA.PATH_PREFIX /mnt/DataDisk01/Data/UCF101/dataset/UCF-101 \
  TRAIN_FILE $TRAIN_FILE \
  VAL_FILE $VAL_FILE \
  TEST_FILE $TEST_FILE \
  DATA.PATH_LABEL_SEPARATOR , \
  DATA.INDEX_LABEL_MAPPING_FILE $ROOT/zs_label_db/$B2N_ucf_file/train_rephrased.json \
  TRAIN.ENABLE True \
  OUTPUT_DIR $CKPT/basetraining/B2N_ucf101_froster \
  TRAIN.BATCH_SIZE 32 \
  TEST.BATCH_SIZE 240 \
  TEST.NUM_ENSEMBLE_VIEWS 3 \
  TEST.NUM_SPATIAL_CROPS 1 \
  NUM_GPUS 4 \
  SOLVER.MAX_EPOCH 12 \
  SOLVER.WARMUP_EPOCHS 2.0 \
  SOLVER.BASE_LR 3.33e-6 \
  SOLVER.WARMUP_START_LR 3.33e-8 \
  SOLVER.COSINE_END_LR 3.33e-8 \
  TRAIN.MIXED_PRECISION True \
  DATA.DECODING_BACKEND "pyav" \
  MODEL.NUM_CLASSES 51 \
  MIXUP.ENABLE False \
  AUG.ENABLE False \
  AUG.NUM_SAMPLE 1 \
  TRAIN.EVAL_PERIOD 1 \
  TRAIN.CHECKPOINT_PERIOD 1 \
  MODEL.LOSS_FUNC soft_cross_entropy \
  TRAIN.LINEAR_CONNECT_CLIMB False \
  TRAIN.CLIP_ORI_PATH /root/.cache/clip/ViT-B-16.pt \
  TRAIN.LINEAR_CONNECT_LOSS_RATIO 0.0 \
  MODEL.RAW_MODEL_DISTILLATION True \
  MODEL.KEEP_RAW_MODEL True \
  MODEL.DISTILLATION_RATIO 2.0
Hi, I ran the experiment with this script and it works fine on my server.
To rule out an environment issue, I have uploaded a requirements.txt file to the repo. You can check your installed packages against it. I hope this helps!
Thank you! I'll check my environment.
Hello, I am trying to train on the UCF101 dataset, but I hit the following error during the test stage. Do you have a solution? I look forward to your reply.
Traceback (most recent call last):
File "tools/run_net.py", line 56, in <module>
main()
File "tools/run_net.py", line 41, in main
launch_job(cfg=cfg, init_method=args.init_method, func=test)
File "/home/wjj/projects/froster/froster/slowfast/utils/misc.py", line 416, in launch_job
torch.multiprocessing.spawn(
File "/home/wjj/.conda/envs/mff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/wjj/.conda/envs/mff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/wjj/.conda/envs/mff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/wjj/.conda/envs/mff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/wjj/projects/froster/froster/slowfast/utils/multiprocessing.py", line 60, in run
ret = func(cfg)
File "/home/wjj/projects/froster/froster/tools/test_net.py", line 427, in test
test_meter = perform_test(test_loader, model, test_meter, cfg, writer)
File "/home/wjj/.conda/envs/mff/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/wjj/projects/froster/froster/tools/test_net.py", line 145, in perform_test
preds, labels, video_idx = du.all_gather([preds, labels, video_idx])
File "/home/wjj/projects/froster/froster/slowfast/utils/distributed.py", line 41, in all_gather
tensor_placeholder = [
File "/home/wjj/projects/froster/froster/slowfast/utils/distributed.py", line 42, in <listcomp>
torch.ones_like(tensor) for _ in range(world_size)
TypeError: ones_like(): argument 'input' (position 1) must be Tensor, not tuple