Vchitect / VBench

[CVPR2024 Highlight] VBench - We Evaluate Video Generation
https://vchitect.github.io/VBench-project/
Apache License 2.0

error with torch.distributed for multiprocessing #63

Open quantumiracle opened 2 months ago

quantumiracle commented 2 months ago

Hi,

I want to use VBench with torch.distributed for multiprocessing evaluation; however, I found that only the first process finishes, while all the remaining processes never finish successfully. Here is the code snippet:

import torch
import torch.distributed as dist
import multiprocessing
from vbench import VBench

def setup(rank, world_size):
    """ Setup PyTorch distributed environment. """
    dist.init_process_group(
        backend='nccl',  # 'gloo' or 'nccl' if you're using GPUs
        init_method='tcp://127.0.0.1:12347',
        rank=rank,
        world_size=world_size
    )

def cleanup():
    """ Cleanup distributed environment. """
    dist.destroy_process_group()

def evaluate_videos(rank, world_size):
    setup(rank, world_size)

    # Initialize VBench within a distributed environment
    device = rank
    config_path = 'VBench_full_info.json'
    save_dir = f'test_rank_{rank}'
    video_path = './'

    my_VBench = VBench(device, config_path, save_dir)
    my_VBench.evaluate(
        videos_path=video_path,
        name=f'VideoEvaluation_{rank}',
        dimension_list=['temporal_flickering'],
        mode='custom_input'
    )

    cleanup()

def run_distributed():
    world_size = 2  # Number of processes
    processes = []

    for rank in range(world_size):
        p = multiprocessing.Process(target=evaluate_videos, args=(rank, world_size))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

if __name__ == '__main__':
    run_distributed()
NattapolChan commented 2 months ago

Just to clarify, may I know what you are trying to achieve here? Is this for distributing the workload across multiple GPUs, or for multi-process multi-GPU evaluation? The current code structure does not work even without vbench.
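
For reference, a common multi-process multi-GPU pattern (independent of VBench) uses torch.multiprocessing.spawn rather than raw multiprocessing.Process, since CUDA/NCCL generally requires the 'spawn' start method, and it pins each rank to its own GPU. Below is a minimal sketch along those lines, reusing the same VBench calls as in the snippet above; whether VBench itself supports being driven concurrently like this is a separate question.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from vbench import VBench

def evaluate_videos(rank, world_size):
    # One process per GPU; NCCL needs each rank bound to its own device.
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://127.0.0.1:12347',
        rank=rank,
        world_size=world_size
    )
    torch.cuda.set_device(rank)
    device = torch.device(f'cuda:{rank}')

    # Same arguments as in the snippet above.
    my_VBench = VBench(device, 'VBench_full_info.json', f'test_rank_{rank}')
    my_VBench.evaluate(
        videos_path='./',
        name=f'VideoEvaluation_{rank}',
        dimension_list=['temporal_flickering'],
        mode='custom_input'
    )
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 2
    # spawn uses the 'spawn' start method and joins the worker processes for us.
    mp.spawn(evaluate_videos, args=(world_size,), nprocs=world_size)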

HanLiii commented 1 month ago

Hi @NattapolChan,

My evaluation process stops at "start evaluation" when running with multiple GPUs:

vbench evaluate --ngpus=8 --dimension 'motion_smoothness' --videos_path data/demofusion_controlnet.mp4 --mode=custom_input

WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2024-09-28 12:04:28,518 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2024-09-28 12:04:28,520 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 5
2024-09-28 12:04:28,531 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 2
2024-09-28 12:04:28,616 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 3
2024-09-28 12:04:29,361 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 7
2024-09-28 12:04:29,387 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 4
2024-09-28 12:04:29,410 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 6
2024-09-28 12:04:29,412 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2024-09-28 12:04:29,412 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
2024-09-28 12:04:29,413 - torch.distributed.distributed_c10d - INFO - Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
2024-09-28 12:04:29,413 - torch.distributed.distributed_c10d - INFO - Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
2024-09-28 12:04:29,414 - torch.distributed.distributed_c10d - INFO - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
2024-09-28 12:04:29,414 - torch.distributed.distributed_c10d - INFO - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
2024-09-28 12:04:29,418 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
2024-09-28 12:04:29,418 - torch.distributed.distributed_c10d - INFO - Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
2024-09-28 12:04:29,421 - torch.distributed.distributed_c10d - INFO - Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
args: Namespace(output_path='./evaluation_results/', full_json_dir='/data/hli358/envs/vbench/lib/python3.10/site-packages/vbench/cli/../VBench_full_info.json', videos_path='data/demofusion_controlnet.mp4', dimension=['motion_smoothness'], load_ckpt_from_local=None, read_frame=None, mode='custom_input', prompt='None', prompt_file=None, category=None, imaging_quality_preprocessing_mode='longer')
start evaluation

NattapolChan commented 1 month ago

Can you try running these:

  1. vbench evaluate --ngpus=1 --dimension 'motion_smoothness' --videos_path data/demofusion_controlnet.mp4 --mode=custom_input
  2. vbench evaluate --ngpus=8 --dimension 'temporal_flickering' --videos_path data/demofusion_controlnet.mp4 --mode=custom_input

The ngpus flag distributes the videos across multiple GPUs, so if there is only one video, only 1 GPU will be used for inference; the rest will not be allocated any videos.
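
In other words (an illustration only, not necessarily VBench's exact sharding logic), if the videos are split round-robin across ranks, a single video leaves every rank except rank 0 with an empty shard and nothing to run:

# Hypothetical round-robin sharding across ranks, for illustration only.
videos = ['data/demofusion_controlnet.mp4']  # only one video, as in the command above
world_size = 8

for rank in range(world_size):
    shard = videos[rank::world_size]  # rank 0 gets the video; ranks 1-7 get an empty list
    print(rank, shard)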

HanLiii commented 1 month ago

Can you try running these:

  1. vbench evaluate --ngpus=1 --dimension 'motion_smoothness' --videos_path data/demofusion_controlnet.mp4 --mode=custom_input
  2. vbench evaluate --ngpus=8 --dimension 'temporal_flickering' --videos_path data/demofusion_controlnet.mp4 --mode=custom_input

For the ngpus flag, it distributes the videos across multiple GPUs, so if there is only one video, only 1 GPU will be used for inference; the rest will not be allocated any videos.

I created a 'duck' folder containing 5 videos.

  1. torchrun --nproc_per_node=1 --standalone evaluate.py --dimension 'subject_consistency' 'background_consistency' 'motion_smoothness' 'dynamic_degree' 'aesthetic_quality' 'imaging_quality' --videos_path duck/ --mode=custom_input
     This command runs the evaluation program smoothly.
  2. torchrun --nproc_per_node=2 --standalone evaluate.py --dimension 'subject_consistency' 'background_consistency' 'motion_smoothness' 'dynamic_degree' 'aesthetic_quality' 'imaging_quality' --videos_path duck/ --mode=custom_input
     This command gets stuck at start evaluation. Following is the log:

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


2024-09-30 14:21:29,921 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2024-09-30 14:21:29,932 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2024-09-30 14:21:29,932 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2024-09-30 14:21:29,932 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
args: Namespace(output_path='./evaluation_results/', full_json_dir='/data/hli358/han/VBench/vbench/VBench_full_info.json', videos_path='duck/', dimension=['subject_consistency', 'background_consistency', 'motion_smoothness', 'dynamic_degree', 'aesthetic_quality', 'imaging_quality'], load_ckpt_from_local=None, read_frame=None, mode='custom_input', prompt='None', prompt_file=None, category=None, imaging_quality_preprocessing_mode='longer')
start evaluation

NattapolChan commented 1 month ago

I have tried quite a few configurations, but cannot reproduce the hang at start evaluation. The command I used is the same, and the test folder also contains 5 videos. Here is the log:

2024-10-02 02:10:27,722 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2024-10-02 02:10:27,722 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2024-10-02 02:10:27,722 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2024-10-02 02:10:27,722 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
args: Namespace(output_path='./evaluation_results/', full_json_dir='/scratch/users/nattapol/VBench/vbench/VBench_full_info.json', videos_path='../video_mp4/', dimension=['subject_consistency', 'background_consistency', 'motion_smoothness', 'dynamic_degree', 'aesthetic_quality', 'imaging_quality'], load_ckpt_from_local=None, read_frame=None, mode='custom_input', prompt='None', prompt_file=None, category=None, imaging_quality_preprocessing_mode='longer')
start evaluation
Evaluation meta data saved to ./evaluation_results/results_2024-10-02-02:10:27_full_info.json
cur_full_info_path: ./evaluation_results/results_2024-10-02-02:10:27_full_info.json
Using cache found in /scratch/users/nattapol/.cache/torch/hub/facebookresearch_dino_main
Using cache found in /scratch/users/nattapol/.cache/torch/hub/facebookresearch_dino_main
2024-10-02 02:10:44,195 - vbench.subject_consistency - INFO - Initialize DINO success
2024-10-02 02:10:44,207 - vbench.subject_consistency - INFO - Initialize DINO success
100%|██████████| 2/2 [00:32<00:00, 16.27s/it]
cur_full_info_path: ./evaluation_results/results_2024-10-02-02:10:27_full_info.json
100%|██████████| 2/2 [00:13<00:00,  6.94s/it]
... 

Could you try with only the temporal_flickering dimension, just to make sure it's not stuck during initialization?
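
If it still hangs, it may also help to rerun with PyTorch's distributed debug logging enabled to see which rank or collective stalls. These are standard PyTorch/NCCL environment variables, not VBench-specific options, for example:

NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node=2 --standalone evaluate.py --dimension 'temporal_flickering' --videos_path duck/ --mode=custom_input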

HanLiii commented 1 month ago

I tried torchrun --nproc_per_node=2 --standalone evaluate.py --dimension 'temporal_flickering' --videos_path duck/ --mode=custom_input, and it also gets stuck at start evaluation. Following is the log:

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


2024-10-02 12:01:39,292 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2024-10-02 12:01:39,306 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2024-10-02 12:01:39,306 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2024-10-02 12:01:39,312 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
args: Namespace(output_path='./evaluation_results/', full_json_dir='/data/hli358/han/VBench/vbench/VBench_full_info.json', videos_path='duck/', dimension=['temporal_flickering'], load_ckpt_from_local=None, read_frame=None, mode='custom_input', prompt='None', prompt_file=None, category=None, imaging_quality_preprocessing_mode='longer')
start evaluation