DanBigioi / DiffusionVideoEditing

Official project repo for paper "Speech Driven Video Editing via an Audio-Conditioned Diffusion Model"
MIT License

Check failed: res == 0 (11 vs. 0) pthread_create failed #4

Closed Li-Jicheng closed 1 year ago

Li-Jicheng commented 1 year ago

Hi there,

I was trying to train the model and came across this error. My server has 4× V100 GPUs, the same setup reported in your manuscript. Could you help me find the cause?

Thank you!

(DVE) jicheng@lambda-server:~/DiffusionVideoEditing$ python run.py -c config/audio_talking_heads.json -p train
export CUDA_VISIBLE_DEVICES=0,1,2,3
using GPU 0 for training
using GPU 3 for training
using GPU 2 for training
using GPU 1 for training
/home/jicheng/DiffusionVideoEditing/run.py:28: UserWarning: You have chosen to use cudnn for accleration. torch.backends.cudnn.enabled=True
  warnings.warn('You have chosen to use cudnn for accleration. torch.backends.cudnn.enabled=True')
<core.logger.InfoLogger object at 0x7f34f7f70760>
[... the same UserWarning is printed by the other three processes, followed by InfoLogger objects at 0x7f8152b19760, 0x7f5059173760 and 0x7f9b0f919760 ...]
done1
done1
done1
done1
done2
done2
done2
done2
hello there
hello there
hello there
hello there
hello there
hello there
hello there
hello there
done3
  0%| | 0/2690 [00:00<?, ?it/s]
done3
done3
  0%| | 0/2690 [00:00<?, ?it/s]
done3
  0%| | 0/2690 [00:00<?, ?it/s]
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
[... this INFO line repeats many more times ...]
WARNING: Logging before InitGoogleLogging() is written to STDERR
F20230801 21:50:32.888828 10138 threadpool_pthread_impl.cc:51] Check failed: res == 0 (11 vs. 0) pthread_create failed
Check failure stack trace:
[... the same three lines repeat for other threads, differing only in timestamp and ID ...]
WARNING: Logging before InitGoogleLogging() is written to STDERR
F20230801 21:50:32.906440 8967 threadpool_pthread_impl.cc:51] Check failed: res == 0 (11 vs. 0) pthread_create failed
Check failure stack trace:
[... the same three lines repeat for other threads, differing only in timestamp and ID ...]
  0%| | 0/2690 [01:29<?, ?it/s]
  0%| | 0/2690 [01:29<?, ?it/s]
WARNING: Logging before InitGoogleLogging() is written to STDERR
F20230801 21:50:33.523406 5929 threadpool_pthread_impl.cc:51] Check failed: res == 0 (11 vs. 0) pthread_create failed
Check failure stack trace:
  0%| | 0/2690 [01:30<?, ?it/s]
  0%| | 0/2690 [01:30<?, ?it/s]
Traceback (most recent call last):
  File "run.py", line 96, in <module>
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, opt))
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 8967) is killed by signal: Aborted.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jicheng/DiffusionVideoEditing/run.py", line 63, in main_worker
    model.train()
  File "/home/jicheng/DiffusionVideoEditing/core/base_model.py", line 45, in train
    train_log = self.train_step()
  File "/home/jicheng/DiffusionVideoEditing/models/model.py", line 126, in train_step
    for train_data in tqdm.tqdm(self.phase_loader):
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1163, in _get_data
    success, data = self._try_get_data()
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1024, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 8967) exited unexpectedly

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jicheng/anaconda3/envs/DVE/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/jicheng/DiffusionVideoEditing/run.py", line 69, in main_worker
    phase_writer.close()
  File "/home/jicheng/DiffusionVideoEditing/core/logger.py", line 118, in close
    self.writer.close()
AttributeError: 'NoneType' object has no attribute 'close'

DanBigioi commented 1 year ago

Hello!

This looks like a problem with mediapipe. Could you run the following for me and double check this works:

import cv2
import mediapipe as mp
import numpy as np

# FaceMesh solution module; the snippet uses this name, so it must be defined.
mp_face_mesh = mp.solutions.face_mesh


def contour_extractor(path_to_img):
    # Indices of the 68 MediaPipe face-mesh points used as facial landmarks.
    landmark_points_68 = [162, 234, 93, 58, 172, 136, 149, 148, 152, 377, 378, 365, 397, 288, 323, 454, 389, 71,
                          63, 105, 66, 107, 336,
                          296, 334, 293, 301, 168, 197, 5, 4, 75, 97, 2, 326, 305, 33, 160, 158, 133, 153, 144,
                          362, 385, 387, 263, 373,
                          380, 61, 39, 37, 0, 267, 269, 291, 405, 314, 17, 84, 181, 78, 82, 13, 312, 308, 317,
                          14, 87]
    image_list = [path_to_img]
    with mp_face_mesh.FaceMesh(
            static_image_mode=True,
            max_num_faces=1,
            refine_landmarks=True,
            min_detection_confidence=0.1) as face_mesh:
        for file in image_list:
            image = cv2.imread(file)
            # Convert the BGR image to RGB before processing.
            results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
            if not results.multi_face_landmarks:
                # No face detected: fall back to all-zero landmarks.
                frame_landmark_list = np.zeros((468, 3))
            else:
                for face_landmarks in results.multi_face_landmarks:
                    frame_landmark_list = []
                    for i in range(0, 468):
                        pt1 = face_landmarks.landmark[i]
                        frame_landmark_list.append([pt1.x, pt1.y, pt1.z])
                    frame_landmark_list = np.asarray(frame_landmark_list)

            # Keep only the 68 selected points.
            landmarks_extracted = np.asarray(frame_landmark_list[landmark_points_68])

            # Scale the normalised coordinates to a 128x128 image.
            landmarks_extracted[:, 0] = landmarks_extracted[:, 0] * 128
            landmarks_extracted[:, 1] = landmarks_extracted[:, 1] * 128
            landmarks_extracted[:, 2] = landmarks_extracted[:, 2] * -128

            # Drop the depth coordinate.
            landmarks_extracted = landmarks_extracted[:, :2]

            landmark_list = []
            for items in landmarks_extracted:
                point = [int(items[0]), int(items[1])]
                landmark_list.append(point)

    return landmark_list


# Set path_to_your_image to the path of an image on your machine.
test_landmarks = contour_extractor(path_to_your_image)

Li-Jicheng commented 1 year ago

Hi DanBigioi,

Thank you for your prompt reply! I think Mediapipe works fine, since I'm able to print out the landmarks. Personally, I would assume it's a parallel training issue. I saw some similar issues discussed here, though I'm not sure whether they are related to our project.

https://github.com/tensorflow/tensorflow/issues/41532

Honestly, I'm no expert on parallel training. When I asked ChatGPT, it suggested this: "Keep in mind that torch.multiprocessing.spawn is mainly used for distributed training or parallel computation. If you just want to use multiple GPUs for a single task, consider using torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel instead." I tried to fix the code using torch.nn.DataParallel but did not manage to.

Hopefully this provides some more details. Once again, thanks for your help; I look forward to your feedback :)

(DVE) jicheng@lambda-server:~/DiffusionVideoEditing$ python test_mp.py
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
[[26, 52], [26, 67], [27, 75], [32, 91], [36, 97], [40, 101], [50, 108], [60, 112], [67, 112], [73, 111], [82, 106], [90, 98], [93, 94], [96, 87], [99, 70], [99, 63], [97, 48], [30, 47], [36, 45], [42, 43], [48, 43], [56, 43], [70, 43], [76, 42], [83, 41], [88, 43], [93, 43], [63, 51], [64, 57], [64, 63], [65, 67], [59, 73], [61, 74], [65, 74], [68, 73], [70, 72], [41, 54], [44, 52], [49, 51], [55, 55], [50, 56], [45, 56], [72, 53], [77, 50], [82, 50], [86, 52], [82, 54], [77, 54], [57, 89], [59, 83], [62, 81], [65, 82], [68, 81], [71, 83], [73, 89], [72, 94], [69, 96], [66, 97], [62, 97], [59, 95], [58, 89], [63, 85], [65, 85], [67, 85], [71, 88], [68, 91], [66, 92], [63, 91]]

DanBigioi commented 1 year ago

Hmm, I see. Let me try to investigate it further. Could you send me your config details? Also, since you're running this on a V100, did you use Docker to set up the environment?

Do you know whether the model trains on 1 GPU? And does the error only occur with 2 or more?

The reason I thought it might be mediapipe is that the error is occurring within mediapipe, or at least that is what is emitting the message: "WARNING: Logging before InitGoogleLogging() is written to STDERR F20230801 21:50:33.504900 1417 threadpool_pthread_impl.cc:51] Check failed: res == 0 (11 vs. 0) pthread_create failed". That particular message is also associated with TensorFlow, which mediapipe relies on under the hood.

Check this: https://github.com/google/mediapipe/issues/1471. It explains that the number 11 returned by pthread_create means "The system lacked the necessary resources to create another thread, or the system-imposed limit on the total number of threads in a process PTHREAD_THREADS_MAX would be exceeded."
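On Linux, the value 11 corresponds to the errno constant EAGAIN. The limits that can trigger it can be inspected from Python; the following is a minimal sketch, assuming a Unix-like host where the standard `resource` module is available. The helper name `describe_thread_limits` is hypothetical and not part of this repo.

```python
import errno
import os
import resource


def describe_thread_limits():
    """Collect the process/thread limits relevant to pthread_create failures."""
    # On Linux, threads count towards the per-user RLIMIT_NPROC limit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    info = {
        "eagain_code": errno.EAGAIN,
        "cpu_count": os.cpu_count(),
        "nproc_soft": soft,
        "nproc_hard": hard,
    }
    # The system-wide thread cap is exposed under /proc on Linux only.
    threads_max_path = "/proc/sys/kernel/threads-max"
    if os.path.exists(threads_max_path):
        with open(threads_max_path) as fh:
            info["threads_max"] = int(fh.read().strip())
    return info


if __name__ == "__main__":
    for key, value in describe_thread_limits().items():
        print(f"{key}: {value}")
```

If the soft limit is small relative to the number of DataLoader workers multiplied by the threads each one starts, that would be consistent with the failure above, though this sketch alone does not prove it.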

Check this too: https://github.com/google/mediapipe/issues/2810

They both suggest it's a memory issue. I looked into my dataloader code, and I think it should be OK, so it could be a problem of too large a batch size and too many workers. If you're using a Docker container to run everything, your shared memory might be too small (by default it's 64 MB, and I recommend setting it to something bigger, such as 8 GB or however much you want). For now though, try setting num workers to 0 and batch size to 1 so we can try to debug it.
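To see how much shared memory is actually available, the size of the shared-memory mount can be read directly. This is a minimal sketch assuming the usual Linux mount point `/dev/shm`; the helper name `shared_memory_gib` is hypothetical.

```python
import os
import shutil


def shared_memory_gib(path="/dev/shm"):
    """Return the total size of the shared-memory mount in GiB, or None if absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total / 2**30


if __name__ == "__main__":
    size = shared_memory_gib()
    if size is None:
        print("no /dev/shm mount found")
    else:
        print(f"/dev/shm total: {size:.2f} GiB")
```

Inside a Docker container started with default settings this would be expected to report roughly 0.06 GiB; the `--shm-size` option of `docker run` (for example `--shm-size=8g`) raises it.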

I'm not a huge multiprocessing expert either. I followed the code of https://github.com/Janspiry/Palette-Image-to-Image-Diffusion-Models to help me with that, and it's been working fine for me.

DanBigioi commented 1 year ago

Actually, an easy way to test this is to do the following:

Within your dataloader code, replace the function get_mask

def get_mask(self, path):
    landmark_list = contour_extractor(path)
    mask = face_mask_square(self.image_size, landmark_list)

with:

def get_mask(self, path):
    mask = bbox2mask(self.image_size, random_bbox())

This is just so we can see whether you are still running out of memory. If you are, a different but similar error message should appear.

Li-Jicheng commented 1 year ago

Thank you for your quick reply! I think I have most likely fixed it by setting a lower num_of_workers for the data loader in the configuration file. It was originally set to 40; I have set it to 12 for now. It should work now, and I'm still trying to finish training from scratch. I will mark this as closed if there are no follow-up questions. Anyway, thank you so much for your help :)
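The effect of that change can be sketched as follows. The example assumes that each of the 4 spawned training processes builds its own DataLoader with the configured worker count, and that the config is a nested dictionary loaded from the JSON file. The helper names and the config layout shown are hypothetical, and the real file may use a different key or structure.

```python
import copy


def set_num_workers(config, value):
    """Return a copy of a nested config with every 'num_workers' entry set to value."""
    def walk(node):
        if isinstance(node, dict):
            for key, child in node.items():
                if key == "num_workers":
                    node[key] = value
                else:
                    walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)

    updated = copy.deepcopy(config)
    walk(updated)
    return updated


def total_loader_workers(num_processes, num_workers):
    """Worker processes alive at once if each training process owns one DataLoader."""
    return num_processes * num_workers


# Hypothetical config fragment; the real file's layout may differ.
example = {"datasets": {"train": {"dataloader": {"args": {"batch_size": 4, "num_workers": 40}}}}}
patched = set_num_workers(example, 12)

# 4 GPU processes: 40 workers each versus 12 workers each.
print(total_loader_workers(4, 40), "->", total_loader_workers(4, 12))  # prints: 160 -> 48
```

Under these assumptions, the original setting would start 160 worker processes, each of which may create its own mediapipe thread pool, which is consistent with the thread-creation limit being hit.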

DanBigioi commented 1 year ago

No problem!

minhkhoi1026 commented 4 months ago

Hi, can you show me how to lower the number of workers in the data loader?