Getting CUDA error: initialization error during training

rra94 commented 4 years ago

Traceback of the error below:

Traceback (most recent call last):
  File "CainGAN/train.py", line 103, in <module>
    main()
  File "CainGAN/train.py", line 99, in main
    train()
  File "CainGAN/train.py", line 42, in train
    for i_batch, (frames, marks, i) in enumerate(dataLoader, start=staring_point):
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/content/CainGAN/dataset/vid_dataset.py", line 57, in __getitem__
    for frame in frames])
  File "/content/CainGAN/dataset/vid_dataset.py", line 57, in <listcomp>
    for frame in frames])
  File "/content/CainGAN/dataset/video_extractor.py", line 117, in plot_landmarks
    fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, flip_input=False, device=device)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/face_alignment/api.py", line 69, in __init__
    self.face_detector = face_detector_module.FaceDetector(device=device, verbose=verbose)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/face_alignment/detection/sfd/sfd_detector.py", line 28, in __init__
    self.face_detector.to(device)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 199, in _apply
    param.data = fn(param.data)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 384, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: initialization error

TArdelean commented 4 years ago

Hi! This happens because you cannot start another CUDA context inside the DataLoader worker processes. You can do one of the following:

1. Run the face alignment network on the CPU.
2. Set num_workers to 0, so that everything runs in the main process.
3. Precompute the landmarks on the GPU (recommended), so that you don't have to run 2DFA during training at all.
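
As a rough illustration of the precompute option, here is a minimal sketch of the idea: run the detector once in the main process, cache the results on disk, and let the dataset do only a cheap file read. This is not the repository's actual script. The names `precompute_landmarks`, `load_landmarks`, `detect`, `frame_ids`, and `cache_dir`, as well as the one-JSON-file-per-frame layout, are assumptions made for this example.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence

# One [x, y] pair per facial landmark (illustrative representation).
Landmarks = List[List[float]]


def precompute_landmarks(
    frame_ids: Sequence[str],
    detect: Callable[[str], Landmarks],
    cache_dir: Path,
) -> int:
    """Run the detector once per frame in the main process and cache the result.

    `detect` is a hypothetical callable standing in for the GPU landmark
    detector. Returns the number of frames newly written to the cache.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for frame_id in frame_ids:
        out_path = cache_dir / f"{frame_id}.json"
        if out_path.exists():
            continue  # already cached, so skip the expensive detector call
        out_path.write_text(json.dumps(detect(frame_id)))
        written += 1
    return written


def load_landmarks(frame_id: str, cache_dir: Path) -> Landmarks:
    """Read cached landmarks; no CUDA involved, so workers can call this."""
    return json.loads((cache_dir / f"{frame_id}.json").read_text())
```

With this split, a dataset's `__getitem__` would call only `load_landmarks`, so no worker process ever needs to touch the GPU.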

rra94 commented 4 years ago

Thanks for the quick response. Do you provide code for precomputing the landmarks, or do I have to do that manually?

kenoharada commented 4 years ago

In the dataset directory (https://github.com/TArdelean/CainGAN/tree/master/dataset), I saw the landmark detection process in video_extractor.py. 👍

TArdelean commented 4 years ago

Indeed, the functionality for landmark extraction is in video_extractor.py. For ease of use, I have just updated the repository with a script that precomputes landmarks: extract_landmarks.py

rra94 commented 4 years ago

Thanks! I'll try it out this weekend and close this issue.

rra94 commented 4 years ago

Hi,

I was able to precompute the landmarks and added the landmarks directory as --landmark_root .

Now I get this error. Do I have to do something else as well?

Total number of parameters: 5535618
Summary:
Encoder - 3400994
Generator - 54556603
Discriminators - 11071236
Total - 69028833
Start training from epoch 0
computing lands
Asking for 9 frames out of 0; using replace mode
computing lands
Asking for 9 frames out of 0; using replace mode
computing lands
Asking for 9 frames out of 0; using replace mode
Traceback (most recent call last):
  File "CainGAN/train.py", line 103, in <module>
    main()
  File "CainGAN/train.py", line 99, in main
    train()
  File "CainGAN/train.py", line 42, in train
    for i_batch, (frames, marks, i) in enumerate(dataLoader, start=staring_point):
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/content/CainGAN/dataset/vid_dataset.py", line 55, in __getitem__
    frames = select_frames(self.video_paths[index], self.K)
  File "/content/CainGAN/dataset/video_extractor.py", line 92, in select_frames
    frame_idxs = sample_frames(length, K, mandatory=mandatory)
  File "/content/CainGAN/dataset/video_extractor.py", line 20, in sample_frames
    sampled = np.random.choice(options, K, replace=True)
  File "mtrand.pyx", line 1125, in mtrand.RandomState.choice
ValueError: 'a' cannot be empty unless no samples are taken

TArdelean commented 4 years ago

From your logs:

computing lands -> This means that the precomputed landmark path was not actually found. Check again that the paths are configured correctly. The first log line should tell you whether they were loaded properly: print(f"Preprocessed landmarks {pres} out of {len(video_paths)}")

Asking for 9 frames out of 0; using replace mode -> This means you also have problems loading the videos. Make sure the paths are correct.
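
If it helps, both path problems can be checked before training with something like the sketch below. It is only an illustration: the function name `check_dataset_paths`, the `.mp4` and `.npy` extensions, and the assumption that the landmark directory mirrors the video directory layout are all guesses and may not match how the repository actually stores its files.

```python
from pathlib import Path
from typing import Dict, List


def check_dataset_paths(
    video_root: Path,
    landmark_root: Path,
    video_ext: str = ".mp4",      # assumed video extension
    landmark_ext: str = ".npy",   # assumed landmark file extension
) -> Dict[str, List[Path]]:
    """Report videos that are empty or have no matching landmark file.

    Assumes (hypothetically) that landmark files mirror the relative path
    of each video, with only the extension changed.
    """
    videos = sorted(video_root.rglob(f"*{video_ext}")) if video_root.is_dir() else []
    missing_landmarks = []
    empty_videos = []
    for video in videos:
        relative = video.relative_to(video_root).with_suffix(landmark_ext)
        if not (landmark_root / relative).is_file():
            missing_landmarks.append(video)
        if video.stat().st_size == 0:
            empty_videos.append(video)  # a zero-byte file cannot yield any frames
    return {
        "videos": videos,
        "missing_landmarks": missing_landmarks,
        "empty_videos": empty_videos,
    }
```

An empty "videos" list would point to a wrong video root, and a non-empty "missing_landmarks" list would be consistent with the "computing lands" messages in the log.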

TArdelean / CainGAN

Getting CUDA error: initialization error during training #1
