TArdelean / CainGAN


Getting CUDA error: initialization error during training #1

Open rra94 opened 4 years ago

rra94 commented 4 years ago

Traceback of the error below:

Traceback (most recent call last):
  File "CainGAN/train.py", line 103, in <module>
    main()
  File "CainGAN/train.py", line 99, in main
    train()
  File "CainGAN/train.py", line 42, in train
    for i_batch, (frames, marks, i) in enumerate(dataLoader, start=staring_point):
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/content/CainGAN/dataset/vid_dataset.py", line 57, in __getitem__
    for frame in frames])
  File "/content/CainGAN/dataset/vid_dataset.py", line 57, in <listcomp>
    for frame in frames])
  File "/content/CainGAN/dataset/video_extractor.py", line 117, in plot_landmarks
    fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, flip_input=False, device=device)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/face_alignment/api.py", line 69, in __init__
    self.face_detector = face_detector_module.FaceDetector(device=device, verbose=verbose)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/face_alignment/detection/sfd/sfd_detector.py", line 28, in __init__
    self.face_detector.to(device)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 199, in _apply
    param.data = fn(param.data)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/nn/modules/module.py", line 384, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: initialization error

TArdelean commented 4 years ago

Hi! It is because you cannot start another CUDA context inside the dataloader. You can either

rra94 commented 4 years ago

Thanks for the quick response. Do you provide code for precomputing the landmarks, or do I have to do that manually?

kenoharada commented 4 years ago

In the dataset directory (https://github.com/TArdelean/CainGAN/tree/master/dataset), I saw the landmark detection process in video_extractor.py 👍

TArdelean commented 4 years ago

Indeed, the functionality for landmark extraction is in video_extractor.py. For ease of use, I have just updated the repository with a script that precomputes landmarks: extract_landmarks.py
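Precomputing works around the original error because the face detector (and its CUDA context) is created once in the main process, while the DataLoader workers only read files. A rough sketch of that pattern follows, using only the standard library. It is not the repository's extract_landmarks.py: the function and class names, the JSON format, and the one-file-per-video naming rule are all assumptions made for illustration, and `detect_fn` stands in for whatever landmark detector you use.

```python
import json
from pathlib import Path


def landmark_file(video_path, landmark_root):
    """Hypothetical naming rule: <landmark_root>/<video file stem>.json."""
    return Path(landmark_root) / (Path(video_path).stem + ".json")


def precompute_landmarks(video_paths, landmark_root, detect_fn):
    """Run the (possibly GPU-backed) detector once, in the main process.

    detect_fn is an assumed callable: video path -> JSON-serializable landmarks.
    """
    Path(landmark_root).mkdir(parents=True, exist_ok=True)
    for video_path in video_paths:
        out = landmark_file(video_path, landmark_root)
        if out.exists():
            continue  # already computed on an earlier run
        out.write_text(json.dumps(detect_fn(video_path)))


class PrecomputedLandmarkDataset:
    """Dataset-style class whose __getitem__ only reads files (no CUDA calls)."""

    def __init__(self, video_paths, landmark_root):
        self.video_paths = list(video_paths)
        self.landmark_root = landmark_root

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, index):
        path = landmark_file(self.video_paths[index], self.landmark_root)
        return json.loads(path.read_text())
```

Because `__getitem__` here touches only the filesystem, it should be safe to call from worker processes regardless of how they are started.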

rra94 commented 4 years ago

Thanks! I'll try it out this weekend and close this issue.

rra94 commented 4 years ago

Hi,

I was able to precompute the landmarks and passed the landmarks directory as --landmark_root.

Now I get this error. Do I have to do something else as well?

Total number of parameters: 5535618
Summary: Encoder - 3400994 Generator - 54556603 Discriminators - 11071236 Total - 69028833
Start training from epoch 0
computing lands
Asking for 9 frames out of 0; using replace mode
computing lands
Asking for 9 frames out of 0; using replace mode
computing lands
Asking for 9 frames out of 0; using replace mode
Traceback (most recent call last):
  File "CainGAN/train.py", line 103, in <module>
    main()
  File "CainGAN/train.py", line 99, in main
    train()
  File "CainGAN/train.py", line 42, in train
    for i_batch, (frames, marks, i) in enumerate(dataLoader, start=staring_point):
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/envs/fewshot/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/content/CainGAN/dataset/vid_dataset.py", line 55, in __getitem__
    frames = select_frames(self.video_paths[index], self.K)
  File "/content/CainGAN/dataset/video_extractor.py", line 92, in select_frames
    frame_idxs = sample_frames(length, K, mandatory=mandatory)
  File "/content/CainGAN/dataset/video_extractor.py", line 20, in sample_frames
    sampled = np.random.choice(options, K, replace=True)
  File "mtrand.pyx", line 1125, in mtrand.RandomState.choice
ValueError: 'a' cannot be empty unless no samples are taken
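The closing ValueError is NumPy's response to drawing from an empty population, which matches the "out of 0" lines in the log: with no frames to choose from, sampling with replacement cannot succeed. Below is a minimal sketch of a sampler that fails with a clearer message in that case. `sample_frame_indices` is a hypothetical stand-in written with the standard library only; it is not the repository's `sample_frames` and ignores its `mandatory` argument.

```python
import random


def sample_frame_indices(length, k, rng=random):
    """Pick k frame indices from a video with `length` frames.

    Samples without replacement when enough frames exist, and with
    replacement otherwise. Raises early if the video has no frames.
    """
    if length <= 0:
        raise ValueError(
            f"video reports {length} frames; check that the video path "
            "is correct and the file is readable"
        )
    if length >= k:
        return sorted(rng.sample(range(length), k))
    # Fewer frames than requested: fall back to sampling with replacement.
    return sorted(rng.choices(range(length), k=k))
```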

TArdelean commented 4 years ago

From your logs:
computing lands -> This means the precomputed landmark path was not actually found. Check again that the paths are configured correctly. The first log line should tell you whether they were loaded properly: print(f"Preprocessed landmarks {pres} out of {len(video_paths)}")
Asking for 9 frames out of 0; using replace mode -> This means you also have problems loading the videos. Make sure the paths are correct.
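One way to check both paths before starting a training run is a small standalone script like the sketch below. It rests on several assumptions that may not match the repository: the file extensions, the recursive directory layout, and the rule that a landmark file shares its stem with its video are all guesses for illustration, and `summarize_paths` is a hypothetical helper, not part of CainGAN.

```python
from pathlib import Path


def summarize_paths(video_root, landmark_root,
                    video_ext=".mp4", landmark_ext=".npy"):
    """Return (number of videos found, number with a matching landmark file).

    Matching is by file stem, which is an assumed convention.
    """
    video_root, landmark_root = Path(video_root), Path(landmark_root)
    for root in (video_root, landmark_root):
        if not root.is_dir():
            raise FileNotFoundError(f"not a directory: {root}")
    videos = sorted(video_root.rglob(f"*{video_ext}"))
    landmark_stems = {p.stem for p in landmark_root.rglob(f"*{landmark_ext}")}
    matched = [v for v in videos if v.stem in landmark_stems]
    return len(videos), len(matched)
```

If the first number is 0, the video root is probably wrong; if the second is much smaller than the first, the landmark root or naming scheme likely does not line up with the videos.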