Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs: https://synclabs.so

CUDA out of memory error when preprocessing data for training #36

Closed: dipam7 closed this issue 4 years ago

dipam7 commented 4 years ago

I am trying to train a model with my own data. I have the following directory structure:

Wav2Lip
    |____training_data
               |_______*.mp4 files

I've changed the line in preprocess.py from

filelist = glob(path.join(args.data_root, '*/*.mp4'))

to

filelist = glob(path.join(args.data_root, '*.mp4'))

for my directory structure. However, when I run the command given in the readme, I get the following error for every video:

Traceback (most recent call last):
  File "preprocess.py", line 85, in mp_handler
    process_video_file(vfile, args, gpu_id)
  File "preprocess.py", line 59, in process_video_file
    preds = fa[gpu_id].get_detections_for_batch(np.asarray(fb))
  File "/storage/Wav2Lip/face_detection/api.py", line 66, in get_detections_for_batch
    detected_faces = self.face_detector.detect_from_batch(images.copy())
  File "/storage/Wav2Lip/face_detection/detection/sfd/sfd_detector.py", line 42, in detect_from_batch
    bboxlists = batch_detect(self.face_detector, images, device=self.device)
  File "/storage/Wav2Lip/face_detection/detection/sfd/detect.py", line 68, in batch_detect
    olist = net(imgs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/storage/Wav2Lip/face_detection/detection/sfd/net_s3fd.py", line 71, in forward
    h = F.relu(self.conv1_1(x))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 343, in forward
    return self.conv2d_forward(input, self.weight)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 340, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 15.82 GiB (GPU 0; 15.90 GiB total capacity; 847.43 MiB already allocated; 14.48 GiB free; 14.57 MiB cached)

My videos are all 1080p.

I'm using Paperspace with a P5000 GPU, 8 CPUs, and 30 GB of RAM. Can you specify what computing power you used to train, and how I can use what I have available to train my own model? Thanks
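
For reference, a glob that matches both the flat layout above and the nested LRS2-style layout would avoid editing the pattern per dataset. This is just a sketch, not the line shipped in preprocess.py:

```python
from glob import glob
from os import path

# Sketch: pick up .mp4 files directly under data_root as well as one
# directory level below it (the nested layout the repo expects).
filelist = glob(path.join(args.data_root, '*.mp4')) + \
           glob(path.join(args.data_root, '*/*.mp4'))
```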

prajwalkr commented 4 years ago

Please reduce the batch size until the error stops occurring. If the error occurs only once or twice over the whole dataset, that's alright. If it occurs too often, reduce the batch size used while preprocessing.
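
For context, preprocess.py exposes a --batch_size flag that controls how many frames are sent to the face detector at once; that batch is what runs out of GPU memory in the traceback above. A rough sketch of the idea, with names assumed from the script:

```python
# Sketch (names assumed from process_video_file in preprocess.py):
# frames is the list of decoded video frames; args.batch_size bounds how many
# frames are stacked into the array handed to the S3FD face detector, and
# therefore bounds peak GPU memory during detection.
batches = [frames[i:i + args.batch_size]
           for i in range(0, len(frames), args.batch_size)]

for fb in batches:
    preds = fa[gpu_id].get_detections_for_batch(np.asarray(fb))
```

For example, passing --batch_size 4 on the command line would send batches of 4 frames to the detector instead of the default.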

dipam7 commented 4 years ago

I reduced the batch size to 4 and it worked for a few videos. However, for a certain video I just get "Killed". Is it because the video is long and high resolution? I've tried a batch size of 2 as well, but the same thing happens. Why is this happening, and do you have any suggestions for overcoming it? Thanks

prajwalkr commented 4 years ago

To start with, ensure you are training on a face resolution of 96x96 only. Also, ensure the temporal window is only 5 frames.
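
These correspond to the repo's default settings; roughly, the values the reply refers to look like the following (the exact locations of these constants in hparams.py and the training scripts are assumed here):

```python
# Sketch of the defaults being referred to (locations assumed):
img_size = 96    # faces are resized to 96x96 crops for training
syncnet_T = 5    # temporal window: 5 consecutive frames per training sample
```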

dipam7 commented 4 years ago

Hey, I haven't reached the training stage yet. I am still preprocessing the data. Do I have to ensure the things that you mentioned for pre-processing as well? If yes, how do I do that?

prajwalkr commented 4 years ago

No, you can just preprocess with a lower batch size to avoid memory errors.

dipam7 commented 4 years ago

I'm already using a batch size of 2. Is it possible that this is because my videos are long (> 5 minutes) and high-res (1080p)? Should I break them down into smaller chunks?

prajwalkr commented 4 years ago

No, I do not think long videos are a reason for the GPU memory error. A batch size of 2 should work. There must be some other mistake due to which it is not working.