BadToBest / EchoMimic

Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning
https://badtobest.github.io/echomimic.html
Apache License 2.0
2.26k stars, 263 forks

Is it possible to batch process target mask .pkl files to reduce VRAM usage? #72

Closed: Sidd065 closed this issue 1 month ago

Sidd065 commented 1 month ago

I am using a driver video with ~2000 frames. Currently, infer_audio2vid_pose_acc.py loads all 2000 .pkl files in pose_dir at once, and when face_locator runs I get torch.cuda.OutOfMemoryError: CUDA out of memory. Is it possible to process the .pkl files in batches and concatenate the resulting videos, to reduce the VRAM required for long videos?

# Every landmark file is rendered and moved to the GPU before pipe() is called.
for index in range(len(os.listdir(pose_dir))):
    tgt_musk_path = os.path.join(pose_dir, f"{index}.pkl")

    with open(tgt_musk_path, "rb") as f:
        tgt_kpts = pickle.load(f)
    tgt_musk = visualizer.draw_landmarks((args.W, args.H), tgt_kpts)
    tgt_musk_pil = Image.fromarray(np.array(tgt_musk).astype(np.uint8)).convert('RGB')
    pose_list.append(torch.Tensor(np.array(tgt_musk_pil)).to(dtype=weight_dtype, device="cuda").permute(2,0,1) / 255.0)
# Stack the per-frame masks into one tensor of shape (1, C, num_frames, H, W).
face_mask_tensor = torch.stack(pose_list, dim=1).unsqueeze(0)

video = pipe(
    ref_image_pil,
    audio_path,
    face_mask_tensor,
    width,
    height,
    args.L,
    args.steps,
    args.cfg,
    generator=generator,
    audio_sample_rate=args.sample_rate,
    context_frames=12,
    fps=final_fps,
    context_overlap=3
).videos

final_length = min(video.shape[2], face_mask_tensor.shape[2])
video = torch.cat([video[:, :, :final_length, :, :], face_mask_tensor[:, :, :final_length, :, :].detach().cpu()], dim=-1)

# Inside the pipeline, this is the call where the out-of-memory error is raised:
face_locator_tensor = self.face_locator(face_mask_tensor)

JoeFannie commented 1 month ago

It is possible to do that. Here are some suggestions: (1) Split the 2000 frames into smaller subsets, for instance 250 frames per subset. (2) Initialize the variables one subset at a time, such as the landmarks, latents, and audio. (3) Call pipe on each subset. (4) Concatenate the per-subset results to form the final result.
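
The four steps above can be sketched roughly as follows. This is only an illustration of the bookkeeping, not code from the EchoMimic repository: frame_chunks, audio_span, process_in_chunks, and the run_chunk callback are hypothetical names, and the fps and sample rate values in the usage note are placeholders. The sketch assumes each subset can be rendered independently, so frames near a subset boundary may not blend as smoothly as within a single pipe call.

```python
from typing import Callable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def frame_chunks(num_frames: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) frame ranges that together cover num_frames."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_frames, chunk_size):
        # The last subset may be shorter than chunk_size.
        yield start, min(start + chunk_size, num_frames)


def audio_span(start: int, end: int, fps: float, sample_rate: int) -> Tuple[int, int]:
    """Map a frame range to the audio sample range that plays during it."""
    return round(start * sample_rate / fps), round(end * sample_rate / fps)


def process_in_chunks(
    num_frames: int,
    chunk_size: int,
    run_chunk: Callable[[int, int], T],
) -> List[T]:
    """Call run_chunk(start, end) once per subset and collect results in order.

    run_chunk is a hypothetical callback. In this thread's setting it would
    load only the .pkl files for frames [start, end), build the mask tensor
    for that subset, call pipe on it with the matching audio segment, and
    return the result moved to the CPU.
    """
    return [run_chunk(start, end) for start, end in frame_chunks(num_frames, chunk_size)]
```

For 2000 frames and a subset size of 250, list(frame_chunks(2000, 250)) gives eight ranges from (0, 250) to (1750, 2000). With a placeholder 25 fps and 16000 Hz audio, audio_span(250, 500, 25, 16000) gives the sample range for the second subset. Since the script passes audio_path to pipe, each audio segment would probably need to be written to its own file first. The per-subset videos could then be joined with torch.cat along the frame dimension (dim=2, judging by the video.shape[2] usage in the script above), and calling torch.cuda.empty_cache() between subsets may help release cached GPU memory.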