IFICL / SLfM

Official code for the paper: [ICCV2023] Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation
https://ificl.github.io/SLfM/
MIT License

Demo code #4

Closed 1390806607 closed 10 months ago

1390806607 commented 10 months ago

Hello, could you send me the demo code for the real-world demo shown on https://ificl.github.io/SLfM/ so I can have a look?

IFICL commented 10 months ago

We implemented the demo in a naive way using matplotlib, and we don't plan to share this part of the code. However, the demo generation is based on another project: https://github.com/IFICL/stereocrw/blob/master/vis_scripts/vis_video_itd.py .
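
For illustration only (this is not the released demo code), a minimal sketch of how per-frame sound/camera angle predictions could be drawn with matplotlib next to each frame; the function name, file paths, and angle values are placeholders:

import numpy as np
import matplotlib
matplotlib.use('Agg')  # render offscreen
import matplotlib.pyplot as plt
from PIL import Image

def render_prediction(frame_path, sound_angle_deg, camera_angle_deg, out_path):
    # left panel: the video frame; right panel: predicted directions on a unit circle
    frame = np.array(Image.open(frame_path))
    fig, (ax_img, ax_dir) = plt.subplots(1, 2, figsize=(10, 4))
    ax_img.imshow(frame)
    ax_img.axis('off')
    ax_dir.set_xlim(-1.2, 1.2)
    ax_dir.set_ylim(-1.2, 1.2)
    ax_dir.set_aspect('equal')
    ax_dir.axis('off')
    for angle_deg, color, label in [(sound_angle_deg, 'red', 'sound'),
                                    (camera_angle_deg, 'blue', 'camera')]:
        rad = np.deg2rad(angle_deg)
        ax_dir.plot([0, np.sin(rad)], [0, np.cos(rad)], color=color, label=label)
    ax_dir.legend(loc='lower right')
    fig.savefig(out_path, bbox_inches='tight')
    plt.close(fig)

The rendered frames can then be stitched into a video with a tool such as ffmpeg.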

1390806607 commented 10 months ago

[image attached] Hello, is this picture correct?

IFICL commented 10 months ago

Please see the reply from https://github.com/IFICL/SLfM/issues/5#issuecomment-1823213433. The model is correct.

deBrian07 commented 10 months ago

If possible, could you please briefly explain the logic of the demo code for this project? I checked out https://github.com/IFICL/stereocrw/blob/master/vis_scripts/vis_video_itd.py, and its structure seems very similar to evaluate_angle.py in this project. Could you please share a little bit about the demo code for this project? Thank you so much!

IFICL commented 10 months ago

@1390806607 @deBrian07 The demo code is very simple. You set up the audio model (with a 0.51s audio clip length) and the vision model. In the dataloader, you extract the current frame and the corresponding 0.51s of audio. The vision model takes two images, where we set up keyframes to cumulatively compute the rotation. Here is the demo-video dataloader code:

import csv
import glob
import h5py
import io
import json
import librosa
import numpy as np
import os
import pickle
from PIL import Image
from PIL import ImageFilter
import random
import scipy
import soundfile as sf
import time
from tqdm import tqdm
import cv2

import torch
import torch.nn as nn
import torchaudio
import torchvision.transforms as transforms

import sys
sys.path.append('..')
from data import AudioSFMbaseDataset

import pdb

class SingleVideoDataset(AudioSFMbaseDataset):
    def __init__(self, args, pr, list_sample, split='train'):
        self.pr = pr
        self.args = args
        self.split = split
        self.seed = pr.seed
        self.image_transform = transforms.Compose(self.generate_image_transform(args, pr))

        self.repeat = args.repeat if split == 'train' else 1

        video_path = list_sample
        audio_path = os.path.join(video_path, 'audio', 'audio.wav')
        frame_path = os.path.join(video_path, 'frames')
        meta_path = os.path.join(video_path, 'meta.json')
        with open(meta_path, "r") as f:
            self.meta_dict = json.load(f)

        # audio_sample_rate = meta_dict['audio_sample_rate']
        self.frame_rate = self.meta_dict['frame_rate']
        frame_list = glob.glob(f'{frame_path}/*.jpg')
        frame_list.sort()

        # import pdb; pdb.set_trace()
        self.frame_list = frame_list
        audio, self.audio_rate = self.read_audio(audio_path)
        audio = np.transpose(audio, (1, 0))
        audio = self.normalize_audio(audio, desired_rms=0.1)
        self.audio = torch.from_numpy(audio.copy()).float()
        num_sample = len(self.frame_list)

        # calculate the keyframes:
        if args.keyframe_interval is None:
            args.keyframe_interval = num_sample
        self.keyframe_inds = np.arange(0, num_sample, step=args.keyframe_interval)

        # print('Video Dataloader: # of frames {}: {}'.format(self.split, num_sample))

    def __getitem__(self, index):
        # import pdb; pdb.set_trace()
        audio_length = self.audio.shape[1]
        frame_path = self.frame_list[index]
        start_time = index / self.meta_dict['frame_rate'] - self.pr.clip_length / 2
        audio_rate = self.audio_rate
        clip_length = int(self.pr.clip_length * self.audio_rate)
        audio_start_time = int(start_time * self.audio_rate)
        audio_end_time = audio_start_time + clip_length

        if audio_start_time < 0:
            audio_start_time = 0
            audio_end_time = audio_start_time + clip_length

        if audio_end_time > audio_length:
            audio_end_time = audio_length
            audio_start_time = audio_end_time - clip_length

        img_2 = self.read_image(frame_path)

        audio = self.audio[:, audio_start_time: audio_end_time]

        # determine reference image 
        keyframe_ind = int(index // self.args.keyframe_interval)

        # current index is the keyframe, we set the reference image to previous keyframe
        if index % self.args.keyframe_interval == 0:
            if keyframe_ind != 0:
                keyframe_ind -= 1

        img1_ind = self.keyframe_inds[keyframe_ind]
        img_1 = self.read_image(self.frame_list[img1_ind])

        batch = {
            'img_1': img_1,
            'img_2': img_2,
            'img_path': frame_path,
            'keyframe_ind': keyframe_ind,
            'audio': audio,
        }
        return batch

    def getitem_test(self, index):
        return self.__getitem__(index)

    def __len__(self): 
        return len(self.frame_list)
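
As a usage sketch (not from the repo), the dataset above could be wrapped in a standard DataLoader roughly as follows; `args` and `pr` are assumed to be the option/config objects used throughout this codebase (fields such as `batch_size` and `num_workers` are assumed to exist on `args`), and the video path is a placeholder:

from torch.utils.data import DataLoader

video_path = 'path/to/demo_video'   # placeholder: folder containing audio/, frames/, meta.json
data_set = SingleVideoDataset(args, pr, list_sample=video_path, split='test')
data_loader = DataLoader(
    data_set,
    batch_size=args.batch_size,
    shuffle=False,      # keep frames in temporal order so rotations can be accumulated
    num_workers=args.num_workers,
    drop_last=False,
)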

For the inference code:

# For smoothing the predictions; used inside the visualization code
def smooth_prediction(signal, window_length):
    signal_padding = torch.tensor([signal[-1]] * (window_length - 1))
    signal = torch.tensor(signal)
    signal = torch.cat([signal, signal_padding], dim=0)
    signal = signal.unfold(-1, window_length, 1)
    signal = signal.cpu().numpy()
    signal_mean = signal.mean(-1)
    signal_std = signal.std(-1)
    return signal_mean, signal_std

def predict(args, pr, net_vision, net_audio, batch, device):
    # import pdb; pdb.set_trace()
    inputs = {}
    inputs['img_1'] = batch['img_1'].to(device)
    inputs['img_2'] = batch['img_2'].to(device)
    # rot2theta / logit2angle are helper functions from this repo that convert raw model outputs to angles
    _, camera_angle_pred = net_vision(inputs['img_2'], inputs['img_1'], return_angle=True)
    camera_angle_pred = rot2theta(args, camera_angle_pred) * pr.rotation_correctness

    inputs['audio'] = batch['audio'].to(device)
    _, sound_angle_pred = net_audio(inputs['audio'], return_angle=True)
    sound_angle_pred = logit2angle(args, sound_angle_pred)

    return {
        'camera_pred': camera_angle_pred,
        'sound_pred': sound_angle_pred,
    }

def inference(args, pr, net_vision, net_audio, data_set, data_loader, device='cuda', video_idx=None):
    # import pdb; pdb.set_trace()
    net_vision.eval()
    net_audio.eval()

    img_path_list = []
    camera_preds = []
    sound_preds = []
    keyframe_inds = []

    with torch.no_grad():
        for step, batch in tqdm(enumerate(data_loader), total=len(data_loader), desc="Inference"):
            # import pdb; pdb.set_trace()
            img_paths = batch['img_path']
            keyframe_ind = batch['keyframe_ind']
            out = predict(args, pr, net_vision, net_audio, batch, device)

            camera_preds.append(out['camera_pred'])
            sound_preds.append(out['sound_pred'])
            keyframe_inds.append(keyframe_ind)

            for i in range(len(img_paths)):  # the last batch may be smaller than args.batch_size
                img_path_list.append(img_paths[i])

    # import pdb; pdb.set_trace()

    img_path_list = np.array(img_path_list)
    camera_preds = torch.cat(camera_preds, dim=-1).data.cpu().numpy()
    sound_preds = torch.cat(sound_preds, dim=-1).data.cpu().numpy()
    keyframe_inds = torch.cat(keyframe_inds, dim=-1).data.cpu().numpy()

    keyframe_camera_preds = camera_preds[data_set.keyframe_inds]
    keyframe_camera_preds = np.cumsum(keyframe_camera_preds)
    camera_preds += keyframe_camera_preds[keyframe_inds]
    if args.vis_predict_only:
        visualization_prediction(args, pr, data_set, data_loader, img_path_list, camera_preds, sound_preds, video_idx)
    else:
        visualization_video(args, pr, data_set, data_loader, img_path_list, camera_preds, sound_preds, video_idx)
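
To make the keyframe accumulation at the end of inference concrete, here is a small worked example with made-up numbers (five frames, keyframe_interval = 3, so the keyframes are frames 0 and 3):

import numpy as np

# camera_preds[i] is the predicted rotation of frame i relative to its reference
# image from the dataloader above: frames 0-3 are relative to frame 0, frame 4 to frame 3.
camera_preds = np.array([0.0, 3.0, 5.0, 6.0, 2.0])
keyframe_inds = np.array([0, 0, 0, 0, 1])     # per-frame keyframe index returned by the dataloader
dataset_keyframe_inds = np.array([0, 3])      # data_set.keyframe_inds: frame indices of the keyframes

# Each keyframe's own prediction is relative to the previous keyframe,
# so a cumulative sum turns it into a rotation relative to frame 0.
keyframe_camera_preds = np.cumsum(camera_preds[dataset_keyframe_inds])   # [0., 6.]

# Adding the accumulated keyframe rotation expresses every frame relative to frame 0.
camera_preds = camera_preds + keyframe_camera_preds[keyframe_inds]
# -> [0., 3., 5., 6., 8.]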

I think this code should be more than enough. I won't provide any further demo-related code, to avoid figures with a duplicated style appearing elsewhere.

deBrian07 commented 10 months ago

Thank you so much for the information, it helped a lot.

Could you let me know what you used for the visualization video? Is there a specific dataset that you chose the videos from? Thank you!

IFICL commented 10 months ago

> Thank you so much for the information, it helped a lot.
>
> Could you let me know what you used for the visualization video? Is there a specific dataset that you chose the videos from? Thank you!

Those are self-collected videos using iPhone and binaural mics.

deBrian07 commented 10 months ago

Got it. Do you have any suggestions for binaural mics? Different mics might serve different purposes.

IFICL commented 10 months ago

Since the model is trained on the simulated binaural mic, a real binaural mic will have a domain gap against it. I suggest a binaural mic that matches the human HRTF rather than a stereo mic. I will see if I can upload one or two videos when our servers are back.

deBrian07 commented 10 months ago

Please look at the video attached. I recorded it with my iPhone with the stereo option on. However, neither the video nor the audio prediction makes sense to me. Could you please take a look at it and give any suggestions? Thank you so much!

https://github.com/IFICL/SLfM/assets/94733710/b243d003-e100-4ea1-9cc3-033a73f0bb66

IFICL commented 10 months ago

There are several issues:

  1. First of all, the model is trained on landscape images, so it will not work directly on portrait images. You also need to set keyframe_interval to make it work.
  2. Second, a stereo mic is not a binaural mic and does not match the simulated HRTF. Also, note that when you record video in portrait mode, the two mics are oriented top and bottom, not left and right.

One thing I want to make clear: I recommend trying to debug your issues on your own first before asking me. I will only answer questions about this repo.

deBrian07 commented 10 months ago

Got it, I'll try to debug it myself first, thank you so much!

deBrian07 commented 10 months ago

> Since the model is trained on the simulated binaural mic, a real binaural mic will have a domain gap against it. I suggest a binaural mic that matches the human HRTF rather than a stereo mic. I will see if I can upload one or two videos when our servers are back.

Hello, could you possibly share the binaural videos as soon as you have them? Thank you so much!

IFICL commented 9 months ago

@deBrian07 Hi, we have uploaded the demo videos to this GitHub repo. Please see the README for details. Note: the demo videos are for research purposes only.