Closed junwenxiong closed 3 years ago
Hi, the separation model was trained with a batch size of 128 for 50,000 batches, as specified in the README. Training ran on an 8-GPU machine with V100 GPUs. I have shared some pre-trained models that you can use or test directly without retraining. Hope that helps.
Thank you so much! Because I only have 2 GPUs, training will take longer. I'll just leave it training for a few days!
Hello, were you able to reproduce the training result? I have 8 2080 Ti GPUs and use a batch size of 32. I still have a 2 dB gap from the author's result.
I can reproduce a result nearly the same as the released model with 8 V100 GPUs, but I don't know how to create the test dataset. I just randomly chose some samples from the test directory and obtained SDR = 7.8, versus SDR = 7.3 with the released model.
For the results in the paper, I used the version with context shared here: https://github.com/facebookresearch/VisualVoice/tree/master/av-separation-with-context. I have shared the pre-trained models, which you may use. We randomly sampled 2000 pairs for testing, and I just dug out a script that I used in evaluation. Hope that helps!
#!/usr/bin/env python
import os
import subprocess
import json
import random
import glob
import h5py
import argparse
import pickle


def split(a, n):
    # Split sequence a into n contiguous chunks whose sizes differ by at most one.
    k, m = divmod(len(a), n)
    return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))


#python testBunchSubmit.py --model_root checkpoints/exp6/ --data_root /private/home/rhgao/datasets/VoxCeleb2/seen_heard_test/ --output_dir results/exp6_seen_heard_test --num_of_examples_to_test 2000 --num_of_jobs_to_submit 10
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_root', type=str, required=True)
    parser.add_argument('--data_root', type=str, required=True)
    parser.add_argument('--output_dir', type=str, required=True)
    parser.add_argument('--num_of_examples_to_test', type=int, default=10, help="num of examples to test")
    parser.add_argument('--num_of_jobs_to_submit', type=int, default=1, help="num of jobs to submit")
    parser.add_argument('--job_index_start', type=int, default=0, help="offset added to job indices")
    args = parser.parse_args()

    if not os.path.isdir(args.output_dir):
        os.mkdir(args.output_dir)

    IDs = sorted(os.listdir(os.path.join(args.data_root, 'mp4')))
    random.seed(0)

    # Enumerate all ordered pairs of distinct speaker IDs, then sample from them.
    id_pairs = []
    for i in range(len(IDs)):
        for j in range(len(IDs)):
            if i == j:
                continue
            id_pairs.append((i,j))
    print(len(id_pairs))
    selected_id_pairs = random.sample(id_pairs, args.num_of_examples_to_test)

    cmds2execute = []
    for id_pair in selected_id_pairs:
        i = id_pair[0]
        j = id_pair[1]
        # Pick one random clip per speaker and derive the matching audio and mouth-ROI paths.
        video1_name = random.choice(os.listdir(os.path.join(args.data_root, 'mp4', IDs[i])))
        clip1_name = random.choice(os.listdir(os.path.join(args.data_root, 'mp4', IDs[i], video1_name)))
        video1_path = os.path.join(args.data_root, 'mp4', IDs[i], video1_name, clip1_name[:-4] + '.mp4')
        audio1_path = os.path.join(args.data_root, 'aac', IDs[i], video1_name, clip1_name[:-4] + '.wav')
        mouthroi1_path = os.path.join(args.data_root, 'mouth_roi_hdf5', IDs[i], video1_name, clip1_name[:-4] + '.h5')
        video2_name = random.choice(os.listdir(os.path.join(args.data_root, 'mp4', IDs[j])))
        clip2_name = random.choice(os.listdir(os.path.join(args.data_root, 'mp4', IDs[j], video2_name)))
        video2_path = os.path.join(args.data_root, 'mp4', IDs[j], video2_name, clip2_name[:-4] + '.mp4')
        audio2_path = os.path.join(args.data_root, 'aac', IDs[j], video2_name, clip2_name[:-4] + '.wav')
        mouthroi2_path = os.path.join(args.data_root, 'mouth_roi_hdf5', IDs[j], video2_name, clip2_name[:-4] + '.h5')
        cmd = 'python test.py' + \
            ' --audio1_path ' + audio1_path + \
            ' --audio2_path ' + audio2_path + \
            ' --mouthroi1_path ' + mouthroi1_path + \
            ' --mouthroi2_path ' + mouthroi2_path + \
            ' --video1_path ' + video1_path + \
            ' --video2_path ' + video2_path + \
            ' --num_frames 64 --video_sampling_rate 2 ' + \
            ' --audio_length 2.55 --hop_size 160 --window_size 400 --n_fft 512 ' + \
            ' --weights_lipreadingnet ' + os.path.join(args.model_root, 'lipreading_best.pth') + \
            ' --weights_identity ' + os.path.join(args.model_root, 'identity_best.pth') + \
            ' --weights_unet ' + os.path.join(args.model_root, 'unet_best.pth') + \
            ' --weights_classifier ' + os.path.join(args.model_root, 'classifier_best.pth') + \
            ' --lipreading_config_path configs/lrw_snv1x_tcn2x.json ' + \
            ' --unet_type beta9 --unet_output_nc 2 --normalization --mask_to_use pred ' + \
            ' --visual_feature_type both --identitynet_type resnet18 --voicenet_type resnet18 --identity_feature_dim 128 --audioVisual_feature_dim 1152 --visual_pool maxpool ' + \
            ' --compression_type none --hyperbolic_compression_K 10 --hyperbolic_compression_C 0.1 ' + \
            ' --sigmoidal_compression_a 1 --sigmoidal_compression_b 0 --mask_clip_threshold 5 --hop_length 2.55 ' + \
            ' --reliable_face --lipreading_extract_feature --number_of_identity_frames 1 --output_dir_root ' + os.path.join(args.output_dir)
        cmd = cmd + ' ; python evaluateSeparation.py --results_dir ' + os.path.join(args.output_dir, IDs[i]+'_'+video1_name+'_'+clip1_name[:-4]+'VS'+IDs[j]+'_'+video2_name+'_'+clip2_name[:-4])
        cmd = cmd + '\n'
        cmds2execute.append(cmd)

    # Divide the commands into one shell script per job and submit each with sbatch.
    job_splits = list(split(range(len(cmds2execute)), args.num_of_jobs_to_submit))
    count = 0
    for sub_split in job_splits:
        count = count + 1
        print(count)
        script2exe = open('slurm_script/' + str(args.job_index_start + count) + '.sh', 'w')
        script2exe.write('#!/bin/bash\n')
        script2exe.write('export CUDA_VISIBLE_DEVICES=0\n')
        for i in sub_split:
            script2exe.write(cmds2execute[i])
        script2exe.close()
        cmd = 'chmod a+x \'slurm_script/' + str(args.job_index_start + count) + '.sh\''
        subprocess.call(cmd, shell=True)

        # generate slurm submit file
        slurm_file = open('submit.slurm','w')
        slurm_file.write('#!/bin/bash\n')
        slurm_file.write('#SBATCH --job-name=' + 'test-' + str(args.job_index_start + count) + '\n')
        slurm_file.write('#SBATCH --output=' + 'slurm_output/test' + str(args.job_index_start + count) + '.out\n')
        slurm_file.write('#SBATCH --error=' + 'slurm_output/test' + str(args.job_index_start + count) + '.err\n')
        slurm_file.write('#SBATCH --nodes=1\n')
        slurm_file.write('#SBATCH --ntasks-per-node=1\n')
        slurm_file.write('#SBATCH --time 2:00:00\n')
        slurm_file.write('#SBATCH --partition=learnfair\n') #scavenge
        slurm_file.write('#SBATCH --gres=gpu:1\n')
        slurm_file.write('#SBATCH --cpus-per-task=20\n')
        slurm_file.write('# Module init\n')
        slurm_file.write('module purge\n')
        slurm_file.write('module load anaconda3\n')
        slurm_file.write('module load cuda/10.0 NCCL cudnn\n')
        slurm_file.write('source activate video_clone\n')
        slurm_file.write('srun --label slurm_script/' + str(args.job_index_start + count) + '.sh\n')
        slurm_file.close()
        cmd = 'sbatch submit.slurm'
        subprocess.call(cmd, shell=True)


if __name__ == '__main__':
    main()
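To make the two core steps of the script above easier to follow, here is a minimal standalone sketch of the job-splitting helper and the ordered-pair sampling, with no file system or SLURM access. The function `sample_id_pairs` and its parameters are hypothetical names introduced for illustration; the sketch also uses a local `random.Random` instance instead of the global seed, so it is not a drop-in replacement for the original.

```python
import random


def split(a, n):
    """Split sequence a into n contiguous chunks whose sizes differ by at most one."""
    k, m = divmod(len(a), n)
    return [a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


def sample_id_pairs(num_ids, num_pairs, seed=0):
    """Sample distinct ordered index pairs (i, j) with i != j (hypothetical helper)."""
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(num_ids) for j in range(num_ids) if i != j]
    return rng.sample(pairs, num_pairs)


# Ten commands spread over three jobs: the first chunk absorbs the remainder.
chunks = split(list(range(10)), 3)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# With 5 speakers there are 5 * 4 = 20 ordered pairs to sample from.
pairs = sample_id_pairs(num_ids=5, num_pairs=4)
print(len(pairs), all(i != j for i, j in pairs))  # 4 True
```

With the values from the usage comment in the script (2000 examples, 10 jobs), each job script would receive 200 commands.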
Hello! I trained from scratch and tested on my test set, and still have a 2-3 dB gap. I also tested the author's pre-trained models and got 9.8 dB, close to the author's 10.1 dB. So I was wondering whether batch size is the key problem.
My test set, described below, is created from the author's unseen-unheard test set. All samples are fully randomized, giving 1711 pairs: (59*59 - 59)/2.
The test split is attached below; you can run it and report your result here. Thank you. mixturess_1711_short.txt
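As a quick check of the test-set size quoted above, the sketch below counts unordered pairs of 59 items and compares the count with the (59*59 - 59)/2 expression. It assumes that 59 is the number of items being paired (for example, speakers); the ID strings are placeholders, not real VoxCeleb2 identifiers.

```python
from itertools import combinations

num_ids = 59  # taken from the (59*59 - 59)/2 expression; assumed to be the item count
ids = [f"id{n:05d}" for n in range(num_ids)]  # placeholder names for illustration only

# Unordered pairs: (a, b) and (b, a) count once, and no item is paired with itself.
unordered_pairs = list(combinations(ids, 2))
formula_count = (num_ids * num_ids - num_ids) // 2

print(len(unordered_pairs))  # 1711
print(formula_count)         # 1711
```

Note that this counts unordered pairs, whereas the evaluation script posted earlier in the thread samples from ordered pairs (i, j) with i != j.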
I've trained the model for up to 200 thousand epochs, but the SDR is only 7.4 on the unseen_unheard_test set. I wonder how long the model was trained for the paper. Because I have a limited number of GPUs, I can't use the paper's configuration. Any advice would help.
My training config is below:
--gpu_ids 0,1 \
--batchSize 10 \
--nThreads 16 \
--decay_factor 0.5 \
--num_batch 400000 \
--lr_steps 40000 80000 120000 160000 200000 \
--coseparation_loss_weight 0.01 \
--mixandseparate_loss_weight 1 \
--crossmodal_loss_weight 0.01 \
--lr_lipreading 0.0001 \
--lr_facial_attributes 0.00001 \
--lr_unet 0.0001 \
--lr_vocal_attributes 0.00001 \
--
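For a rough comparison of the setups mentioned in this thread, the sketch below multiplies batch size by number of batches to get the total number of training samples processed. The figures come from the comments above (128 x 50,000 for the author; `--batchSize 10` with `--num_batch 400000` in this config). Reading "200 thousand" as a batch count is an assumption, and the function name is made up for illustration. This is only arithmetic and does not show whether batch size explains the SDR gap.

```python
def samples_seen(batch_size, num_batches):
    """Total training samples processed: batch size times number of batches."""
    return batch_size * num_batches


author_run = samples_seen(128, 50_000)    # author's reported setup
full_config = samples_seen(10, 400_000)   # this config if run to --num_batch
so_far = samples_seen(10, 200_000)        # assumes "200 thousand" means batches

print(author_run, full_config, so_far)  # 6400000 4000000 2000000
```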