Plachtaa / FAcodec

Training code for FAcodec presented in NaturalSpeech3

Does the prosody codes[0] work? #10

Open dalazymodder opened 1 month ago

dalazymodder commented 1 month ago

I tried to test the code specifically for prosody, but it seemed like the prosody was tied to codes[1], together with the content?

Plachtaa commented 1 month ago

What kind of test did you perform on the prosody code?

dalazymodder commented 1 month ago

I took two different sound clips and changed the line

z2 = model.encoder(codes[0], codes[1], timbre2, use_p_code=False, n_c=1)

(line 103) to

z2 = model.encoder(codes2[0], codes[1], timbre, use_p_code=False, n_c=1)

This should make it output a file in which only the prosody has been swapped between the two sound files, right?

But if you look at the resulting files, the unedited reconstruction and the new one, which should have different prosody, appear to be identical.

image
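A visual comparison can hide small differences, so a numeric check may help confirm whether the two outputs really are identical. The following is a minimal sketch, assuming both reconstructions are available as mono NumPy arrays; `waveform_difference` is a hypothetical helper, not part of the FAcodec repository.

```python
import numpy as np

def waveform_difference(a, b):
    """Return (max_abs_diff, snr_db) between two mono waveforms.

    Both inputs are truncated to the shorter length before comparing.
    snr_db treats `a` as the signal and `a - b` as the noise.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    diff = a - b
    max_abs = float(np.max(np.abs(diff))) if n else 0.0
    signal_power = float(np.sum(a ** 2))
    noise_power = float(np.sum(diff ** 2))
    if noise_power == 0.0:
        snr_db = float("inf")      # sample-for-sample identical
    elif signal_power == 0.0:
        snr_db = float("-inf")     # reference is silent, other is not
    else:
        snr_db = float(10.0 * np.log10(signal_power / noise_power))
    return max_abs, snr_db
```

If the maximum absolute difference is zero or at floating-point noise level, the swapped code had no effect on the output; a clearly non-zero value would mean the files differ even if the plots look alike.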

I also tried changing the code to this, which changed both content and prosody.

z2 = model.encoder(codes[0], codes2[1], timbre, use_p_code=False, n_c=1)

image
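The two experiments above can be organised as a small grid that decodes every pairing of prosody and content codes and reports which outputs differ from the baseline. This is a rough sketch under stated assumptions: `run_swap_grid`, `changed_vs_baseline`, and `toy_decode` are hypothetical names, and `toy_decode` is a stand-in that deliberately ignores its prosody input. It is not the FAcodec model; it only shows what the reported pattern would look like in such a grid.

```python
import itertools
import numpy as np

def run_swap_grid(decode_fn, prosody, content, timbre):
    """Decode every (prosody, content) pairing drawn from utterances 'a' and 'b'.

    `prosody` and `content` are dicts with keys 'a' and 'b' holding code arrays.
    Returns {(prosody_source, content_source): decoded output}.
    """
    outputs = {}
    for p_src, c_src in itertools.product("ab", repeat=2):
        outputs[(p_src, c_src)] = decode_fn(prosody[p_src], content[c_src], timbre)
    return outputs

def changed_vs_baseline(outputs, baseline=("a", "a"), atol=1e-6):
    """Map each non-baseline pairing to True if its output differs from the baseline."""
    ref = outputs[baseline]
    return {
        key: not np.allclose(out, ref, atol=atol)
        for key, out in outputs.items()
        if key != baseline
    }

# Stand-in decoder (an assumption for illustration, NOT FAcodec):
# it ignores the prosody codes entirely.
def toy_decode(prosody_codes, content_codes, timbre):
    return content_codes.astype(np.float64) + timbre

prosody = {"a": np.zeros(50, dtype=int), "b": np.ones(50, dtype=int)}
content = {"a": np.arange(50), "b": np.arange(50)[::-1]}

report = changed_vs_baseline(run_swap_grid(toy_decode, prosody, content, timbre=0.5))
for (p_src, c_src), changed in report.items():
    print(f"prosody from {p_src}, content from {c_src}: changed={changed}")
```

With the real model, `decode_fn` would wrap the `model.encoder(...)` call and whatever decoding follows it. A grid where only the content swaps register as changed matches the behaviour described in this issue.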

Sorry if I got anything wrong, I'm a novice at this, but isn't prosody roughly the emotion and timing of the speech?
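Prosody is usually described in terms of pitch (F0), loudness, and timing, so one way to test whether an output's prosody changed is to compare per-frame pitch and energy contours. Below is a minimal NumPy-only sketch using a crude autocorrelation pitch estimate. The function names, frame sizes, and F0 search range are assumptions chosen for illustration, and this is not how FAcodec itself represents prosody.

```python
import numpy as np

def frame_f0(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate F0 (Hz) of one frame from its autocorrelation peak; 0.0 if near-silent.

    The frame must be longer than sr / fmin samples.
    """
    frame = frame - np.mean(frame)
    if np.max(np.abs(frame)) < 1e-4:
        return 0.0
    # Keep non-negative lags only
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sr / lag

def prosody_contours(y, sr, frame_len=2048, hop=512):
    """Return (f0, rms) arrays with one value per analysis frame."""
    y = np.asarray(y, dtype=np.float64)
    f0, rms = [], []
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len]
        f0.append(frame_f0(frame, sr))
        rms.append(float(np.sqrt(np.mean(frame ** 2))))
    return np.array(f0), np.array(rms)
```

Running `prosody_contours` on the baseline reconstruction and on the swapped output, then comparing the two F0 and RMS arrays, gives a rough indication of whether pitch or loudness moved. A dedicated pitch tracker would be more robust on real speech.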

I did add some lines to pad both audio files to the same length, but I don't think that should affect the prosody.

import librosa
import numpy as np
import torch

def main(args):
    source = args.source
    target = args.target
    source_audio = librosa.load(source, sr=24000)[0]
    ref_audio = librosa.load(target, sr=24000)[0]

    # Find the length of the longer audio and add a small buffer (1 second)
    max_length = max(len(source_audio), len(ref_audio))
    target_length = max_length + 24000  # Add 1 second (24000 samples at 24 kHz)

    # Pad both audios to the target length
    source_audio = np.pad(source_audio, (0, target_length - len(source_audio)), mode='constant')
    ref_audio = np.pad(ref_audio, (0, target_length - len(ref_audio)), mode='constant')

    # Convert to torch tensors (`device` is defined elsewhere in the script)
    source_audio = torch.tensor(source_audio).unsqueeze(0).float().to(device)
    ref_audio = torch.tensor(ref_audio).unsqueeze(0).float().to(device)
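Zero-padding appends trailing silence, so after a swap the silent frames of the shorter clip are paired with speech frames of the longer one. Whether that matters for FAcodec is not established in this thread, but keeping the original lengths makes it possible to trim the padded tail before comparing outputs. A minimal sketch, with `pad_pair` and `padded_fraction` as hypothetical helper names:

```python
import numpy as np

def pad_pair(a, b, extra=0):
    """Zero-pad two 1-D arrays to max(len(a), len(b)) + extra samples.

    Returns (a_padded, b_padded, (len(a), len(b))) so callers can trim later.
    """
    target = max(len(a), len(b)) + extra
    a_pad = np.pad(a, (0, target - len(a)), mode="constant")
    b_pad = np.pad(b, (0, target - len(b)), mode="constant")
    return a_pad, b_pad, (len(a), len(b))

def padded_fraction(orig_len, padded_len):
    """Fraction of the padded signal that is appended silence."""
    return 1.0 - orig_len / padded_len
```

For example, `pad_pair(source_audio, ref_audio, extra=24000)` reproduces the one-second buffer used above, and slicing a reconstruction to the returned original length removes the padded region before any comparison.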

Plachtaa commented 1 month ago

Thanks for your experiment. It is very helpful for understanding what exactly the prosody component represents. I will try to replicate your experiment myself and will let you know if I can offer an explanation.