Open dalazymodder opened 1 month ago
What kind of test did you perform on the prosody code?
I took two different audio clips and changed this line (line 103)
z2 = model.encoder(codes[0], codes[1], timbre2, use_p_code=False, n_c=1)
to
z2 = model.encoder(codes2[0], codes[1], timbre, use_p_code=False, n_c=1)
This should make it output only the file with the prosody swapped between the two sound files, right?
But if you look at the resulting files, the unedited reconstruction and the new one, which should have different prosody, appear to be identical.
I also tried changing the line to the following, which changed both the content and the prosody:
z2 = model.encoder(codes[0], codes2[1], timbre, use_p_code=False, n_c=1)
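Whether two outputs are really identical can be checked numerically rather than by eye or ear. The following is a minimal sketch, assuming both outputs have already been loaded as mono NumPy arrays at the same sample rate; `compare_waveforms` and its `atol` threshold are hypothetical names chosen for illustration, not part of the repository.

```python
import numpy as np

def compare_waveforms(a, b, atol=1e-4):
    """Return simple difference statistics for two mono waveforms."""
    n = min(len(a), len(b))  # compare only the overlapping region
    a = np.asarray(a[:n], dtype=np.float64)
    b = np.asarray(b[:n], dtype=np.float64)
    diff = a - b
    return {
        "max_abs_diff": float(np.max(np.abs(diff))) if n else 0.0,
        "rms_diff": float(np.sqrt(np.mean(diff ** 2))) if n else 0.0,
        "identical": bool(np.allclose(a, b, atol=atol)),
    }

# Synthetic stand-ins for two output files (not real model output).
t = np.arange(24000) / 24000.0
out_a = np.sin(2 * np.pi * 220.0 * t)
out_b = out_a + 0.1  # deliberately different signal
same = compare_waveforms(out_a, out_a)
different = compare_waveforms(out_a, out_b)
```

If `identical` is true for the unedited reconstruction and the swapped output, the swapped code had no measurable effect on the waveform; a small but nonzero `rms_diff` would suggest a change that is simply hard to hear.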
Sorry if I got anything wrong, I'm a novice at this, but isn't the prosody kind of like the emotion and timing of the speech?
I did add some lines to pad both audio files to the same length, but I don't think that should affect the prosody.
import librosa
import numpy as np
import torch

def main(args):
    source = args.source
    target = args.target
    source_audio = librosa.load(source, sr=24000)[0]
    ref_audio = librosa.load(target, sr=24000)[0]
    # Find the length of the longer clip and add a small buffer (1 second)
    max_length = max(len(source_audio), len(ref_audio))
    target_length = max_length + 24000  # Add 1 second (24000 samples at 24 kHz)
    # Pad both clips with trailing zeros to the target length
    source_audio = np.pad(source_audio, (0, target_length - len(source_audio)), mode='constant')
    ref_audio = np.pad(ref_audio, (0, target_length - len(ref_audio)), mode='constant')
    # Convert to torch tensors (`device` is defined elsewhere in the script)
    source_audio = torch.tensor(source_audio).unsqueeze(0).float().to(device)
    ref_audio = torch.tensor(ref_audio).unsqueeze(0).float().to(device)
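To rule the padding in or out as a factor, it may help to keep track of the original lengths so the appended silence can be removed again before comparing. This is a small sketch using only NumPy; `pad_pair` is a hypothetical helper, and the arrays are synthetic stand-ins rather than real audio.

```python
import numpy as np

def pad_pair(a, b, extra=0):
    """Zero-pad two 1-D arrays at the end to a common length.

    Returns the padded arrays plus the original lengths, so that
    results can later be trimmed back to the unpadded duration.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    target = max(len(a), len(b)) + extra
    a_pad = np.pad(a, (0, target - len(a)), mode="constant")
    b_pad = np.pad(b, (0, target - len(b)), mode="constant")
    return a_pad, b_pad, (len(a), len(b))

# Synthetic stand-ins for two clips of different length (not real audio).
clip_a = np.ones(5, dtype=np.float32)
clip_b = np.ones(8, dtype=np.float32)
padded_a, padded_b, (len_a, len_b) = pad_pair(clip_a, clip_b, extra=2)
trimmed_a = padded_a[:len_a]  # drop the appended silence again
```

Running the experiment once with `extra=0` and once with the one-second buffer would show whether the trailing silence changes the result. Trimming a model output this way assumes it is sample-aligned with its input, which the thread does not confirm.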
Thanks for your experiment. It is very helpful for us in understanding what exactly the prosody component represents. I will try to replicate your experiment myself and will let you know if I can offer any explanation.
I tried testing the code some more, specifically for prosody, but it seemed like the prosody was tied to codes[1] along with the content?
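One way to test for prosody specifically is to compare pitch contours of two outputs instead of listening. The sketch below is a crude autocorrelation-based F0 estimate in plain NumPy, not the method used by the repository; `frame_f0_autocorr`, `f0_contour`, and the frame, hop, and frequency-range defaults are all assumptions made for illustration. A dedicated tracker such as `librosa.pyin` would likely be more robust on real speech.

```python
import numpy as np

def frame_f0_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Rough F0 estimate (Hz) for one frame via the autocorrelation peak."""
    frame = np.asarray(frame, dtype=np.float64)
    frame = frame - frame.mean()
    if not np.any(frame):
        return 0.0  # silent frame: report no pitch
    # Full autocorrelation; keep non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), len(ac) - 1)
    if lag_max <= lag_min:
        return 0.0
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sr / lag

def f0_contour(audio, sr, frame_len=2048, hop=512):
    """F0 estimate per frame; a crude stand-in for a prosody contour."""
    return np.array([
        frame_f0_autocorr(audio[i:i + frame_len], sr)
        for i in range(0, len(audio) - frame_len + 1, hop)
    ])

# Synthetic test tones standing in for two outputs (not real speech).
sr = 24000
t = np.arange(sr) / sr
contour_low = f0_contour(np.sin(2 * np.pi * 220.0 * t), sr)
contour_high = f0_contour(np.sin(2 * np.pi * 330.0 * t), sr)
```

If the contours of the unedited reconstruction and the swapped output are nearly the same frame by frame, that would be consistent with the observation that swapping `codes[0]` alone does not change the prosody.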