benfmiller / audalign

Package for aligning audio files through audio fingerprinting
MIT License

How to sync and align one audio file wrt another audio file? #51

Closed akankshasingh25 closed 4 months ago

akankshasingh25 commented 9 months ago

I have a source audio file from a video and a target audio file, which is a cloned audio of the source audio file. I am trying to sync the cloned audio onto the original video. I tried align_files, but the saved final file has two channels (both source and target audio).

How do we align the target with the source? Thanks in advance for your help; so far, this repo gives the best results for the alignment task I am trying.

akankshasingh25 commented 9 months ago

Nvm, found the answer. The resulting audios are aligned. Closing the issue. Thanks for a great contribution.

benfmiller commented 9 months ago

Thanks! I'm glad it's useful for you

akankshasingh25 commented 5 months ago

Hi @benfmiller, when I align using all recognizers (the visual recognizer fails), silence is added at the start of the cloned audio file. What could cause the added silence and the bad alignment between the source and cloned files? The total.wav output is completely mismatched.

benfmiller commented 4 months ago

Adding silence at the start is the intended way to align files so that they are in sync when played together.
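
As a rough illustration of that idea (a minimal sketch in plain Python, not audalign's actual code): once each file has a known start offset, the file that starts later gets that many zero samples prepended so that playback lines up. The function name `pad_to_align` and the list-of-samples representation are assumptions made for this example.

```python
def pad_to_align(tracks, offsets):
    """Prepend silence (zeros) so every track starts at its offset.

    tracks:  dict of name -> list of samples
    offsets: dict of name -> start offset in samples (>= 0)
    Both arguments are hypothetical structures used only for illustration.
    """
    aligned = {}
    for name, samples in tracks.items():
        aligned[name] = [0.0] * offsets[name] + list(samples)
    # Pad the tails as well, so all tracks end up the same length.
    total = max(len(samples) for samples in aligned.values())
    for samples in aligned.values():
        samples.extend([0.0] * (total - len(samples)))
    return aligned


source = [0.0, 0.0, 0.0, 1.0, 0.5, -0.5]  # event begins at sample 3
clone = [1.0, 0.5, -0.5]                   # same event, no lead-in
aligned = pad_to_align(
    {"source": source, "clone": clone},
    {"source": 0, "clone": 3},
)
```

Here the clone receives three samples of leading silence, so its event lands on the same sample index as in the source. If the estimated offset is wrong, the padding is still added but the files will not line up, which matches the symptom described above.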

The visual recognizer only works for rarer use cases. In general, it should only be used when the other recognizers can't produce a good enough alignment. It works by comparing the spectrogram of the audio files and calculating image similarity between the spectrograms. Spectrograms vary subtly based on the offset of the audio, and the comparison gives equal weight to frequencies and audio characteristics of the comparison timeframe, so there is a lot of variance.
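
The spectrogram comparison can be pictured with a small sketch. This is not the recognizer's implementation: it uses a naive DFT and mean squared error as a stand-in similarity measure, and the helper names (`spectrogram`, `mean_squared_error`, `tone`) are made up for the example. It only shows why every frequency bin contributes equally to the score.

```python
import cmath
import math


def spectrogram(samples, frame=8):
    """Magnitude spectrogram: one column of DFT magnitudes per frame.

    Frames are non-overlapping; a real spectrogram would normally use
    overlapping, windowed frames and an FFT.
    """
    columns = []
    for start in range(0, len(samples) - frame + 1, frame):
        chunk = samples[start:start + frame]
        column = []
        for k in range(frame // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame)
                for n, x in enumerate(chunk)
            )
            column.append(abs(total))
        columns.append(column)
    return columns


def mean_squared_error(image_a, image_b):
    """Lower is more similar; every frequency bin gets equal weight."""
    diffs = [
        (a - b) ** 2
        for col_a, col_b in zip(image_a, image_b)
        for a, b in zip(col_a, col_b)
    ]
    return sum(diffs) / len(diffs)


def tone(freq_bin, length, frame=8):
    """Sine wave whose energy falls into a single DFT bin."""
    return [math.sin(2 * math.pi * freq_bin * n / frame) for n in range(length)]


low = spectrogram(tone(1, 32))
low_again = spectrogram(tone(1, 32))
high = spectrogram(tone(3, 32))

same_score = mean_squared_error(low, low_again)   # identical images
different_score = mean_squared_error(low, high)   # energy in different bins
```

Identical spectrograms give an error of zero, while the two different tones give a larger one. Shifting the audio by a fraction of a frame changes each column slightly, which is one source of the variance mentioned above.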

I've found the visual recognizer to be best for situations where the overall soundscape between two audio files is the only defining feature in common, like recordings of ambient sound or atonal events.

akankshasingh25 commented 4 months ago

The silence added at the start usually doesn't align the two audio files. Could the alignment be failing because of noise or unclear utterances introduced by the cloning or TTS models? I am using the correlation recognizer and the correlation spectrogram recognizer for alignment and fine alignment. Which recognizer would theoretically be ideal for a 20-second-long cloned or TTS-generated audio file?

Thanks in advance for the help, and sorry for the trouble.

benfmiller commented 4 months ago

By cloned or TTS-generated, do you mean they are generated from the same text input, or are they copies of the same audio files? Correlation works best with copies of the same audio events, fingerprinting works best with similar but different audio, and correlation spectrogram is somewhere in between.
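
To make the correlation case concrete, here is a minimal brute-force cross-correlation sketch in plain Python. It is not audalign's code, and `best_lag` is a hypothetical helper; it only shows why correlation does well when one file contains a copy of the same waveform as the other.

```python
def best_lag(reference, other):
    """Return the lag (in samples) at which `other` best matches `reference`.

    A positive lag means `other` should start `lag` samples after the
    start of `reference`. Brute force, O(len(reference) * len(other)).
    """
    best, best_score = 0, float("-inf")
    for lag in range(-(len(other) - 1), len(reference)):
        score = 0.0
        for i, value in enumerate(other):
            j = i + lag
            if 0 <= j < len(reference):
                score += reference[j] * value
        if score > best_score:
            best, best_score = lag, score
    return best


reference = [0.0, 0.0, 0.0, 1.0, 0.5, -0.5, 0.0]
copy_of_event = [1.0, 0.5, -0.5]  # exact copy of the event in `reference`
lag = best_lag(reference, copy_of_event)
```

The score peaks sharply at lag 3 because the waveforms match sample for sample. Two TTS renderings of the same text share words but not waveforms, so that peak can be weak or misplaced, which is where fingerprinting tends to be the better fit. For real audio lengths an FFT-based correlation would be used instead of this double loop.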

Fingerprinting tends to work best with speech. It is very tonally dependent and has the most tunable parameters. There are some preset accuracy level settings to try, and the hash_style could also be useful to tune (see the fingerprinting config). The locality parameter can also help if there are audio characteristics like noise that would interfere with alignments.

It can also be helpful to run the remove_noise functions and/or uniform_level on the audio before alignment.

What rankings are you getting on the alignment results?

No worries! Sorry for the late reply

akankshasingh25 commented 4 months ago

Thank you for your reply. I will close this issue now. You have been very helpful. I will try the things you mentioned and may post an update later.