gazugafan opened this issue 3 years ago
Exactly the same applies to me. With some pre- and post-processing, the results are truly incredible. Regarding Spleeter, I thought about it too, but after reading their papers, I think it may not be as useful as one might expect. See here
I may be completely wrong though, as I have nothing to do with math / IT / neural networks.
I took a quick shot at trying it out myself, but honestly it's a bit over my head as well. My instinct was just telling me that the vocals (being sung, and not spoken like a normal speech-to-text dataset) must be the key to it working so well. And if that's the case, the accompaniment being there would just get in the way. Maybe the accompaniment actually does help, though. Interesting!
Can I ask what pre- and post-processing you're doing, @swanux? I'm trying out lots of combinations right now, but haven't figured out what works best.
Hi Ken,
Thanks for the feedback. We did try training the acoustic model with extracted vocals. However, we used Sony's Open-Unmix, not Spleeter (Open-Unmix and Spleeter performances are reported to be similar). The alignment results were comparable to using polyphonic (mixed) audio. We will publish these results soon.
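For anyone curious what that extraction step looks like, here's a minimal sketch using the `openunmix` pip package; the exact argument names and tensor shapes may vary between versions:

```python
# A minimal sketch of vocal extraction with Open-Unmix, assuming the
# `openunmix` pip package; argument names/shapes may differ by version.
import torchaudio
from openunmix import predict

audio, rate = torchaudio.load("song.wav")  # (channels, samples)
estimates = predict.separate(
    audio=audio,
    rate=rate,
    targets=["vocals"],  # only the vocal stem is needed for alignment
    residual=True,       # everything else gets lumped into "residual"
)
# estimates maps target name -> tensor of shape (1, channels, samples);
# Open-Unmix models operate at 44.1 kHz internally.
torchaudio.save("vocals.wav", estimates["vocals"].squeeze(0), 44100)
```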
@gazugafan First of all, I'd like to state that I didn't touch the model itself at all; what I did is rather scripting / basic automation.
What I referred to as "preprocessing" is basically two things:
My so-called post-processing is actually only one thing:
With these, most of the songs can be aligned without manual corrections, only providing the artist, the sound file, and the track names in a txt file for batch processing. See here my (now archived) project which used this method (video demo at the bottom).
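For illustration, that kind of batch driver could be as simple as the sketch below; the tab-separated `tracks.txt` format and the `align_song()` helper are hypothetical stand-ins, not swanux's actual scripts:

```python
# A hypothetical batch driver in the spirit described above; the txt
# format ("artist<TAB>audio file<TAB>track name" per line) and
# align_song() are made-up stand-ins for the real scripts.
from pathlib import Path

def align_song(artist: str, audio_file: Path, track: str) -> None:
    """Placeholder: fetch lyrics, run AutoLyrixAlign, post-process."""
    print(f"aligning {artist} - {track} ({audio_file})")

for line in Path("tracks.txt").read_text().splitlines():
    if not line.strip():
        continue  # skip blank lines
    artist, audio_file, track = line.split("\t")
    align_song(artist, Path(audio_file), track)
```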
Thanks @swanux and @chitralekha18! I think I'm going to wrap this project in a simple Node Express API, with some pre/post-processing baked in to make it more resilient. Exciting stuff!
Just finished that API wrapper I mentioned! It's over here if anyone is interested... https://github.com/gazugafan/AutoLyrixAlignService
Hopefully it'll make this project easier for people to use. It should be able to handle any lyrics you throw at it, no matter how many special characters, extra lines, etc. are included. And it takes the results and matches them back up with the lyrics as they were originally entered. So you get all the original punctuation and spacing, and you even get naive timings added to things like [Chorus], (woo), etc.
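The matching step works roughly like the sketch below (a simplification, not the service's actual code): normalize both token streams, pair them up in order, and fall back to a naive timestamp for anything the aligner never saw:

```python
# A minimal sketch of matching aligner output back onto the lyrics as
# originally entered. Aligned words are assumed to arrive in order as
# (word, start, end) tuples.
import re

def norm(token: str) -> str:
    """Lowercase and strip everything except letters, digits, apostrophes."""
    return re.sub(r"[^a-z0-9']", "", token.lower())

def merge(original_text: str, aligned: list[tuple[str, float, float]]):
    out, i = [], 0
    for token in original_text.split():
        if i < len(aligned) and norm(token) == norm(aligned[i][0]):
            out.append((token, aligned[i][1], aligned[i][2]))
            i += 1
        else:
            # [Chorus], (woo), etc. never reach the aligner; give them a
            # naive timing by reusing the previous word's end time.
            prev_end = out[-1][2] if out else 0.0
            out.append((token, prev_end, prev_end))
    return out
```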
Not to hijack this thread too much, but if you're interested in automating some of your workflow @gazugafan , you may want to check out this project I've built:
Nice! Looks like your approach to the alignment process is about the same as mine: clean up the lyrics for AutoLyrixAlign, send them to AutoLyrixAlign by running a new Singularity process, and then match the results back up to the original lyrics.
I thought the process of fetching lyrics would be best left outside this. In my project, the user searches for a YouTube video, we try to parse out the artist and song title from the video title, and then look up the lyrics on Genius. But... that's not always going to work exactly right, and sometimes lyrics just won't be available. So, there are some small interactive bits in between where the user can confirm we got it right, correct the artist and song title if not, or even enter their own lyrics if they want.
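That lookup step can be sketched like this, using the community `lyricsgenius` package; the "GENIUS_TOKEN" string and the title-parsing heuristic are placeholders:

```python
# A sketch of the "guess artist/title, then ask Genius" step; the token
# and the dash-splitting heuristic are assumptions, not the real code.
import lyricsgenius

def parse_video_title(video_title: str) -> tuple[str, str]:
    """Assume the common "Artist - Title" convention; fall back to the
    whole string as the song title if there's no dash."""
    artist, _, title = video_title.partition(" - ")
    return (artist, title) if title else ("", video_title)

artist, title = parse_video_title("Rick Astley - Never Gonna Give You Up")
genius = lyricsgenius.Genius("GENIUS_TOKEN")
song = genius.search_song(title, artist)  # may be None: let the user fix it
if song:
    print(song.lyrics)
```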
That sounds like the same approach.
Indeed. Lyric lookup is the type of problem one might easily think is simple, but turns out to be surprisingly hard. Having interactive bits to allow the user to guide the process sounds like a great idea @gazugafan ! I've only had the time/energy to automate the approach to eventually align lyrics for a larger collection of audio.
If you have any interest in visualizing the lyrics, let me know: https://www.youtube.com/watch?v=_J1hhTWgCXM
@Gazoo101 Hey, I've seen your video and it's amazing. Is it fully automated? Can you give me some hints on how to make it?
@yacaeh Much appreciated! Once the appropriate lyric file has been generated, the visuals are fully automated and customizable during a performance.
As the video description notes, it's demo output from PlanmixPlay.
Assuming by 'how to make it' you're referring to generating the visuals, that's fairly straightforward: generate a `.aligned_lyrics` file using my lyric manager tool. Let me know how you go!
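If it helps, consuming such a file might look like the sketch below; the actual schema the lyric manager emits may well differ, so the JSON layout here is purely an assumption:

```python
# A guess at reading an .aligned_lyrics file; the real schema produced
# by the lyric manager may differ. Here it is assumed to be JSON with a
# "words" list of {"text", "time_start", "time_end"} entries.
import json

with open("song.aligned_lyrics") as f:
    data = json.load(f)

for word in data["words"]:  # hypothetical key
    print(f'{word["time_start"]:7.2f}s  {word["text"]}')
```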
I've been testing this out to automatically generate force-aligned karaoke lyrics for an open source project. So far the results are super impressive! I've tested a TON of options, and this is by far the best I've found. Nothing comes close!
My basic workflow is...
1. Isolate the vocals using Spleeter. This works really well and leaves you with separate WAV files for the vocals and the accompaniment.
2. Look up the lyrics on Genius.
3. Supply AutoLyrixAlign with the lyrics and the original polyphonic music file to get timestamped words.
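Condensed into code, the workflow looks roughly like this sketch; the Spleeter CLI call is real, while the Genius token and the AutoLyrixAlign invocation are placeholders (see the repo's README for the actual Singularity command):

```python
# A rough sketch of the three steps above. The Spleeter CLI call uses
# 2.x syntax (1.x used -i); the AutoLyrixAlign command is schematic.
import subprocess
import lyricsgenius

song_file = "song.mp3"

# 1) Isolate the vocals (writes out/song/vocals.wav + accompaniment.wav).
subprocess.run(
    ["spleeter", "separate", "-p", "spleeter:2stems", "-o", "out", song_file],
    check=True,
)

# 2) Look up the lyrics on Genius ("GENIUS_TOKEN" is a placeholder).
genius = lyricsgenius.Genius("GENIUS_TOKEN")
song = genius.search_song("Some Title", "Some Artist")
with open("lyrics.txt", "w") as f:
    f.write(song.lyrics)

# 3) Align lyrics against the ORIGINAL polyphonic audio. This script
#    must run inside the project's Singularity image; see the README.
subprocess.run(["./RunAlignment.sh", song_file, "lyrics.txt", "aligned.txt"],
               check=True)
```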
It works pretty great! I started wondering, though... since Spleeter is fairly new and seems to work really well... have you considered training a new model on just the isolated vocals? Would that give even more accurate results?
In other words... isolate the vocals by processing all of the songs in the dataset with Spleeter first, and then train the same way you did before (but using just the isolated vocals instead of the original polyphonic audio). And of course, when running the alignment, be sure to pre-process the input using Spleeter (or assume the input is already isolated vocals from Spleeter).
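The dataset-preprocessing half of that idea is a short loop with Spleeter's Python API; the directory paths here are invented for illustration:

```python
# A sketch of the proposed dataset preprocessing: run every training
# song through Spleeter once and keep only the vocal stems for training.
from pathlib import Path
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")  # vocals + accompaniment

for song in sorted(Path("dataset/songs").glob("*.wav")):
    # Writes dataset/vocals/<song name>/vocals.wav (and accompaniment.wav).
    separator.separate_to_file(str(song), "dataset/vocals")
```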
What do you think? Is this a crazy idea?