gazugafan opened this issue 3 years ago
Exactly the same applies to me. With some pre- and post-processing, the results are truly incredible. Regarding Spleeter, I thought about it too, but after reading their papers, I think it may not be as useful as one might expect. See here
I may be completely wrong though, as I have nothing to do with math / IT / neural networks.
I took a quick shot at trying it out myself, but honestly it's a bit over my head as well. My instinct was just telling me that the vocals (being sung, and not spoken like a normal speech-to-text dataset) must be the key to it working so well. And if that's the case, the accompaniment being there would just get in the way. Maybe the accompaniment actually does help, though. Interesting!
Can I ask what pre- and post-processing you're doing, @swanux? I'm trying out lots of combinations right now, but haven't figured out what works best.
Hi Ken,
Thanks for the feedback. We did try training the acoustic model with extracted vocals. However, we used Sony's Open-Unmix, not Spleeter (Open-Unmix and Spleeter performances are reported to be similar). The alignment results were comparable to using polyphonic (mixed) audio. We will publish these results soon.
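For anyone curious what that extraction step looks like, here's a minimal sketch using the `openunmix` pip package; the exact argument names and tensor shapes may vary between versions:

```python
# A minimal sketch of vocal extraction with Open-Unmix, assuming the
# `openunmix` pip package; argument names/shapes may differ by version.
import torchaudio
from openunmix import predict

audio, rate = torchaudio.load("song.wav")  # (channels, samples)
estimates = predict.separate(
    audio=audio,
    rate=rate,
    targets=["vocals"],  # only the vocal stem is needed for alignment
    residual=True,       # everything else gets lumped into "residual"
)
# estimates maps target name -> tensor of shape (1, channels, samples);
# Open-Unmix models operate at 44.1 kHz internally.
torchaudio.save("vocals.wav", estimates["vocals"].squeeze(0), 44100)
```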
@gazugafan First of all, I'd like to state that I didn't touch the model itself at all; what I did is rather scripting / basic automation.
What I referred to as "preprocessing" is basically two things:
My so-called post-processing is actually only one thing:
With these, most of the songs can be aligned without manual corrections, only providing the artist, the sound file, and the track names in a txt file for batch processing. See here my (now archived) project which used this method (video demo at the bottom).
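For illustration, that kind of batch driver could be as simple as the sketch below; the tab-separated `tracks.txt` format and the `align_song()` helper are hypothetical stand-ins, not swanux's actual scripts:

```python
# A hypothetical batch driver in the spirit described above; the txt
# format ("artist<TAB>audio file<TAB>track name" per line) and
# align_song() are made-up stand-ins for the real scripts.
from pathlib import Path

def align_song(artist: str, audio_file: Path, track: str) -> None:
    """Placeholder: fetch lyrics, run AutoLyrixAlign, post-process."""
    print(f"aligning {artist} - {track} ({audio_file})")

for line in Path("tracks.txt").read_text().splitlines():
    if not line.strip():
        continue  # skip blank lines
    artist, audio_file, track = line.split("\t")
    align_song(artist, Path(audio_file), track)
```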
Thanks @swanux and @chitralekha18! I think I'm going to wrap this project in a simple Node Express API, with some pre/post-processing baked in to make it more resilient. Exciting stuff!
Just finished that API wrapper I mentioned! It's over here if anyone is interested... https://github.com/gazugafan/AutoLyrixAlignService
Hopefully it'll make this project easier for people to use. It should be able to handle any lyrics you throw at it, no matter how many special characters, extra lines, etc. are included. And it takes the results and matches them back up with the lyrics as they were originally entered. So you get all the original punctuation and spacing, and you even get naive timings added to things like [Chorus], (woo), etc.
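The matching step works roughly like the sketch below (a simplification, not the service's actual code): normalize both token streams, pair them up in order, and fall back to a naive timestamp for anything the aligner never saw:

```python
# A minimal sketch of matching aligner output back onto the lyrics as
# originally entered. Aligned words are assumed to arrive in order as
# (word, start, end) tuples.
import re

def norm(token: str) -> str:
    """Lowercase and strip everything except letters, digits, apostrophes."""
    return re.sub(r"[^a-z0-9']", "", token.lower())

def merge(original_text: str, aligned: list[tuple[str, float, float]]):
    out, i = [], 0
    for token in original_text.split():
        if i < len(aligned) and norm(token) == norm(aligned[i][0]):
            out.append((token, aligned[i][1], aligned[i][2]))
            i += 1
        else:
            # [Chorus], (woo), etc. never reach the aligner; give them a
            # naive timing by reusing the previous word's end time.
            prev_end = out[-1][2] if out else 0.0
            out.append((token, prev_end, prev_end))
    return out
```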
Not to hijack this thread too much, but if you're interested in automating some of your workflow @gazugafan , you may want to check out this project I've built:
Nice! Looks like your approach to the alignment process is about the same as mine: clean up the lyrics for AutoLyrixAlign, send them to AutoLyrixAlign by running a new Singularity process, and then match the results back up to the original lyrics.
I thought the process of fetching lyrics would be best left outside this. In my project, the user searches for a YouTube video, we try to parse out the artist and song title from the video title, and then look up the lyrics on Genius. But... that's not always going to work exactly right, and sometimes lyrics just won't be available. So, there are some small interactive bits in between where the user can confirm we got it right, correct the artist and song title if not, or even enter their own lyrics if they want.
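That lookup step can be sketched like this, using the community `lyricsgenius` package; the "GENIUS_TOKEN" string and the title-parsing heuristic are placeholders:

```python
# A sketch of the "guess artist/title, then ask Genius" step; the token
# and the dash-splitting heuristic are assumptions, not the real code.
import lyricsgenius

def parse_video_title(video_title: str) -> tuple[str, str]:
    """Assume the common "Artist - Title" convention; fall back to the
    whole string as the song title if there's no dash."""
    artist, _, title = video_title.partition(" - ")
    return (artist, title) if title else ("", video_title)

artist, title = parse_video_title("Rick Astley - Never Gonna Give You Up")
genius = lyricsgenius.Genius("GENIUS_TOKEN")
song = genius.search_song(title, artist)  # may be None: let the user fix it
if song:
    print(song.lyrics)
```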
That sounds like the same approach.
Indeed. Lyric lookup is the type of problem one might easily think is simple, but turns out to be surprisingly hard. Having interactive bits to allow the user to guide the process sounds like a great idea @gazugafan ! I've only had the time/energy to automate the approach to eventually align lyrics for a larger collection of audio.
If you have any interest in visualizing the lyrics, let me know: https://www.youtube.com/watch?v=_J1hhTWgCXM
@Gazoo101 Hey, I've seen your video and it's amazing. Is it fully automated? Can you give me some hints on how to make it?
@yacaeh Much appreciated! Once the appropriate lyric file has been generated, the visuals are fully automated and customizable during a performance.
As the video description notes, it's demo output from PlanmixPlay.
Assuming by 'how to make it' you're referring to generating the visuals, that's fairly straightforward: generate a `.aligned_lyrics` file using my lyric manager tool. Let me know how you go!
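If it helps, consuming such a file might look like the sketch below; the actual schema the lyric manager emits may well differ, so the JSON layout here is purely an assumption:

```python
# A guess at reading an .aligned_lyrics file; the real schema produced
# by the lyric manager may differ. Here it is assumed to be JSON with a
# "words" list of {"text", "time_start", "time_end"} entries.
import json

with open("song.aligned_lyrics") as f:
    data = json.load(f)

for word in data["words"]:  # hypothetical key
    print(f'{word["time_start"]:7.2f}s  {word["text"]}')
```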
I've been testing this out to automatically generate force-aligned karaoke lyrics for an open source project. So far the results are super impressive! I've tested a TON of options, and this is by far the best I've found. Nothing comes close!
My basic workflow is...
1. Isolate the vocals using Spleeter. This works really well and leaves you with separate WAV files for the vocals and the accompaniment.
2. Look up the lyrics on Genius.
3. Supply AutoLyrixAlign with the lyrics and the original polyphonic music file to get timestamped words.
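Condensed into code, the workflow looks roughly like this sketch; the Spleeter CLI call is real, while the Genius token and the AutoLyrixAlign invocation are placeholders (see the repo's README for the actual Singularity command):

```python
# A rough sketch of the three steps above. The Spleeter CLI call uses
# 2.x syntax (1.x used -i); the AutoLyrixAlign command is schematic.
import subprocess
import lyricsgenius

song_file = "song.mp3"

# 1) Isolate the vocals (writes out/song/vocals.wav + accompaniment.wav).
subprocess.run(
    ["spleeter", "separate", "-p", "spleeter:2stems", "-o", "out", song_file],
    check=True,
)

# 2) Look up the lyrics on Genius ("GENIUS_TOKEN" is a placeholder).
genius = lyricsgenius.Genius("GENIUS_TOKEN")
song = genius.search_song("Some Title", "Some Artist")
with open("lyrics.txt", "w") as f:
    f.write(song.lyrics)

# 3) Align lyrics against the ORIGINAL polyphonic audio. This script
#    must run inside the project's Singularity image; see the README.
subprocess.run(["./RunAlignment.sh", song_file, "lyrics.txt", "aligned.txt"],
               check=True)
```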
It works pretty great! I started wondering, though... since Spleeter is fairly new and seems to work really well... have you considered training a new model on just the isolated vocals? Would that give even more accurate results?
In other words... isolate the vocals by processing all of the songs in the dataset with Spleeter first, and then train the same way you did before (but using just the isolated vocals instead of the original polyphonic audio). And of course, when running the alignment, be sure to pre-process the input using Spleeter (or assume the input is already isolated vocals from Spleeter).
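The dataset-preprocessing half of that idea is a short loop with Spleeter's Python API; the directory paths here are invented for illustration:

```python
# A sketch of the proposed dataset preprocessing: run every training
# song through Spleeter once and keep only the vocal stems for training.
from pathlib import Path
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")  # vocals + accompaniment

for song in sorted(Path("dataset/songs").glob("*.wav")):
    # Writes dataset/vocals/<song name>/vocals.wav (and accompaniment.wav).
    separator.separate_to_file(str(song), "dataset/vocals")
```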
What do you think? Is this a crazy idea?