deezer / spleeter

Deezer source separation library including pretrained models.
https://research.deezer.com/projects/spleeter.html
MIT License

[Discussion] Here is a bash script separating longer files with limited RAM #391

Closed amo13 closed 4 years ago

amo13 commented 4 years ago

I present to you a bash script to separate an audio file by splitting it into parts, separating them one by one and finally joining them back together:


Here was the first version of the script, which would lead to cracks every 30 s in the resulting stems. Find the new version of the script in the post below.


The script is for Linux and assumes that you have (mini)conda installed. Feel free to use it and to modify it to your needs!
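In essence the script does something like the following (a minimal sketch of the idea, not the attached script; the 30-second chunk length, file names and the 5stems model are just for illustration):

```bash
#!/bin/bash
# Minimal sketch: split -> separate chunk by chunk -> join.
# Assumes ffmpeg and the spleeter 1.x CLI are on the PATH.
FILE="$1"
NAME="${FILE%.*}"

# 1. convert to wav and split into 30-second chunks
ffmpeg -i "$FILE" "$NAME.wav"
ffmpeg -i "$NAME.wav" -f segment -segment_time 30 -c:a pcm_s16le "$NAME-%03d.wav"

# 2. separate each chunk on its own, so only one chunk is held in memory at a time
for part in "$NAME"-[0-9]*.wav; do
    spleeter separate -i "$part" -o separated -p spleeter:5stems
done

# 3. concatenate e.g. the vocal stems of all chunks back into one file
printf "file '%s'\n" separated/"$NAME"-[0-9]*/vocals.wav > vocals-list.txt
ffmpeg -f concat -safe 0 -i vocals-list.txt -c copy vocals-full.wav
```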

The following problem is now solved: Still, I need help figuring out the last piece of the puzzle: when joining the separated parts back together into full-length stems, you can hear a crack in the audio wherever the original audio was split and the parts were joined back together. Here is the visualized symptom: Screenshot from 2020-05-22 17-20-35

Does anyone have an idea how to fix this?

mickdekkers commented 4 years ago

@amo13 thanks for sharing 😄

I encountered the same issue and asked about it here: https://github.com/deezer/spleeter/issues/162#issuecomment-562888908. One of the Spleeter collaborators kindly answered:

You can probably do a bit of overlapping between chunk with a smooth windowing to avoid the type of boundary artefacts you're experiencing

I haven't tried the suggested solution yet as I haven't worked on my project that uses Spleeter for a while. Unfortunately I'm also not familiar enough with audio processing to judge how difficult it would be to implement this.
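If someone wants to experiment with it, the crossfade part could probably be approximated with ffmpeg's acrossfade filter on two stems that were separated with a couple of seconds of overlap, something like this (untested; durations and file names are made up):

```bash
# Crossfade two adjacent vocal stems that were cut with ~2 seconds of overlap,
# using a triangular window on both sides of the junction.
ffmpeg -i vocals-part1.wav -i vocals-part2.wav \
       -filter_complex "acrossfade=d=2:c1=tri:c2=tri" vocals-joined.wav
```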

amo13 commented 4 years ago

Yes, this idea also came to my mind, but it would mean processing everything twice while splitting at different offsets and then programmatically doing a lot of cutting and joining and praying that the pieces will fit. I haven't tried it yet since it sounds pretty tedious to implement... but I was wondering if somebody would come up with a more practicable and resource-friendly idea.

Thanks for the hint though!

amo13 commented 4 years ago

Since I couldn't get this out of my mind, I did the tedious fighting with bash and ffmpeg...

The result: you can feed an audio file of any length into the script and the whole process is not going to eat more than 2 GB of RAM. I think for me it was around 1.6 GB.

How it works:

• Converts the input file to wav and splits it into 30-second parts.
• Runs spleeter on each part one by one, so only one part has to be held in memory at a time.
• Concatenates the separated parts back into full-length stems and converts them back to the input format at the end.
• To avoid the cracks at the junctions, processes the file a second time at an offset and replaces about 3 seconds around each junction with audio from that second pass.

Downside:

• Processes the audio twice with spleeter.
• The result is not 100% accurate: on a 3m30s track the stems were around 200ms too long. I am not sure about what exactly caused the 200ms error for me. I was suspecting ffmpeg being inaccurate when splitting and joining, but I don't really know. Anyway, the resulting stems are totally acceptable.

Use the script by putting your audio file in the same folder and calling ./separate.sh audio.mp3. You might need to comment out the first two lines depending on your conda installation.

Here is the working version of the script: separate.zip (edit: see the post below for the newer version).

mickdekkers commented 4 years ago

@amo13 thank you so much for staying on this and sharing your solution! 😄 This is a huge help to people trying to run Spleeter on memory-constrained systems. In my case, this lets me move forward with my personal project of running Spleeter as a pre-processing step in a voice transcription pipeline to recognize samples of famous speeches in music.

Regarding

  • The result is not 100% accurate: on a 3m30s track the stems were around 200ms too long. I am not sure about what exactly caused the 200ms error for me. I was suspecting ffmpeg being inaccurate when splitting and joining, but I don't really know. Anyway, the resulting stems are totally acceptable.

I remember coming across an issue that mentioned differing input and output durations. Perhaps the answer given there might also apply here?

Small variations of duration between WAV and MP3 versions of a signal are not uncommon and are probably just the result of some padding of zeroes to fit the length to a multiple of the MDCT size.

I haven't inspected the source code, but it could be that Spleeter performs some format conversion implicitly. Are the input files you tested MP3? The difference in duration documented in #96 was about 30ms, which is quite a bit less than 200ms. Still, perhaps splitting the audio exacerbated the effect. This is just a semi-educated guess, I'm no expert 😉

In any case, thanks again!

mickdekkers commented 4 years ago

Processes the audio twice with spleeter

I wonder if we could minimize the duplicate work that has to be performed without significantly degrading the final quality. Given that the script only replaces 3 seconds of audio around the cracks, could we make spleeter re-do just those 3 seconds for each crack, with a few more seconds for context?

mickdekkers commented 4 years ago

Linking #155 for reference. There's some interesting discussion there that also touches on splitting/stitching the audio.

amo13 commented 4 years ago

Processes the audio twice with spleeter

I wonder if we could minimize the duplicate work that has to be performed without significantly degrading the final quality. Given that the script only replaces 3 seconds of audio around the cracks, could we make spleeter re-do just those 3 seconds for each crack, with a few more seconds for context?

Yes of course, you are absolutely right. I should have thought of this! I guess cutting out seconds 28 to 32, 58 to 62, etc., "spleeting" them and using only seconds 29 to 31, etc. should improve the overall processing time. Since the script is already functional, I can't promise I'll implement this, but if I do, I'll post it here again.
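For a single junction at the 30-second mark, that could look roughly like this (untested sketch; file names and the 5stems model are just placeholders):

```bash
# Cut a 4-second window around the junction at t=30s (seconds 28-32),
# separate only that window, then keep the middle 2 seconds (29-31) as a patch.
ffmpeg -ss 28 -t 4 -i "$NAME.wav" junction-30.wav
spleeter separate -i junction-30.wav -o separated-junctions -p spleeter:5stems
ffmpeg -ss 1 -t 2 -i separated-junctions/junction-30/vocals.wav vocals-patch-30.wav
# vocals-patch-30.wav now corresponds to seconds 29-31 of the full vocal stem
# and can be spliced in over the crack in the concatenated output.
```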

Also, I didn't compare the processing times with and without the script, but I'm kind of curious.

mmoussallam commented 4 years ago

Hi @amo13 very interesting work!

Regarding the cracks, I wonder if this is not related to #392, which we are currently investigating. If that's the case, could you try using the -B tensorflow option just to check?
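For reference, that just means adding the flag to the separate call, e.g. (paths here are placeholders):

```bash
spleeter separate -B tensorflow -i audio_part.wav -o separated
```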

amo13 commented 4 years ago

Hi @amo13 very interesting work!

Regarding the cracks, I wonder if this is not related to #392, which we are currently investigating. If that's the case, could you try using the -B tensorflow option just to check?

You are right, it fixes the issue. Thank you. I'll drop an updated version of my script here later, gotta go to work for now ;)

amo13 commented 4 years ago

Ok, so here is the new version of the script using the tensorflow backend. It processes the input only once and has no cracks in the sound. separate.zip

Edit: The script has seen many improvements in the meantime (special thanks goes to @redbar0n) and is now available in its own repo here!

Shupacabras commented 4 years ago

Hi all

I had to comment out these lines


# concatenate the parts and convert the result to $EXT
#ffmpeg -i separated/"$NAME"/vocals.wav separated/"$NAME"/vocals.$EXT
#ffmpeg -i separated/"$NAME"/drums.wav separated/"$NAME"/drums.$EXT
#ffmpeg -i separated/"$NAME"/bass.wav separated/"$NAME"/bass.$EXT
#ffmpeg -i separated/"$NAME"/piano.wav separated/"$NAME"/piano.$EXT
#ffmpeg -i separated/"$NAME"/other.wav separated/"$NAME"/other.$EXT

because they deleted the separate files, and also comment out these


# clean up
#rm separated/"$NAME"/vocals.wav
#rm separated/"$NAME"/drums.wav
#rm separated/"$NAME"/bass.wav
#rm separated/"$NAME"/piano.wav
#rm separated/"$NAME"/other.wav

which I consider a redundant operation.

Sorry for my bad English.

avindra commented 4 years ago

Nice work @amo13. If I can make one suggestion: rather than doing the work of splitting up the input files, one can use the -s (aka --offset) option of the separate command. Source:

https://github.com/deezer/spleeter/blob/243b3236adaf0101b3c4ffd5ad37c2c4c731b04f/spleeter/commands/__init__.py#L54-L60

This way, you can process a single file iteratively using spleeter, rather than splitting it up manually beforehand.
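A rough sketch of that idea (untested; it also assumes the -d/--duration option, steps through the file in 30-second windows, and writes each window to its own output folder so the stems don't overwrite each other):

```bash
FILE="$1"
STEP=30
# total input duration in (fractional) seconds
TOTAL=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$FILE")
OFFSET=0
while (( $(echo "$OFFSET < $TOTAL" | bc -l) )); do
    # separate only a 30-second window starting at $OFFSET
    spleeter separate -i "$FILE" -s "$OFFSET" -d "$STEP" -o "separated/offset-$OFFSET"
    OFFSET=$((OFFSET + STEP))
done
```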

geraldoramos commented 4 years ago

@amo13 Hey there, thanks a lot for providing your stitching code example. It gives good results, but the transition between the 30-second blocks is noticeable; it looks like FFmpeg is adding a small gap between concatenations. Do you experience this as well? Or is this something related to my setup?

Here is an example (I used your code to separate the tracks, then mixed all the tracks back together using Logic).

Logic interface showing the small gap every ~30s

Here is an audio sample demonstrating the gap (you should notice it between 2 and 3 seconds)

https://www.dropbox.com/s/1f0qz92yaoqedhl/gap.mp3?dl=0

PS: I've used your most recent code example.

Thanks!

amo13 commented 4 years ago

Without testing your example on my setup, I remember that I also had tiny bits of something added at every junction: the split, separated and joined audio was always a tiny bit longer than the original. It is certainly not perfect, but I am fairly sure I could not notice it by ear. I did the listening test on different music tracks and movie trailers from YouTube; I needed to zoom in very close to see the problem, but I was never able to hear it.

amo13 commented 4 years ago

@geraldoramos I just listened to your audio sample and I can confirm that I do not get gaps like that with my script. I have no idea why you would get those though...

mickdekkers commented 4 years ago

@geraldoramos are the files you're concatenating MP3s? If so, these answers on Stack Overflow may explain what's happening:

That first answer also links to this page, which goes into more detail (emphasis mine):

Most lossy audio compression schemes add a small amount of silence to both ends of the audio. Due to the introduction of such gaps, the duration of the output is slightly increased. Silence at the beginning is called delay and silence at the end is padding.

If the amount of encoder delay and padding are not all accurately accounted for, the encoded silence will be decoded together with the audio data, creating gaps at the ends of the track. Likewise, if the decoder delay is not accounted for, the gap at the end will be further enlarged.

This issue is technical but also standards-related. The popular MP3 standard, for example, defines no way to record the amount of delay or padding for later removal. Encoder delay may vary from encoder to encoder, making automatic removal difficult.

If this is in fact what's happening, I think you should be able to eliminate the gaps by using a lossless format (e.g. FLAC) or a lossy format with "gapless encoding" support for the segments and concatenating those. If I'm reading that page right, I also think you can safely convert the file to MP3 after concatenating, if that's the output format you want.
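In practice that could mean keeping the segments lossless until the very end, roughly like this (file names are placeholders):

```bash
# Concatenate lossless FLAC segments first, then convert to MP3 once at the end.
printf "file '%s'\n" segment-*.flac > segments.txt
ffmpeg -f concat -safe 0 -i segments.txt -c:a flac vocals-full.flac
ffmpeg -i vocals-full.flac -c:a libmp3lame -q:a 2 vocals-full.mp3
```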

amo13 commented 4 years ago

I also use mp3 as the input format, just stuff downloaded with youtube-dl to mp3. The script converts the input mp3 to wav and works with wav from there on: splitting into parts, separating with spleeter, putting the pieces back together, and only at the very end converting back to the input format, e.g. mp3.
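So in essence, roughly (just illustrating the order of the conversions, not quoting the script):

```bash
ffmpeg -i input.mp3 input.wav      # decode to wav once at the start
# ... split, run spleeter on the parts, concatenate the stems (all wav) ...
ffmpeg -i vocals.wav vocals.mp3    # encode back to the input format only at the very end
```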

geraldoramos commented 4 years ago

Thanks, guys! @amo13 and @mickdekkers

I've replaced FFmpeg with sox "$FILE" "$NAME-.wav" trim 0 30 : newfile : restart to test it out, but the gap is still there; maybe it's happening in the concat phase and not in the split.
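(For the concat phase, the sox equivalent would just be listing the parts in order, e.g.:)

```bash
# concatenate all 30-second chunks back into one file, in lexical order
sox "$NAME-"*.wav joined.wav
```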

I've tried mp3, wav, flac, and all end up with the same issue. It's super weird that @amo13 is not having the same issue with his setup.

I'm using Docker with miniconda (FROM continuumio/miniconda3:4.8.2).

I will keep digging and will let you guys know if I make any progress.

@amo13 if you can send an output example from your end it would be awesome, so I can analyze the audio as well and compare it with mine.

amo13 commented 4 years ago

@geraldoramos Ok, so I downloaded this trailer from YouTube as an mp3: zizek-trailer.zip. Then I called my script with this mp3 as argument and got this vocal stem in return: vocals.zip. It actually is around 0.08s (80ms) longer, but without clicks, peaks or anything else actually noticeable by ear. The comparison of the input and output (vocal stem) waveforms looks like this: Screenshot from 2020-06-29 15-47-20

I hope it helps figuring out the issue =)

geraldoramos commented 4 years ago

Hi @amo13, after examining your output audio (vocals.zip), it seems like it also has the small millisecond gap (which I can notice if listening carefully). Please take a look at the waveform with a lot of zooming.

I can notice that the audio has a hiccup when transitioning from second 30 to 31. This is exactly the same gap I found in my experiments. It's more noticeable with music and if you mix all the processed stems back together and play it like the original.

Looks like this extra space is being added by Spleeter. I've tried splitting a song into 30-second chunks without the Spleeter part and connecting it back using sox and FFmpeg, using wav files. For those, I see perfect stitching without this mini gap. This makes me think that the Spleeter processing is somehow adding this very small padding at the end of each stem.

I'm always using lossless files (wav) everywhere during these tests, which should prevent the things @mickdekkers pointed out.

Let me know what you think.

Geraldo

I've tried the latest Spleeter version as well as 1.4.8, and it is happening on both.

amo13 commented 4 years ago

@geraldoramos, good that you took a closer look, I guess you are right. Using only wav is smart and certainly prevents potential mp3 issues with padding. As for the added padding time, I agree: the only possible conclusion is that spleeter is adding it somehow. I am using version 1.5.2 (py37hc8dfbb8_0) from conda-forge.

amo13 commented 4 years ago

You might want to open a new issue and reference the last parts of this discussion here. Would you please link to this issue if you do so?

geraldoramos commented 4 years ago

Sounds good, will do it shortly!

geraldoramos commented 4 years ago

@amo13 Bug reported here: #437

redbar0n commented 3 years ago

The latest version of this bash script can now be found at: https://github.com/amo13/spleeter-wrapper

redbar0n commented 3 years ago

@avindra

Nice work @amo13. If I can make one suggestion: rather than doing the work of splitting up the input files, one can use the -s (aka --offset) option of the separate command. Source:

https://github.com/deezer/spleeter/blob/243b3236adaf0101b3c4ffd5ad37c2c4c731b04f/spleeter/commands/__init__.py#L54-L60

This way, you can process a single file iteratively using spleeter, rather than splitting it up manually beforehand.

I've thought a bit about this, and it looks like this would be sub-optimal:

and against Spleeter's recommendation to batch process:

Quotes from: https://github.com/deezer/spleeter/wiki/2.-Getting-started#batch-processing
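(If I read the wiki right, the batch-processing recommendation boils down to passing several files to a single separate call so the model is only loaded once, e.g.:)

```bash
# one spleeter invocation, one model load, several input files
spleeter separate -i part-000.wav part-001.wav part-002.wav -o separated
```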

Correct me if I'm wrong.