deezer / spleeter

Deezer source separation library including pretrained models.
https://research.deezer.com/projects/spleeter.html
MIT License
25.76k stars 2.82k forks

[Discussion] Spleeter for real-time applications #276

Open gvne opened 4 years ago

gvne commented 4 years ago

Hi! This issue is a follow-up of a gitter discussion. I'm developing a C++ port of spleeter here as a side project. My goal is to give people the opportunity to use the spleeter technology within plug-ins. Lately, I realized that the architecture is built to run in batches of size 'T', which is 512 FFT frames in the pre-trained models (~12 seconds). That kind of latency isn't suitable for real-time processing. To check whether the architecture fits my needs, I need to evaluate the quality loss of changing that value to something lower. My plan is to train models for 2 and 4 stems with multiple values of T and compare their quality.

I created a repository to report on that, as I assume it may be of interest to others. And also to get some help in the process.

I already have a couple of questions. First: I am using MUSDB18. I noticed that the training configuration needs a description of the evaluation set. Considering that I will be using that set to evaluate the trained models, I would like to make sure that training does not use the validation set, as that would mean I am evaluating on data the network already knows. Is the evaluation_csv parameter used during training?

Second: I am struggling with computation power. For my first test, it took almost 4 hours to run 100 000 steps on a p2.xlarge AWS instance (GPU: Tesla K80). Is this expected? Do you think that, considering my problem, I could lower train_max_steps?

Third: This one is more of a note than a discussion, but I ran the evaluation using python -m spleeter evaluate -p spleeter:2stems --mus_dir [exported db dir] -o [local path] and got NaNs. Is the evaluation system broken at the moment? I changed it in a fork here to fix it for my case. Should I open a pull request for that?

Thank you for your help and for publishing your work!

aidv commented 4 years ago

How many files are in the MUSDB18 dataset?

I have a dataset of 300 files and I can train it in 6-7 hours on an RTX2080.

romi1502 commented 4 years ago

Hi @gvne!

First, be aware that you don't necessarily need to retrain the system. You can change the size 'T' at test time as long as it remains a multiple of 64 (which corresponds to 1.5s). That will change performance a bit, especially at segment borders, but if you want to do realtime (and thus causal) separation, you won't be able to take advantage of future audio samples anyway, so you'll need to cope with it. You can straightforwardly evaluate spleeter with a shorter T just by updating the corresponding config JSON file.
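As a sketch, that config tweak could look like the following (the key names mirror spleeter's shipped JSON configs, but the exact path depends on which pretrained model is used, so treat this purely as an illustration):

```python
import json
import os
import tempfile

# Stand-in for e.g. configs/2stems/base_config.json; the real file has many
# more keys. "T" is the inference window length in STFT frames (512 by default).
cfg_path = os.path.join(tempfile.mkdtemp(), "base_config.json")
with open(cfg_path, "w") as f:
    json.dump({"T": 512, "F": 1024, "n_fft": 4096, "frame_length": 4096}, f)

# Shrink the window: any multiple of 64 works at test time (64 frames ~ 1.5 s).
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["T"] = 64
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)
```

Pointing spleeter at the edited config instead of the bundled one is then enough; no retraining is involved.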

If you want to go below T=64, you'll need both to retrain the system and to change the architecture of the network (reduce the number of layers) and you won't be able to take advantage of the data we used for training. And I suspect that you may not be able to use much shorter segments without strongly degrading separation results.

What I would recommend is to perform overlap-and-add with a short T that does not need retraining (for instance T=64). You would then process 1.5s of past audio and output only the current frame of audio, as short as you want, its size mainly set by the computation power of your system. Do you have an idea of how many T-long segments you can process per second? (Note that doing the STFT/iSTFT outside tensorflow may make things faster on CPU.)
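In outline, the suggested overlap-and-add scheme might look like this (`separate_frames` is a hypothetical stand-in for the actual model call, and the 2049 bins assume spleeter's 4096-point FFT):

```python
import numpy as np

T = 64         # inference window in STFT frames (~1.5 s at hop 1024 / 44.1 kHz)
HOP = 8        # frames emitted per call; smaller means lower latency, more compute
N_BINS = 2049  # 4096-point FFT -> 2049 frequency bins

def separate_frames(window):
    """Hypothetical stand-in for the model: takes (T, N_BINS) magnitudes,
    returns a soft mask of the same shape."""
    return np.full_like(window, 0.5)

history = np.zeros((T, N_BINS))  # rolling buffer of the last T spectrogram frames

def push_frames(new_frames):
    """Feed HOP new frames; return separated output for just those frames."""
    global history
    history = np.concatenate([history[len(new_frames):], new_frames])
    mask = separate_frames(history)
    # Only the newest frames are emitted; older ones were output on earlier calls.
    return history[-len(new_frames):] * mask[-len(new_frames):]

out = push_frames(np.random.rand(HOP, N_BINS))
```

The iSTFT of `out` would then feed the audio output, while `HOP` trades latency against how often the fixed-cost T-frame inference must run.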

Regarding your questions: 1) the validation_csv is used for monitoring validation cost. While it has no impact on the final model (there is no early stopping mechanism), it is useful to check that your model is not overfitting (which would result in the validation cost going up at some point). The easiest way to visualize it is to use tensorboard.

2) Training is quite intensive. 4 hours for 100000 steps does not seem that much to me. You can monitor the GPU you're using to make sure it is actually well exploited.

3) The evaluation system is not supposed to be broken. I need to check it then :).

gvne commented 4 years ago

Hey @romi1502, nice to read from you around here :)

You can change the size 'T' at test time as long as it remains a multiple of 64

Well, I surely didn't get that! I was sure it was a training parameter that configured the network input size. My bad. At least it means I don't need to retrain the models, which is very good news. Those AWS GPU instances aren't cheap...

If you want to go below T=64, you'll need both to retrain the system and to change the architecture of the network (reduce the number of layers) and you won't be able to take advantage of the data we used for training. And I suspect that you may not be able to use much shorter segments without strongly degrading separation results.

That matches what mmoussallam said on gitter. I'll keep that option for later. I'm no Deep Learning expert and that's not what I was expecting to do when I started the project.

What I would recommend is to perform overlap-and-add with a short T that does not need retraining (for instance T=64). You would then process 1.5s of past audio and output only the current frame of audio, as short as you want, its size mainly set by the computation power of your system. Do you have an idea of how many T-long segments you can process per second? (Note that doing the STFT/iSTFT outside tensorflow may make things faster on CPU.)

I'll try that out! I think it's the right first step anyway. I'll keep the T and the overlap rate between windows of T frames as high-level parameters. I want this tool to work on any machine, so I can't really force a user into a specific setup.

Thank you so much for your time. You answered my questions perfectly! I'll close the issue for now.

gvne commented 4 years ago

Hi there, just wanted to follow up on this matter. I followed @romi1502's suggestion and successfully implemented a very simple volume-control VST3 plugin running in real time with spleeter! You can find the plug-in code right here. I also provide a prebuilt binary for OSX (tested on 10.14 and 10.15) here.

As expected, the latency is the worst part: 64 frames for spleeter plus a couple of extra frames to leave enough time for the process to run properly (set to 10 in the prebuilt version, if I'm not mistaken). That leads to a latency close to 2 seconds... But still, playing with those sliders is so much fun! 😃
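For the record, the ~2 second figure is consistent with spleeter's defaults, assuming the usual hop size of 1024 samples at 44.1 kHz:

```python
# Back-of-envelope latency check, assuming spleeter's default frame_step (1024)
# and a 44.1 kHz sample rate.
SR = 44100
HOP = 1024
T = 64       # spleeter inference window, in STFT frames
EXTRA = 10   # extra safety frames so inference has time to finish

latency_s = (T + EXTRA) * HOP / SR
print(f"{latency_s:.2f} s")  # ~1.72 s
```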

As a side note, I haven't released a new spleeterpp version that includes the online processing yet (the code is available in the develop branch though). I still need to update the documentation with details about the algorithm; there are quite a few parameters after all. I also need to verify that its output matches the classic offline process.

Anyway, thank you once again for releasing your work. If you ever have further suggestions to improve this integration, I'd be very happy to read them !

aidv commented 4 years ago

@gvne I'm trying the VST3 file on MacOS on Studio One as my DAW, it does not seem to be recognized. Do you know what could be the issue?

Edit: Where can I find the development branch of spleeterpp?

Edit: Never mind, I thought GitHub would show it but it doesn't.

mmoussallam commented 4 years ago

Hi @gvne

That's just great! We'll take a look at the VST!

junh1024 commented 4 years ago

I think it might be better to go algorithmic instead of ML for realtime.

https://github.com/tachi-hi/euterpe/ is realtime for vox, but unsure of latency. (I think it's algorithmic) https://www.yellownoiseaudio.com/ is realtime for drum, latency is .5s. (maybe ML, idk)

aidv commented 4 years ago

@gvne if you have some time to assist me in getting vstSpleeter running on MacOS using JUCE, please do help me.

I've been trying to build it all day.

gvne commented 4 years ago

@aidv I answered in the issue you created on the vst repository. That is probably the best place to discuss build details, and it would help keep this issue a place for discussing potential algorithm improvements.

@junh1024 Thank you for your suggestion. I agree that, overall, the spleeter architecture isn't suitable for real-time applications. We discussed that with @mmoussallam on gitter a couple of weeks back. I still decided to give it a try, as I figure some people may not care that much about latency. And some others may accept it as long as it is free and open source 😄 My goal here really is to find the best approach to twist spleeter into a real-time application, even though it wasn't designed for it.

aidv commented 4 years ago

@gvne Yes I think that it's best for the project. Thank you.

spereree commented 4 years ago

Hi, I think this is really interesting & I applaud the work already done for real-time processing! As for the user's hardware, I suggest you let them decide (like how Isola Pro Fx did on their VST Gui) - Because some of us would value the benefit of this plugin enough to buy 1 (or 2, 3, etc) GPU's just to run it at a lower, usable latency (~50ms or less) ... For instance: Virtual DJ supports VST plugins/effects. I could pull this up on deck A playing a track, and fade out drums, bass, etc to have an INSTANT acapella version to mix it with another, say, INSTANT instrumental on deck B, creating a live mashup, etc... So no need of rendering files! :-) If I need 4x Radeon VII's to achieve this, I'll get them!!! So basically, thank you and keep up the good work! We are waiting for this to drop to usable latencies - Even if we need Titans & Threadrippers to get it done lol!

aidv commented 4 years ago

@spereree VST technology doesn't work like that.

The time it takes for data to move from the CPU to the GPU and back to the CPU is too long for it to work; that's why there are no GPU VSTs on the market.

So no amount of GPUs would fix that problem.

I have looked into @gvne's code and I have managed to get the latency down to 8ms, but I still experience a delay in my DAW.

I'm still inspecting the issue to try to find a good solution.

spereree commented 4 years ago

Oh, didn't know that about VSTs & GPUs, thanks for the clarification! So you managed to get it down to 8ms? That's great! What about the audio quality, how does it compare to non-realtime? Besides the DAW delay, were there any other issues you came across? Does CPU core count help? Can you post what you have done so far so I also give it a try on my DAW or VDJ? Many thanks again!

aidv commented 4 years ago

@spereree Audio quality is not great due to the model being used. It isn't using the 16kHz models, which I've asked @gvne to have a look at.

I have not yet done any tests on different machines with different core counts, but I'm assuming that multiple cores are not used for a single VST. Maybe I'm wrong. However, multiple VSTs will most likely be assigned to their respective cores.

Here's the VST: https://filebin.net/kjzyn56tqsm9r08a/spleeter-vst.vst3.zip?t=qmfhgehq

junh1024 commented 4 years ago

The 16kHz model is a fake extension: https://github.com/deezer/spleeter/issues/2#issuecomment-548798493 . I suggest removing the separated stem(s) (excl. others) from the input; that way you have unity.

DAWs may not support PDC properly, so it's important to say which DAW you use if you have latency problems. Another possibility is that the VST doesn't implement PDC properly. A third possibility: offline and online behaviors differ. I have seen possibilities 2 and 3 happen, and 1 is widely reported with Live 8 and earlier.

ED: there are VST algorithmic voice removers, https://www.cloneensemble.com/vt_main.htm , but it's 2005 tech so it sounds bad & has PDC/unity issues. I suggested https://github.com/tachi-hi/euterpe/ above which sounds better, but it's not VST.

aidv commented 4 years ago

@junh1024 I'm on Studio One.

Do you have any suggestions on how better PDC can be achieved in JUCE? Would love to implement a fix if possible.

I don't think that other algorithms are relevant here.

@gvne's project vstSpleeter is, in my opinion, a great proof of concept of how Tensorflow models can be turned into VSTs.

If anyone fails to see how this will revolutionize the music production industry, then take notes; you don't want to fall behind on this opportunity.

junh1024 commented 4 years ago

S1 should be OK. According to https://github.com/deezer/spleeter/blob/master/configs/5stems/base_config.json#L8 , the FFT size is 4096, so the latency should be 4096 samples, or 4096/44100 ≈ 93ms. Can you explain how you got it to 8ms?
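That arithmetic is easy to sanity-check: one full FFT frame has to be buffered before anything can come out, which comes to roughly 93 ms at 44.1 kHz:

```python
# Minimum algorithmic latency imposed by the analysis window alone.
SR = 44100
N_FFT = 4096  # "frame_length" in spleeter's base_config.json

latency_ms = 1000 * N_FFT / SR
print(f"{latency_ms:.0f} ms")  # ~93 ms
```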

I'm not experienced with PDC in JUCE/C++, but in JSFX, PDC is very manual. You set the PDC, the DAW gives you x samples in advance, and you delay your output, giving out one sample at a time.
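The manual scheme described above can be sketched independently of any plugin framework; this Python toy (not JUCE or JSFX API, just the idea) reports a fixed latency and delays its own output by the same amount:

```python
from collections import deque

class LatencyCompensator:
    """Toy model of manual PDC: report `latency` samples to the host, then
    delay the plugin's own output by exactly that amount so the processed
    audio lines up with the host's compensated timeline."""

    def __init__(self, latency_samples):
        self.latency = latency_samples             # value reported to the host
        self.buf = deque([0.0] * latency_samples)  # delay line, pre-filled with silence

    def process(self, sample):
        self.buf.append(sample)
        return self.buf.popleft()  # each input emerges latency_samples later

pdc = LatencyCompensator(4)
out = [pdc.process(x) for x in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
# out: four samples of pre-fill silence, then the delayed input begins
```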

I don't think it's a revolution, more like a workflow enhancement. BTW, iZotope had an AAX "RX7 Music Rebalance" in 2018 and a VST "Ozone 9 Master Rebalance" in 2019.

james34602 commented 4 years ago

Since the Spleeter U-Net is so simple, I did a fully rewritten version in C for real-time (continuous STFT buffering), lag-less (no frame drops on a modern CPU) VST inference. The initialization function allows the developer to choose the number of frames and the frequency-bin limit. https://github.com/james34602/SpleeterRT

Scylla2020 commented 4 years ago

I realise various companies have implemented real-time separation already, so I'm just wondering if they are using a variation of spleeter or whether there are other AI projects out there. I just tested the free new Virtual DJ version and it allows real-time separation, with tracks loading instantly. So surely there is a way to improve the current spleeter? My PC with a GPU takes quite a few seconds, but these new programs do it instantly; it even works on an iPad in the djay Pro app.

aidv commented 4 years ago

Probably just Spleeter with audio buffer streamed to a background service.

But I could be wrong.


james34602 commented 4 years ago

@aidv is right. SpleeterRT is continuously STFT-buffered and lag-less (no frame drops on a modern CPU), but not real-time algorithmically. @Scylla2020 I'll tell you what: true real-time is not possible without tricks. Even the commercial high-quality separators are not real-time; iZotope Music Rebalance probably uses large-buffer pre-caching to give you the illusion of a real-time process. As for what the tricks are, I can't reveal them.

Scylla2020 commented 4 years ago

@aidv @james34602 Oh I see. So this buffer trick would just have been reimplemented in different languages for the target environment? For example, Virtual DJ could have rewritten the functionality in C++ for speed and to avoid using Python?

Scylla2020 commented 4 years ago

So really there is no escaping/speeding up the real time it actually currently takes to separate a track from start to end?

james34602 commented 4 years ago

@Scylla2020 Avoiding Python is exactly the point. Besides dropping Python, SpleeterRT offers flexible intrinsic-latency adjustment (how many time frames the neural network uses for inference), and well-parallelized 4-stem inference.

The main advantage of a neural network in C is that it makes your software easier to debug and integrate; linking against and loading the libtensorflow DLL can be very slow. In SpleeterRT, the buffering and STFT system is already written for the real-time buffering that consistent processing requires.

Scylla2020 commented 4 years ago

@james34602 So SpleeterRT will take just as much time as Spleeter when processing a whole track? Or is it just for real-time applications? I'm not too familiar with C.

james34602 commented 4 years ago

@Scylla2020 In short, yes. SpleeterRT offers background processing, corresponding to Spleeter4Stems.c. If you use spleeter.c directly for offline inference, you may see slightly faster inference compared to tensorflow-cpu.

aidv commented 4 years ago

@Scylla2020 I haven't used the DJ software, but I suspect they are misusing the word "realtime".

Obviously as a DJ you don't stream any audio, you just load the file, right? The file is then processed through Spleeter and then the stems are made available.

It's not realtime; they just say that for marketing. It probably takes 30 seconds to process a 3-minute song on a laptop, so in theory it could be considered realtime (faster than realtime, actually), but I doubt they stream the audio buffer through Spleeter.

It just doesn't make any sense why they would do that. Just separate the stems once in one go and save the stems.

james34602 commented 4 years ago

@aidv @Scylla2020 The DJ software marketing "real time" is not great, but it is not wrong by definition. "Real time" first appeared in computing as a measure of meeting deadlines. However, that definition doesn't suit audio, because an audio system is streaming, meaning that if you can't provide enough samples for the system to process, it simply can't output anything. In Spleeter's case, if you don't provide enough T frames, the neural network won't infer anything.

junh1024 commented 4 years ago

@Scylla2020 I believe VDJ/djay are using classic FFT algorithmic techniques to do realtime separation, e.g. maybe things like this; also see my previous comments in this thread. It's been possible for a few years, but no one's interested because it's hard (it actually requires an understanding of instruments), whereas with ML you can quickly achieve results given enough GPUs.

Nothing spleeter- or ML-based; that's too slow for 4x songs on an iPad, and not full bandwidth.

RE: not realtime: well, all audio goes through buffers; I consider <1s without saturating one core realtime.

james34602 commented 4 years ago

@junh1024 The site you provided sounds like an HPSS algorithm, which is almost real time in the algorithmic sense.

On real-time-ness: I assume you understand "processing time" versus "algorithmic latency". Suppose a supercomputer can compute anything in the world in 1ns. If it runs an algorithm with an algorithmic latency of 3 seconds on a signal recorded from a microphone, it still needs at least 3 seconds of real-world clock time to output a result, no matter how fast the supercomputer is.

Scylla2020 commented 4 years ago

@aidv @junh1024 I doubt they're using old tools, as all this came about soon after spleeter, so I believe their solution is based on reimplementing spleeter differently. Also, the results sound exactly like spleeter. Since you mention the DJing software would have already loaded and processed the track into separate stems, there must be a new super-fast solution that's way better than the current spleeter. My computer takes about 10-15 seconds for a 3-minute track using spleeter-gpu, yet in Virtual DJ the separation seems almost instantaneous, and I can randomly skip to various sections of the track with no issues.

aidv commented 4 years ago

So, like I've said, I don't own that software and I'm only speculating, but chopping up the song into smaller pieces and processing only the parts relevant to where you're at in the timeline surely will feel instantaneous.

It’s not rocket science.

If 3 minutes takes you 10 seconds to process using Spleeter on the CPU, that's 18x faster than realtime, so while DJing you're probably looking at 2.5s before and 2.5s after the transport position.

So 5s in total processed through spleeter would take about 0.28 seconds, and that will feel instant.
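That speculation is easy to put in numbers (same figures as above: a 3-minute song separated in 10 seconds, and a 5-second window around the playhead):

```python
# If spleeter runs ~18x faster than realtime, separating only a short window
# around the DJ transport position is effectively instant.
song_s, process_s = 180, 10           # 3-minute song, 10 s to separate fully
speedup = song_s / process_s          # 18x realtime
window_s = 2.5 + 2.5                  # 2.5 s before + 2.5 s after the playhead

print(f"{window_s / speedup:.2f} s")  # ~0.28 s per window
```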


Scylla2020 commented 4 years ago

@aidv Ok, thanks, I got it now. It wasn't clear from your previous comment what you meant about loading; it did, at least to me, seem like you meant the whole song gets loaded at once. Everyone is just trying to learn here; no one said it's rocket science. Be gentle.

aidv commented 4 years ago

What I mean by "it's not rocket science" is that it's not as complicated as you might think. There is a very logical explanation for how they might have achieved their performance, thus no need to overthink it; hence, it's not rocket science.

It’s not even machine learning at that point.


james34602 commented 4 years ago

Also the results sound exactly like using spleeter.

@Scylla2020 Not necessarily. Given a large enough training dataset, most STFT-based BSS systems sound similar. You can check your DJ software's program files for the model file; its size will tell you whether it is actually Spleeter behind the scenes.

junh1024 commented 4 years ago

@Scylla2020

I doubt they using old tools as all this has come about soon after spleeter so I believe their solution is based on reimplementing spleeter differently.

They are making NEW tools, which I believe are based on OLD ideas, maybe something like https://github.com/tachi-hi/euterpe/ . Just because Y came after X, it doesn't mean that Y HAS to be based on X. As I said, I believe it's not ML, nor spleeter, because that's too slow, and you can't make it much faster. It has NOTHING to do with spleeter; it just happens to do the same thing. "A bus goes faster than a horse, maybe it's reimplementing a horse": no, it's a different thing. They just both happen to take people from A to B. An extreme example, but OK.

ED: the spleeter readme says some commercial implementations are based on spleeter, so I might be wrong.

rainOutside commented 4 years ago

Hi @gvne!

What I would recommend is to perform overlap-and-add with a short T that does not need retraining (for instance T=64). You would then process 1.5s of past audio and output only the current frame of audio, as short as you want, its size mainly set by the computation power of your system. Do you have an idea of how many T-long segments you can process per second? (Note that doing the STFT/iSTFT outside tensorflow may make things faster on CPU.)

Regarding the solution for short latency, I wonder what would happen if I take a long T (512) but use only the last vector of the mask to reconstruct one frame from the input spectrogram (this would occur every frame_step = 1024 samples). Of course this consumes a lot of CPU, but should it basically give the same performance? If it works, the latency would be frame_length/fs = 4096/44100 ≈ 93ms, plus processing time depending on the CPU/GPU. Actually, is there currently an overlap between the frames of the spectrogram (at inference)? Or does each input spectrogram produce a mask without depending on neighboring spectrograms?

james34602 commented 4 years ago


Quality will worsen, since the information you use is only from the past, and you waste a lot of computation that does not contribute to the output. The CPU usage would be impractical: approximately 20 minutes of processing time for a 1-minute song on CPU, or 1 minute of processing time on GPU for a 3-minute song.

When the processing time is far higher than the theoretical latency, what have you gained?

rainOutside commented 4 years ago

Hi @james34602 , Thanks for the quick reply.

What I'm trying to do is also implement Spleeter in real time with low latency. Can you tell me how much information from the future Spleeter currently takes? Is it half the length of the spectrogram (256 frames ≈ 6 sec)? I'm trying to understand it from the code, but it's a bit hard for me to follow the process since it's written in TensorFlow. Thank you very much.

james34602 commented 4 years ago


Hello. The default buffering mode assumes full future information, e.g. for T=64: 63 future frames, 1 present frame. It's not really about Tensorflow; Tensorflow behaves the same way as Matlab and many others here. T in Spleeter is the time step of the frame "jump".

rainOutside commented 4 years ago

@james34602 What you're saying is that the decision for an FFT frame (the present one) relies only on the future frames (the next 63 frames in your example), right?

It's a bit incomprehensible to me, because if the decision is for a frame at the edge of the spectrogram, why does it matter whether we take the remaining frames from the future (the 'right' side) or from the past (the 'left' side)? I thought the decision was for the middle frame, so that we utilize neighboring frames from both the past and the future (I mean, for T=64, the decision is for the 32nd frame, utilizing the next 32 frames and the previous 31 frames).

T in Spleeter is the time step of the frame "jump".

I think T means the number of FFT frames in the spectrogram, not the "jump". T could be 512 with a jump of one frame, no? Or am I missing something?

Thanks

james34602 commented 4 years ago


T = 64 is one "jump" in my definition (I'm not a trained English speaker): 64 spectral frames = one shot of inference.

You can experiment yourself; the split between future and past doesn't matter much. The system is non-causal; that is how the network was trained, with the left-hand side and the right-hand side processed all at once. What you do is assume the right-hand-side frame is the present, so you get a real-time response, which is true; but, like I said, the network uses information from the left-hand side to contribute to the right-hand side, and that is the only information remaining inside your real-time system.

What can your neural network do when the left-hand side is empty (black)? This is why quality must drop: there is no (or not enough) non-causal information contributing to the demixing. It is very common to end up with an empty stem track.

Or, what if the left-hand-side spectral image comes from a different vocal singer, so the past spectral distribution no longer resembles the present? This happens when you change songs, when a song switches vocalists, or under any change in a stem track's spectral distribution.

It's no surprise that your proposal works. I have collected a few papers in Chinese describing causal vs. non-causal inference for blind source separation; they consistently show that causal-frame results are worse than non-causal ones. I would guess that past information, accumulated for as long as the age of the universe, would ultimately approach "perfect" separation relative to the non-causal counterpart.
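The "treat the right-hand edge as the present" idea can be made concrete with a small sketch (the numbers assume a 4096-point FFT, i.e. 2049 bins, and are illustrative only):

```python
import numpy as np

T = 64         # inference window in STFT frames
N_BINS = 2049  # 4096-point FFT -> 2049 frequency bins

def causal_window(frames):
    """Build the network input for real-time use: the newest frame sits at the
    right-hand edge and the other T-1 slots hold past frames, zero-padded
    ("black") at startup. The model then only ever sees past context."""
    window = np.zeros((T, frames.shape[1]))
    n = min(T, len(frames))
    window[T - n:] = frames[-n:]  # most recent frames fill the right-hand side
    return window

frames = np.random.rand(10, N_BINS)  # only 10 frames observed so far
w = causal_window(frames)
```

With a mostly empty (zero) left-hand side, the network gets little of the non-causal context it was trained on, which is exactly where the quality drop comes from.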

Caroanabas commented 3 years ago

How about this realtime project? I'm really hooked on it. Thanks so much.