deezer / spleeter

Deezer source separation library including pretrained models.
https://research.deezer.com/projects/spleeter.html
MIT License

[Discussion] Why Two Columns returned by Waveform based Separation? #849

Open s2t2 opened 1 year ago

s2t2 commented 1 year ago

Could you please provide more context about the two columns returned by the raw waveform based separation method?

I noticed there are two columns for each stem, and this is consistent across 2, 4, and 5 stem models.

For example, the vocals are returned in two columns. When I play the data in the first column, it sounds like the vocals. When I play the data in the second column, it also sounds like the vocals. However, their values are slightly different.

So why are there two columns? What is the difference between their values? If we want to represent the vocals, should we use the first column, or second column, or both, or an average?

Thanks!

import numpy as np
from spleeter.separator import Separator

n_stems = 5
model_name = f"spleeter:{n_stems}stems"
sep = Separator(model_name)

# audio_data: waveform array of shape (n_samples, n_channels); loading it is not shown here
splits = sep.separate(audio_data)
print(splits.keys()) #> ['vocals', 'piano', 'drums', 'bass', 'other']

vocals = splits["vocals"]
print(vocals.shape) #> (661504, 2)

       vocals_0  vocals_1
0     -0.006651 -0.006851
1     -0.006996 -0.007341
2     -0.008933 -0.009481
3     -0.010609 -0.011479
4     -0.009787 -0.010852
...         ...       ...
66145  0.036674  0.036603
66146  0.016356  0.016560
66147 -0.001799 -0.001414
66148 -0.011917 -0.011444
66149 -0.016360 -0.015921

biendltb commented 1 year ago

@s2t2 The two columns are the left and right channels of the stereo output. The Spleeter model is trained on stereo input.

You can use both channels to represent the vocals, or you can average them to get a mono signal.
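
The averaging suggested above can be sketched with NumPy as follows. This is a minimal illustration, not Spleeter's own code: `to_mono` is a hypothetical helper, the small array stands in for `splits["vocals"]` using the first three rows printed earlier in the thread, and treating column 0 as left and column 1 as right is an assumed convention that the thread does not confirm.

```python
import numpy as np


def to_mono(stem: np.ndarray) -> np.ndarray:
    """Average the channel axis of a (n_samples, n_channels) array (hypothetical helper)."""
    return stem.mean(axis=1)


# Stand-in for splits["vocals"]: the first three rows shown in the question.
vocals = np.array(
    [
        [-0.006651, -0.006851],
        [-0.006996, -0.007341],
        [-0.008933, -0.009481],
    ]
)

# Assumption: column 0 is the left channel and column 1 is the right channel.
left, right = vocals[:, 0], vocals[:, 1]

# One value per sample instead of two.
mono = to_mono(vocals)
print(mono.shape)
```

Keeping both columns preserves the stereo image, while the averaged array is a single-channel signal of the same length.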

s2t2 commented 1 year ago

@biendltb thanks for the info!

shxnzxxn commented 12 months ago

I compared the values obtained in two ways: using the separate_to_file method and reloading the file with librosa, versus using the separate method and averaging the two vocals columns. I found that the values are not the same.

I wonder why that is.
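
One way to make the comparison described above concrete is to measure how far apart the two arrays are. The sketch below is only an illustration under stated assumptions: `compare_waveforms` is a hypothetical helper, the data is synthetic, and the 16-bit rounding step is an assumed stand-in for a file round trip rather than a confirmed cause. When trying this on real output, it may also be worth checking that `librosa.load` is called with `sr=None`, because by default it resamples to 22050 Hz and downmixes to mono.

```python
import numpy as np


def compare_waveforms(a: np.ndarray, b: np.ndarray) -> dict:
    """Summarize how two 1-D waveforms differ (hypothetical helper)."""
    same_length = a.shape[0] == b.shape[0]
    n = min(a.shape[0], b.shape[0])  # compare only the overlapping samples
    diff = np.abs(a[:n] - b[:n])
    return {
        "same_length": same_length,
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
    }


# Synthetic stand-in for the averaged output of separate().
rng = np.random.default_rng(0)
in_memory = rng.uniform(-1.0, 1.0, size=(1000, 2)).mean(axis=1)

# Assumption for illustration only: mimic a 16-bit file round trip by rounding.
reloaded = np.round(in_memory * 32767) / 32767

report = compare_waveforms(in_memory, reloaded)
print(report)
```

A differing length usually points to a sample-rate mismatch, while equal lengths with a very small maximum difference are more consistent with precision lost when writing the file.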