deezer / spleeter

Deezer source separation library including pretrained models.
https://research.deezer.com/projects/spleeter.html
MIT License

[Discussion] About training models myself #740

Open xiebruce opened 2 years ago

xiebruce commented 2 years ago

Below is the content from here, which I've read.

Train model

For training your own model, you need:

spleeter train -p configs/musdb_config.json -d </path/to/musdb>

From the command above, I notice that I need to provide a config file and a path to the musdb dataset.


Question 1

A musdb_config.json file looks like below (copied from here):

{
    "train_csv": "configs/musdb_train.csv",
    "validation_csv": "configs/musdb_validation.csv",
    "model_dir": "musdb_model",
    "mix_name": "mix",
    "instrument_list": ["vocals", "drums", "bass", "other"],
    "sample_rate": 44100,
    "frame_length": 4096,
    "frame_step": 1024,
    "T": 512,
    "F": 1024,
    "n_channels": 2,
    "n_chunks_per_song": 40,
    "separation_exponent": 2,
    "mask_extension": "zeros",
    "learning_rate": 1e-4,
    "batch_size": 4,
    "training_cache": "cache/training",
    "validation_cache": "cache/validation",
    "train_max_steps": 200000,
    "throttle_secs": 1800,
    "random_seed": 3,
    "save_checkpoints_steps": 1000,
    "save_summary_steps": 5,
    "model": {
        "type": "unet.unet",
        "params": {
            "conv_activation": "ELU",
            "deconv_activation": "ELU"
        }
    }
}

But where can I get the full documentation for it? For example, what do T and F mean? For instrument_list, can I use only ["vocals", "other"]? Where can I get full documentation of all these config options?


Question 2

I've downloaded musdb from musdb18.zip and extracted the zip file. I found that it is a folder containing 2 folders: train and test (see the screenshot below).


Inside the train and test folders, all files are mp4 (mp4 is used instead of mp3 or aac because mp4 can contain more than one track).


I've listened to the mp4 files in the train and test folders of musdb18.zip, and there seems to be no difference between them: they are all songs.

So in my understanding they are interchangeable. Assuming I have 150 song files, can I choose 100 of them for training and the rest for validation?
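If an arbitrary split really is fine, it could be sketched like this (a minimal illustration; the filenames are made up, not from the actual dataset):

```python
import random

# Hypothetical filenames standing in for the 150 songs.
songs = [f"song_{i:03d}.wav" for i in range(150)]

random.seed(3)  # fixed seed so the split is reproducible
random.shuffle(songs)

train, validation = songs[:100], songs[100:]
print(len(train), len(validation))  # 100 50
```

The shuffle before slicing avoids an accidental bias if the files happen to be sorted by genre or artist.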


Question 3

I used ffprobe to check the mp4 files mentioned above and found that each has many tracks: the first track is a mix of all the audio tracks, the other audio tracks are the separated stems (vocals, drums, bass, etc.), and the last track is a video track, but in fact it contains no real video, just a still png image.

ffprobe -hide_banner -i  Young\ Griffo\ -\ Pennies.stem.mp4
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7f8bfc808200] stream 0, timescale not set
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'Young Griffo - Pennies.stem.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 1
    compatible_brands: isom
    creation_time   : 2017-12-16T17:34:20.000000Z
  Duration: 00:04:37.80, start: 0.000000, bitrate: 1288 kb/s
  Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s (default)
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:2(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:3(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:4(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:5: Video: png, rgba(pc), 512x512 [SAR 20157:20157 DAR 1:1], 90k tbr, 90k tbn, 90k tbc (attached pic)

Now I have vocals.m4a and bg-musics.m4a. How can I merge a vocal track, its corresponding bg-music, and an album cover image into one mp4 file using ffmpeg?
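Not an official answer, but as a sketch with a reasonably recent ffmpeg, something along these lines should work. `cover.png` is a hypothetical cover-art file; the first output track is an on-the-fly mix of the two stems, so the layout resembles the stem files above:

```shell
ffmpeg -i vocals.m4a -i bg-musics.m4a -i cover.png \
  -filter_complex "[0:a][1:a]amix=inputs=2[mix]" \
  -map "[mix]" -map 0:a -map 1:a -map 2:v \
  -c:a aac -b:a 256k -c:v png \
  -disposition:v attached_pic \
  output.mp4
```

Note that `amix` scales the inputs down by default to avoid clipping, so the mix track will be quieter than the sum of the stems; newer ffmpeg builds accept `normalize=0` on the filter to disable that.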


Question 4

From question 1 we know that we also need 2 csv files: musdb_train.csv and musdb_validation.csv.

I notice that the musdb_train.csv file has 6 columns:

mix_path | vocals_path | drums_path | bass_path | other_path | duration

If I only need 2 stems, does that mean I only need to provide these 4 columns in the csv file?

mix_path | vocals_path | other_path | duration
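For illustration, if that 2-stem layout is accepted, such a csv might look like this (the paths and durations are made up, not taken from the real musdb csv):

```csv
mix_path,vocals_path,other_path,duration
train/song_001/mixture.wav,train/song_001/vocals.wav,train/song_001/other.wav,215.2
train/song_002/mixture.wav,train/song_002/vocals.wav,train/song_002/other.wav,187.6
```
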

romi1502 commented 2 years ago

Hi @xiebruce. For your first question, you can find some information about the parameters set in config files in the wiki, though it is possibly a bit incomplete. Regarding the instrument list, it should be set according to the dataset you want to train on. For instance, with musdb you can use ["vocals", "instrumentals"], as you have both an instrumentals.wav and a vocals.wav file for every track, and they sum up to mixture.wav.
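For example (a sketch only, not an official config file, and the csv paths are hypothetical), a 2-stem musdb setup might change just these fields of the config shown earlier:

```json
{
    "instrument_list": ["vocals", "instrumentals"],
    "train_csv": "configs/my_train.csv",
    "validation_csv": "configs/my_validation.csv"
}
```
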

For your second question, spleeter was not made to deal with the multi-stem *.mp4 format, so you should use the multiple-waveform version of musdb. You can use a different split than the originally proposed musdb one, which is only provided for algorithm comparison purposes. So if you don't plan to compare your model with other models on the test set, you can use songs from the test set in your training.

The third question concerns musdb, not spleeter, so this is not the right place to ask or answer it.

For your 4th question, indeed you can provide only two stem columns if you'd like to perform 2-stem separation. As mentioned in the answer to question 2, you need to ensure that the provided stems sum up to the mix (i.e. the sum of the stems equals the mix). With musdb, you can do this with the instrumentals stem and the vocals stem.
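That "stems sum to the mix" condition can be sanity-checked with a small sketch (toy pure-Python arrays here; in practice you would load the decoded audio, e.g. with soundfile or librosa):

```python
# Check that the stems sum to the mix, sample by sample, within a
# tolerance that absorbs codec/rounding error.
def stems_sum_to_mix(mix, stems, tol=1e-3):
    return all(
        abs(m - sum(samples)) <= tol
        for m, *samples in zip(mix, *stems)
    )

# Toy mono signals standing in for decoded audio.
vocals        = [0.10, -0.20, 0.30]
instrumentals = [0.05,  0.10, -0.30]
mix           = [0.15, -0.10, 0.00]

print(stems_sum_to_mix(mix, [vocals, instrumentals]))  # True
```

The tolerance matters because lossy codecs like AAC will not reproduce the stems bit-exactly.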

isolepinas commented 2 years ago

Hello! I am doing the same but with Beethoven Cello Sonatas.

How many hours of samples/data are you using to train spleeter?

Thanks!

xiebruce commented 2 years ago

@isolepinas Sorry, I still don't know how to do it yet, but I think it depends on your computer's performance and the size of the sample data. Can you share your whole process, step by step? I prefer examples and screenshots to plain descriptions. Thank you in advance.

isolepinas commented 2 years ago

Dear Bruce,

Just like you, I am in the first steps of the process. My idea is to feed spleeter 3 versions of a performance: piano solo, cello solo, and piano and cello together. I aim to train spleeter to understand what a piano is and what a cello is so that, when they play together, it can extract only the cello without losing vibrato, portamento, and other characteristics.

Some studies have used 44 samples, or 1 hour and 14 minutes of audio, such as here: https://veleslavia.github.io/conditioned-u-net/

Maybe you will find it interesting!

Let's keep in touch so we can share our findings and processes!


xiebruce commented 2 years ago

@isolepinas OK, thank you.