deezer / spleeter

Deezer source separation library including pretrained models.
https://research.deezer.com/projects/spleeter.html
MIT License
25.7k stars 2.82k forks

[BUG] Training on musdb18hq halts or does not complete #732

Open deskstar90 opened 2 years ago

deskstar90 commented 2 years ago

Description

Step to reproduce

  1. Installed using pip
  2. Ran as administrator
  3. Got no error

Output

After I run spleeter train -p 4stems-finetune.json -d E:\musdb18hq,
I never get the training-complete message.

Environment

OS: Windows 10
Installation type: pip
RAM available: 16 GB
Hardware spec: GTX 760 GPU / Intel(R) Core(TM) i5 CPU 750 @ 2.67 GHz

Additional context

I ran training with musdb_config.json and with 4stems-finetune.json many times. It starts as expected, then gets about midway or three quarters of the way through and just hangs for hours doing nothing. I also tried using only half of the musdb dataset, and it did the same thing. After one week of trying to complete this task, with multiple downloads of the musdb dataset, it still keeps failing. All my datasets contain 100 train / 50 test files. I ran the evaluation on the incomplete training results and my table is way off. I'm not sure what's going on or why it halts midway or three quarters of the way through. Am I missing files? Corrupt files? Incompatibilities? I'm running Python 3.8.10 and spleeter 2.3.0. Any help or suggestions would be appreciated.

romi1502 commented 2 years ago

Hi @deskstar90, I can't reproduce your issue. Is the musdb data actually found? Do you see info messages such as the following (you may need to use the --verbose option of the CLI to get them)?
INFO:spleeter:Loading audio b'E:\musdb18hq/train/Triviul - Angelsaint/drums.wav' from 112.352109 to 124.352109
INFO:spleeter:Audio data loaded successfully

Otherwise, it may be a Windows-specific issue. In that case, you may want to try running this with Docker. We currently don't provide an up-to-date GPU image, but there are old deprecated ones (with legacy spleeter versions), such as researchdeezer/spleeter:3.7-gpu, that should still work.
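For the Docker route, an invocation along these lines might work; this is a sketch only: the mount paths, the config location inside the container, the image entrypoint, and whether the legacy image accepts the modern --gpus flag (it may need nvidia-docker instead) are all assumptions, not verified.

```shell
# Hypothetical invocation, wrapped in a function so it is easy to adapt.
# Mount paths and the image entrypoint are assumptions, not verified;
# the legacy image may require nvidia-docker instead of --gpus.
run_spleeter_training() {
  docker run --gpus all \
    -v /path/to/musdb18hq:/data \
    -v "$PWD/4stems-finetune.json:/config/4stems-finetune.json" \
    researchdeezer/spleeter:3.7-gpu \
    train -p /config/4stems-finetune.json -d /data
}
# usage: run_spleeter_training
```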

deskstar90 commented 2 years ago

@romi1502 thank you. I run both computers with all else being equal, except PC #1 has 16 GB of RAM and a GPU, while PC #2 has 32 GB and no GPU (the GPU is not used in this case). So, last night I ran the training again on PC #1, hoping to provide you with a screenshot of where it hangs with the --verbose option for you to inspect, but instead it loaded the audio data from the musdb18hq dataset and then continued on to creating the checkpoints (wow, great); it had never gotten that far before. It has been going for almost 24 hrs now and I don't want to interrupt it. I'm not sure how much longer it will go on, but can you explain the following? Maybe it's not too clear to me:

"train_max_steps": 200000,
"throttle_secs": 600,
"random_seed": 3,
"save_checkpoints_steps": 300,
"save_summary_steps": 5,

I thought it would save 300 checkpoint steps x 5. Each checkpoint takes about 255 s (~4.2 min), which should be about 10.5 hrs of checkpoint steps until we reach 5 times; is that correct? Or does it have to reach 200000 checkpoints? If I increase "throttle_secs" to 1800, will that make the process 3x faster?
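For what it's worth, the arithmetic can be sketched as below. This assumes TensorFlow Estimator semantics: training runs until train_max_steps, and a checkpoint is written every save_checkpoints_steps steps; the 255 s per checkpoint interval is the timing reported in this thread, not a spleeter constant. Under those assumptions, save_summary_steps does not affect when training stops, and throttle_secs only rate-limits how often evaluation may run, so raising it would not make training itself 3x faster.

```shell
# Back-of-envelope estimate (assumptions noted above):
train_max_steps=200000
save_checkpoints_steps=300
secs_per_checkpoint=255    # observed in this thread, ~4.2 min per interval

checkpoints=$(( train_max_steps / save_checkpoints_steps ))
total_hours=$(( checkpoints * secs_per_checkpoint / 3600 ))
echo "${checkpoints} checkpoints, ~${total_hours} hours total"
# prints: 666 checkpoints, ~47 hours total
```

So on this hardware, a full run to 200000 steps would plausibly take about two days, which is consistent with 8000 steps after 30 hrs looking very slow but not hung.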

What are the strictest minimum values required to use for training the 4stems?

deskstar90 commented 2 years ago

@romi1502, this is the log with the --verbose argument. After 30 hrs it reached 8000 checkpoints; I stopped the process, as I don't know whether it would take another 20 years or so to complete. I also find it strange that in #215 the "--verbose" argument made the training start, while without it the CLI did nothing, same as my situation presently with my setups. Maybe you can tell me what's going on. (I snipped the checkpoints part, otherwise it's all redundant.)

C:\Users\MCCM\AppData\Local\Programs\Python\Python38\Lib\site-packages\spleeter\configs>spleeter train --verbose -p 4stems-16kHz.json -d E:\musdb18hq
INFO:tensorflow:Using config: {'_model_dir': '4stems-16kHz', '_tf_random_seed': 3, '_save_summary_steps': 5, '_save_checkpoints_steps': 300, '_save_checkpoints_secs': None, '_session_config': gpu_options { per_process_gpu_memory_fraction: 0.45 }, '_keep_checkpoint_max': 2, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:spleeter:Start model training
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 300 or save_checkpoints_secs None.
WARNING:tensorflow:From c:\users\mccm\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\training\training_util.py:235: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Apply unet for vocals_spectrogram
WARNING:tensorflow:From c:\users\mccm\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\keras\layers\normalization.py:534: _colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer.
INFO:tensorflow:Apply unet for drums_spectrogram
INFO:tensorflow:Apply unet for bass_spectrogram
INFO:tensorflow:Apply unet for other_spectrogram
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into 4stems-16kHz\model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:spleeter:Loading audio b'E:\musdb18hq\train/Voelund - Comfort Lives In Belief/mixture.wav' from 150.32389551724137 to 170.32389551724137
INFO:spleeter:Loading audio b'E:\musdb18hq\train/Faces On Film - Waiting For Ga/mixture.wav' from 49.4185455862069 to 69.4185455862069
INFO:spleeter:Loading audio b'E:\musdb18hq\train/Snowmine - Curfews/mixture.wav' from 79.33290644827585 to 99.33290644827585
INFO:spleeter:Loading audio b'E:\musdb18hq\train/Jokers, Jacks & Kings - Sea Of Leaves/mixture.wav' from 153.33673779310345 to 173.33673779310345
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Loading audio b'E:\musdb18hq\train/Night Panther - Fire/mixture.wav' from 40.18501048275862 to 60.18501048275862
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Loading audio b'E:\musdb18hq\train/The Districts - Vermont/mixture.wav' from 186.06246172413793 to 206.06246172413793
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Loading audio b'E:\musdb18hq\train/Port St Willow - Stay Even/mixture.wav' from 214.72627244827586 to 234.72627244827586
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Loading audio b'E:\musdb18hq\train/The Districts - Vermont/mixture.wav' from 150.37737293103447 to 170.37737293103447
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Loading audio b'E:\musdb18hq\train/Faces On Film - Waiting For Ga/vocals.wav' from 49.4185455862069 to 69.4185455862069
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Loading audio b'E:\musdb18hq\train/Clara Berry And Wooldog - Stella/mixture.wav' from 72.73108606896552 to 92.73108606896552
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Loading audio b'E:\musdb18hq\train/Voelund - Comfort Lives In Belief/vocals.wav' from 150.32389551724137 to 170.32389551724137
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Loading audio b'E:\musdb18hq\train/The Districts - Vermont/mixture.wav' from 43.32210655172413 to 63.32210655172413
INFO:spleeter:Audio data loaded successfully
INFO:spleeter:Audio data loaded successfully

...

Spencer19990618 commented 2 years ago

I ran into the same problem. When I run "spleeter train -p configs/musdb_config.json -d ~/data/train/", it seems to do nothing during training.

Then I tried "spleeter train --verbose -p configs/musdb_config.json -d ~/data/train/", and it prints: "WARNING:tensorflow:Training with estimator made no steps. Perhaps input is empty or misspecified".

deskstar90 commented 2 years ago

@Spencer19990618 Mine is training but not completing, for some unknown reason. Yours is not training at all because you're not pointing to the dataset properly.
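For anyone hitting the "made no steps" warning, a layout sanity check may help. Judging from the log paths above (e.g. E:\musdb18hq\train/.../mixture.wav), song paths are resolved relative to the directory passed to -d, so -d should point at the dataset root rather than at the train/ subfolder; that is an inference from this thread's logs, not documented behaviour. The helper below is hypothetical, not part of spleeter:

```shell
# Hypothetical sanity check: verify each song folder under train/
# contains the five wav files the logs above show spleeter loading.
check_song() {
  for stem in mixture vocals drums bass other; do
    if [ ! -f "$1/$stem.wav" ]; then
      echo "missing: $1/$stem.wav"
      return 1
    fi
  done
  echo "ok: $1"
}

# usage: for d in /path/to/musdb18hq/train/*/; do check_song "$d"; done
```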

Spencer19990618 commented 2 years ago

@deskstar90 Hi! I downloaded MUSDB18 from the website, and the file is called MUSDB18-HQ (I'm not sure if it's the dataset you are using). It would be nice if you could give me some suggestions.