CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pickle error running synthesizer_train.py #669

Closed wood73 closed 3 years ago

wood73 commented 3 years ago

I've read that Python multiprocessing doesn't work well on Windows 10 (and that this repo has better Linux support), so my plan B is to set up a Linux dual-boot for the sole purpose of training single-speaker models.

I have the latest version of this repo, with Visual Studio 2019, CUDA 11.0, the compatible cuDNN version, and webrtcvad. I've installed PyTorch 1.7.1 with CUDA 11.0 support and the latest Nvidia drivers (and rebooted my system). torch.cuda.is_available() returns True, and I'm able to run demo_toolbox.py without errors.

I'm testing this on the logs-singlespeaker zip I found somewhere in this repo, and I wrote a simple script that reformats each line of 211-122425.alignment.txt into a new .txt file matched to the corresponding .flac file (a sketch of that script follows the commands below). I cleared the SV2TTS/synthesizer folder to recreate the single-speaker training process, and had no issues generating the files in the audio folder, embeds folder, mels folder, and train.txt with the commands

python synthesizer_preprocess_audio.py C:\Users\username\Downloads\Real-Time-Voice-Cloning-master2\Real-Time-Voice-Cloning-master\logs-singlespeaker-test\logs-singlespeaker\datasets_root --datasets_name LibriSpeech --subfolders train-clean-100 --no_alignments

python synthesizer_preprocess_embeds.py C:\Users\username\Downloads\Real-Time-Voice-Cloning-master2\Real-Time-Voice-Cloning-master\logs-singlespeaker-test\logs-singlespeaker\datasets_root\SV2TTS\synthesizer
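
For reference, the alignment-reformatting script mentioned above might look roughly like the following. This is a hypothetical sketch: the file location and the usual LibriSpeech alignment line format (<utterance_id> "<comma-separated words>" "<comma-separated end times>") are assumptions, not code from this thread.

# Hypothetical sketch: turn each line of a LibriSpeech *.alignment.txt file into a
# per-utterance transcript so that --no_alignments preprocessing can pair it with
# the .flac file of the same name.
from pathlib import Path

alignment_file = Path("211-122425.alignment.txt")  # assumed to sit next to the flac files
for line in alignment_file.read_text().splitlines():
    utt_id, words, _end_times = line.split(" ")
    transcript = " ".join(w for w in words.strip('"').split(",") if w)
    (alignment_file.parent / (utt_id + ".txt")).write_text(transcript)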

Here is the error from running synthesizer_train.py:

python synthesizer_train.py pretrained C:\Users\username\Downloads\Real-Time-Voice-Cloning-master2\Real-Time-Voice-Cloning-master\logs-singlespeaker-test\logs-singlespeaker\datasets_root\SV2TTS\synthesizer -s 50 -b 50
Arguments:
    run_id:          pretrained
    syn_dir:         C:\Users\username\Downloads\Real-Time-Voice-Cloning-master2\Real-Time-Voice-Cloning-master\logs-singlespeaker-test\logs-singlespeaker\datasets_root\SV2TTS\synthesizer
    models_dir:      synthesizer/saved_models/
    save_every:      50
    backup_every:    50
    force_restart:   False
    hparams:

Checkpoint path: synthesizer\saved_models\pretrained\pretrained.pt
Loading training data from: C:\Users\username\Downloads\Real-Time-Voice-Cloning-master2\Real-Time-Voice-Cloning-master\logs-singlespeaker-test\logs-singlespeaker\datasets_root\SV2TTS\synthesizer\train.txt
Using model: Tacotron
Using device: cuda

Initialising Tacotron Model...

Trainable Parameters: 30.870M

Loading weights at synthesizer\saved_models\pretrained\pretrained.pt
Tacotron weights loaded from step 295000
Using inputs from:
        C:\Users\username\Downloads\Real-Time-Voice-Cloning-master2\Real-Time-Voice-Cloning-master\logs-singlespeaker-test\logs-singlespeaker\datasets_root\SV2TTS\synthesizer\train.txt
        C:\Users\username\Downloads\Real-Time-Voice-Cloning-master2\Real-Time-Voice-Cloning-master\logs-singlespeaker-test\logs-singlespeaker\datasets_root\SV2TTS\synthesizer\mels
        C:\Users\username\Downloads\Real-Time-Voice-Cloning-master2\Real-Time-Voice-Cloning-master\logs-singlespeaker-test\logs-singlespeaker\datasets_root\SV2TTS\synthesizer\embeds
Found 48 samples
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   25k Steps    |     12     |     3e-05     |        2         |
+----------------+------------+---------------+------------------+

Traceback (most recent call last):
  File "synthesizer_train.py", line 35, in <module>
    train(**vars(args))
  File "C:\users\username\downloads\real-time-voice-cloning-master2\real-time-voice-cloning-master\synthesizer\train.py", line 158, in train
    for i, (texts, mels, embeds, idx) in enumerate(data_loader, 1):
  File "C:\Users\username\anaconda3\envs\foodie\lib\site-packages\torch\utils\data\dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "C:\Users\username\anaconda3\envs\foodie\lib\site-packages\torch\utils\data\dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\username\anaconda3\envs\foodie\lib\site-packages\torch\utils\data\dataloader.py", line 801, in __init__
    w.start()
  File "C:\Users\username\anaconda3\envs\foodie\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\username\anaconda3\envs\foodie\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\username\anaconda3\envs\foodie\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\username\anaconda3\envs\foodie\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\username\anaconda3\envs\foodie\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.<lambda>'

(foodie) C:\users\username\downloads\real-time-voice-cloning-master2\real-time-voice-cloning-master>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\username\anaconda3\envs\foodie\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\username\anaconda3\envs\foodie\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
ghost commented 3 years ago

Thanks for reporting this issue @woodrow73 . Please try and see if this is reproducible with a normal Python installation, instead of Anaconda. Reference the Windows install instructions if needed. There are several bugs which seem specific to conda (#644, #646) and we don't have enough developer support to squash them.

ghost commented 3 years ago

Same issue reported here, but without Anaconda. https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/472#issuecomment-721874292

@rallandr Did you ever find a solution to the pickle problem when training on Windows?

wood73 commented 3 years ago

Confirming an identical error without a virtual environment. As a workaround I tried to force CPU usage instead of GPU - short of reinstalling CUDA (changing file names & environment variables didn't do the trick - maybe because of the bundled PyTorch CUDA libraries), I manually changed the 10 places in the repository where torch.cuda.is_available() is checked to False, though cached .pyc files might make that approach moot; either way, same error message.
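
(Side note: a less invasive way to force CPU than editing the torch.cuda.is_available() calls is to hide the GPU from PyTorch before it initialises CUDA. A minimal sketch follows; note that this only forces CPU and would not avoid the pickle error itself, which comes from multiprocessing rather than CUDA.)

# Sketch: make torch.cuda.is_available() return False everywhere without editing the repo.
# The environment variable must be set before torch initialises CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
print(torch.cuda.is_available())  # False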

Gonna try running with your Ubuntu 20.04 instructions

ghost commented 3 years ago

@woodrow73 As a workaround, can you try setting num_workers=0 in this part of train.py? This makes the DataLoader run in the main Python process, which will be slower.

https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/9a35b3e022d184e7ed2e7ce612413d765f50eb67/synthesizer/train.py#L146-L151
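
For illustration, the change would look something like the sketch below, assuming the DataLoader in synthesizer/train.py is constructed roughly like this (the local lambda collate_fn is what Windows' spawn-based worker startup fails to pickle):

from torch.utils.data import DataLoader

data_loader = DataLoader(dataset,
                         collate_fn=lambda batch: collate_synthesizer(batch, r, hparams),
                         batch_size=batch_size,
                         num_workers=0,  # 0 instead of the repo default: batches are built in the main process, so nothing needs pickling
                         shuffle=True,
                         pin_memory=True)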


wood73 commented 3 years ago

Thanks for the workaround - just got around to trying it out & initially got a memory error:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 4.25 GiB already allocated; 10.70 MiB free; 4.45 GiB reserved in total by PyTorch)

So I edited synthesizer/hparams.py to decrease the batch size. synthesis_batch_size didn't seem to have an effect, so I left that at 16 and changed the batch_size values in tts_schedule to 5, which seems to be the highest value I can give it without running out of memory.
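
For anyone following along, the hparams edit described above would be along these lines. A sketch only: the milestone and learning-rate numbers shown are illustrative and may not match the shipped defaults exactly; the edit is just the last value of each tuple.

# synthesizer/hparams.py (sketch): tuple format is (r, learning_rate, step_milestone, batch_size);
# only the batch size is lowered from 12 to 5.
tts_schedule = [(7, 1e-3,  20_000, 5),
                (5, 3e-4,  40_000, 5),
                (2, 1e-4,  80_000, 5),
                (2, 3e-5, 320_000, 5),
                (2, 1e-5, 640_000, 5)]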

I timed the training duration for the first 10 epochs (steps?), which took 5 minutes 3 seconds - here's what the console prints for me:

Found 48 samples
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   25k Steps    |      5     |     3e-05     |        2         |
+----------------+------------+---------------+------------------+

{| Epoch: 1/2500 (10/10) | Loss: 0.5695 | 0.33 steps/s | Step: 295k | }
.
.
{| Epoch: 100/2500 (10/10) | Loss: 0.3283 | 0.35 steps/s | Step: 296k | }

The program appears to be using 20-50% of my Nvidia GeForce GTX 1060 6GB and 19-35% of my i5-6600K. Thanks for the help - this approach seems to train at about the same speed as ori-pixel's CPU approach with an i5-4690K, if each step equals an epoch. I can easily get 400 epochs overnight - I'm thrilled that it's working now; I think I'll still give Ubuntu a try.

ghost commented 3 years ago

Thanks for the update. The training speed seems slow to me. Since neither your CPU nor your GPU is at 100%, I think the bottleneck is the storage medium. Try moving your datasets_root/SV2TTS folder to an SSD. Is max_mel_frames=900? You can preprocess with a lower number like 600, which will reduce memory consumption and training time per step.
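
The corresponding hparams change would be along these lines (a sketch):

# synthesizer/hparams.py (sketch): utterances longer than this many mel frames are
# discarded during preprocessing, so lowering it reduces memory use per training step.
max_mel_frames = 600  # was 900; at 12.5 ms per frame this keeps wavs up to ~7.5 s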

Your GPU should be capable of 1 step/sec and handle a batch size between 16 and 24 with reduction factor r=2. See if it is faster on Ubuntu.

A quick read you may find helpful: https://docs.paperspace.com/machine-learning/wiki/epoch

wood73 commented 3 years ago

No trouble, thanks for the pointers. I hadn't considered the storage medium as a variable, but it makes sense given all the reading & writing; however, the data is already on an internal SSD (850 EVO).

Yes, my max_mel_frames is 900 - after preprocessing the data with max_mel_frames = 600, I was able to raise the batch size to 8. Here's the output after 5 min of running:

Found 32 samples
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   20k Steps    |      8     |     0.001     |        2         |
+----------------+------------+---------------+------------------+

{| Epoch: 1/5000 (4/4) | Loss: 12.23 | 0.39 steps/s | Step: 0k | }
.
.
{| Epoch: 34/5000 (4/4) | Loss: 1.010 | 0.47 steps/s | Step: 0k | }

After preprocessing the data with max_mel_frames = 300, I could change the batch size to the default of 12 (didn't try higher); here's the result of running for 5 min:

Found 14 samples
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   20k Steps    |      12    |     0.001     |        2         |
+----------------+------------+---------------+------------------+

{| Epoch: 1/10000 (2/2) | Loss: 9.556 | 0.76 steps/s | Step: 0k | }
.
.
{| Epoch: 125/10000 (2/2) | Loss: 0.8832 | 1.0 steps/s | Step: 0k | }

This has me curious about the cost to the output of altering both max_mel_frames & batch size - as the epoch description in the link suggests, a smaller batch size likely introduces more noise, since it's too small to properly represent all the data. I'm not ML-savvy enough (yet) to understand exactly what the first part of the console output is communicating - should 1/10000 (2/2) be alarming?

I'll try it with Ubuntu within the week, after I post my 3-17 second utterance extractor for a single wav file (already finished, I just want to clean & document it more). I'm also going to rig something up to help automate writing down the timestamp at which each word is said (nothing fancy - e.g. playing the audio clip slowly and pressing a button at the start of each word) - or maybe I'll give in and try a forced aligner like the one you showed here.

ghost commented 3 years ago

You posted output for these cases (corresponding to max_mel_frames = 900, 600 and 300 respectively):

  1. 48 samples, batch size 5, 25k steps remaining until training stage is completed.
  2. 32 samples, batch size 8, 20k steps remaining.
  3. 14 samples, batch size 12, 20k steps remaining.

Notice that lowering max_mel_frames results in your longer samples being discarded due to length. This shrinks your dataset. For example, each mel frame is 0.0125 seconds, so setting max_mel_frames = 300 will keep wav files up to 3.75 sec.
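
A quick back-of-the-envelope check of those cut-offs (12.5 ms per mel frame, i.e. a hop of 200 samples at 16 kHz):

frame_seconds = 0.0125
for max_mel_frames in (300, 600, 900):
    print(max_mel_frames, "->", max_mel_frames * frame_seconds, "s")
# 300 -> 3.75 s, 600 -> 7.5 s, 900 -> 11.25 s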

This diagram may be helpful to understand the display:

                 ___ total epochs to complete training stage
                |
                |        ___ total steps to complete epoch
                |       |
{| Epoch: 100/2500 (10/10) | Loss: 0.3283 | 0.35 steps/s | Step: 296k | }
           |        |                                             |
           |        |___ steps completed on current epoch         |
           |             (increments when step completed)         |
           |                                                      |
           |___ epochs completed on current training stage        |
                (increments when epoch is started)                |
                                                                  |
              total steps completed across all training stages ___|
wood73 commented 3 years ago

Notice that lowering max_mel_frames results in your longer samples being discarded due to length. This shrinks your dataset. For example, each mel frame is 0.0125 seconds, so setting max_mel_frames = 300 will keep wav files up to 3.75 sec.

Interesting - aside from the dataset being smaller, I imagine there may be additional noise issues from abruptly cutting off words in the wav files. If I'm understanding this correctly, max_mel_frames = 900 will still cut off wav files at 11.25 seconds? Well, this is definitely yet another motivation to unlock the potential of my hardware via an Ubuntu installation.

Thanks for the info; I'll also post the errors / workarounds here since it's on topic for synthesizer training on Windows.

wood73 commented 3 years ago

Error from running synthesizer_preprocess_audio.py (and an almost identical one from synthesizer_preprocess_embeds.py):

  File "C:\Users\username\AppData\Roaming\Python\Python37\site-packages\torch\__init__.py", line 117, in <module>
    from torch.nn.utils import clip_grad_norm_
  File "C:\Users\username\AppData\Roaming\Python\Python37\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\username\AppData\Roaming\Python\Python37\site-packages\torch\lib\cusolverMg64_10.dll" or one of its dependencies.
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\username\AppData\Roaming\Python\Python37\site-packages\torch\lib\cudnn_adv_infer64_8.dll" or one of its dependencies.
LibriSpeech: 100%|█████████████████████████| 1/1 [00:11<00:00, 11.28s/speakers]
The dataset consists of 32 utterances, 10936 mel frames, 2183520 audio timesteps (0.04 hours).
Max input length (text chars): 330
Max mel frames length: 586
Max audio timesteps length: 117120

Pastebin of the full error logs from both synthesizer_preprocess_audio.py and synthesizer_preprocess_embeds.py.

Workaround

Another error I encountered afterwards was a CUDA memory error, which I traced to synthesizer_preprocess_embeds.py, where the help text for one of its arguments contained the information needed for the workaround:

parser.add_argument("-n", "--n_processes", type=int, default=1, help= \
        "Number of parallel processes. An encoder is created for each, so you may need to lower "
        "this value on GPUs with low memory. Set it to 1 (default is 4) if CUDA is unhappy.")
ghost commented 3 years ago

If I'm understanding this correctly, having max_mel_frames = 900 will still cut off Wav-Files at 11.25 seconds?

Wav files that are too long are not truncated. Instead, they are dropped from the training set entirely. This can be avoided by using alignment data to split the wavs.
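
For reference, dropping --no_alignments from the earlier preprocessing command should make the preprocessor use the *.alignment.txt files to split long utterances, assuming those files sit next to the flacs (path shortened here):

python synthesizer_preprocess_audio.py <datasets_root> --datasets_name LibriSpeech --subfolders train-clean-100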

arstropica commented 3 years ago

My workaround for this issue of lambda objects not being picklable on Windows was to install and use multiprocessing_on_dill instead of Python's multiprocessing module. This meant replacing two references to the native multiprocessing module in the __init__.py file for torch.multiprocessing.

There should be a better way of doing this via the multiprocessing_context argument of the DataLoader class constructor: extend torch.multiprocessing, replace the reference in the extended class, and pass that to the constructor. Anyone more familiar with Python and/or torch should be able to do a PR on this. Conditionally installing multiprocessing_on_dill on Windows would also have to be a prerequisite.

arstropica commented 3 years ago

I may have spoken too soon.

Using dill instead of pickle prevents the serialization error, but there doesn't seem to be any performance change with the value of num_workers.

With my NVIDIA GeForce RTX 2080 Ti:

num_workers: 0

+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   15k Steps    |     12     |     0.001     |        2         |
+----------------+------------+---------------+------------------+

{| Epoch: 1/3750 (4/4) | Loss: 0.1510 | 0.40 steps/s | Step: 5k | }
{| Epoch: 2/3750 (4/4) | Loss: 0.1501 | 0.44 steps/s | Step: 5k | }
{| Epoch: 3/3750 (4/4) | Loss: 0.1488 | 0.45 steps/s | Step: 5k | }
{| Epoch: 4/3750 (4/4) | Loss: 0.1493 | 0.45 steps/s | Step: 5k | }
{| Epoch: 5/3750 (4/4) | Loss: 0.1507 | 0.46 steps/s | Step: 5k | }
{| Epoch: 6/3750 (4/4) | Loss: 0.1494 | 0.45 steps/s | Step: 5k | }
{| Epoch: 7/3750 (4/4) | Loss: 0.1475 | 0.45 steps/s | Step: 5k | }

num_workers: 2

+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   15k Steps    |     12     |     0.001     |        2         |
+----------------+------------+---------------+------------------+

{| Epoch: 1/3750 (4/4) | Loss: 0.1506 | 0.40 steps/s | Step: 5k | }
{| Epoch: 2/3750 (4/4) | Loss: 0.1577 | 0.45 steps/s | Step: 5k | }
{| Epoch: 3/3750 (4/4) | Loss: 0.1549 | 0.48 steps/s | Step: 5k | }
{| Epoch: 4/3750 (4/4) | Loss: 0.1569 | 0.49 steps/s | Step: 5k | }
{| Epoch: 5/3750 (4/4) | Loss: 0.1536 | 0.50 steps/s | Step: 5k | }
{| Epoch: 6/3750 (4/4) | Loss: 0.1517 | 0.50 steps/s | Step: 5k | }
{| Epoch: 7/3750 (4/4) | Loss: 0.1520 | 0.49 steps/s | Step: 5k | }
{| Epoch: 8/3750 (4/4) | Loss: 0.1504 | 0.48 steps/s | Step: 5k | }
{| Epoch: 9/3750 (4/4) | Loss: 0.1496 | 0.48 steps/s | Step: 5k | }

Not sure multiprocessing is working at all with my fix.

ghost commented 3 years ago

Thanks for sharing your observations @arstropica . num_workers is for the data loader. If the CPU is the bottleneck for training speed, performance may improve with a higher setting. If the bottleneck is the GPU or disk, then num_workers is not relevant.

Not sure GPU acceleration is working properly there. I would expect 5-10x faster training speed with a 2080ti.

arstropica commented 3 years ago

@blue-fish Thanks for your observation. I am not sure how to address the CUDA performance issue.

The GPU is definitely working, but only at a minimal rate. Benchmark tests show my SSD performance is within expected values.

Perhaps it is due to my GPU driver being out of sync with the CUDA toolkit. Here is the output from nvidia-smi.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.42       Driver Version: 465.42       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|

Not sure why performance is so bad. I have tried installing different toolkit versions (10.0, 11.1 and 11.2) with the same result. Do I need to downgrade my driver to match the toolkit?

ghost commented 3 years ago

Going back to the original problem, I don't think it is widespread enough on Windows to change the default to num_workers=0. But here is a potential solution (not tested) in case this gets reported again: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/89a99647a9e1a5b122305773610e1a4c38e41329

I am going to close this issue. @arstropica , you're invited to open an issue for the training speed problem. I do not have any ideas, but someone else might.