Very detailed tutorial up-to-date

FurkanGozukara commented 1 year ago

I made a pull request too, please accept it if possible: https://github.com/devilismyfriend/ozen-toolkit/pull/7

Master Deep Voice Cloning in Minutes: Unleash Your Vocal Superpowers! Free and Locally on Your PC

This tutorial is based on Ozen Toolkit for data preprocessing, DLAS for training, and Tortoise TTS Fast for speech synthesis.

AIhasArrived commented 1 year ago

Hello @FurkanGozukara, do you know if he is maintaining it? Anyway, I followed your tutorial and found this problem. I will show you with a screenshot:

I tried opening the ozen env, then uninstalling torch and installing the version you suggested. It did not install successfully, so I went back to the latest torch. I did not touch the pyannote version, as someone in an unrelated project said you can apparently ignore the warnings: https://github.com/voicepaw/so-vits-svc-fork/issues/471

My problem is that it is making only 2 wav files from a 17-minute audio file. Also, you did not say ANYTHING about the training file, did you? I watched the video from the beginning and you literally say no word about the nature of the training speech, am I wrong? What should I know about it? I am assuming it is a file that contains an audio recording of the voice we want to clone and that's it, right?

What can I do to solve this issue? I am getting only 2 wav files, not 300 files like you. I would REALLY appreciate your help.

FurkanGozukara commented 1 year ago

follow my tutorial but most importantly use the github repo i shared in the tutorial

it still works

about training sound files

there are configuration settings in ozen toolkit. change them 1 by 1 and you will find the pattern of how it splits the sound

sadly i didn't cover it in the tutorial at that time

try to make training sounds between 2 and 15 seconds long
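A quick way to check the 2 to 15 second advice against a folder of generated clips is to measure each file. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration, and it assumes the clips are plain PCM .wav files that the `wave` module can open.

```python
import wave
from pathlib import Path


def clip_duration_seconds(wav_path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def find_out_of_range_clips(folder, min_seconds=2.0, max_seconds=15.0):
    """List (file name, duration) pairs for clips outside the target range."""
    flagged = []
    for wav_path in sorted(Path(folder).glob("*.wav")):
        duration = clip_duration_seconds(wav_path)
        if not min_seconds <= duration <= max_seconds:
            flagged.append((wav_path.name, round(duration, 2)))
    return flagged
```

Calling `find_out_of_range_clips` on the folder that holds the split wav files returns the clips that are too short or too long, which makes it easier to see whether a settings change moved things in the right direction.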

AIhasArrived commented 1 year ago

Thanks a lot for the quick answer @FurkanGozukara. I actually used your repo and tried to follow your steps exactly, one by one. I even went to your Whisper video and followed it 100% before proceeding to the voice cloning one, so I have spent my whole day on this and am still continuing. The video I tried to download has a lot of fast talk. I noticed that ozen toolkit divided it into ONLY 2 FILES, because the video has an introduction line, then a little pause, then the guy talks for 17 minutes straight. So I tried to use LosslessCut to remove the intro and see how it works with the rest of the video. Well, ozen toolkit gave me ONLY ONE FILE THIS TIME.

Do you think it has to do with the frequency of the guy's speech? Maybe he never stops, so the program thought it was one whole long sentence of 17 minutes lol?

What are the config settings I could change and where are they? Are they in \ozen-toolkit\config.ini? This is what it contains for me:

[DEFAULT]
hf_token = ....
whisper_model = openai/whisper-large-v2
device = cuda
diaization_model = pyannote/speaker-diarization
segmentation_model = pyannote/segmentation
valid_ratio = 0.2
seg_onset = 0.6
seg_offset = 0.4
seg_min_duration = 2.0
seg_min_duration_off = 0.0

Do you think this is what I should try to modify, or were you talking about something else? I can also share the video I chose to clone the voice from, so you can try it and see how it goes for you (if you want).
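For anyone who wants to inspect or tweak these values from a script rather than by hand, here is a minimal sketch with Python's standard `configparser`. The sample text only mirrors the numeric keys quoted above, and the helper names are invented for illustration; this is not code from ozen-toolkit.

```python
import configparser
import io

# Sample text mirroring the numeric keys quoted in this thread.
SAMPLE_CONFIG = """
[DEFAULT]
valid_ratio = 0.2
seg_onset = 0.6
seg_offset = 0.4
seg_min_duration = 2.0
seg_min_duration_off = 0.0
"""

SEGMENTATION_KEYS = (
    "valid_ratio",
    "seg_onset",
    "seg_offset",
    "seg_min_duration",
    "seg_min_duration_off",
)


def load_segmentation_settings(config_text):
    """Parse the [DEFAULT] section and return the numeric settings as floats."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    defaults = parser["DEFAULT"]
    return {key: defaults.getfloat(key) for key in SEGMENTATION_KEYS}


def with_override(config_text, key, value):
    """Return new config text with one [DEFAULT] value replaced."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    parser["DEFAULT"][key] = str(value)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(load_segmentation_settings(SAMPLE_CONFIG))
```

Changing one key at a time with `with_override` and saving the result as config.ini keeps each experiment easy to compare with the previous one.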

FurkanGozukara commented 1 year ago

ye this is accurate

"Do you think it has to do with the frequency of the guy speech? Maybe he never stops so the program thought it was one whole long unique sentence of 17 minutes lol ?"

play with these settings to see if you can improve

valid_ratio = 0.2
seg_onset = 0.6
seg_offset = 0.4
seg_min_duration = 2.0
seg_min_duration_off = 0.0
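To get a feel for what these numbers might do, here is a simplified illustration of onset/offset thresholding on per-frame speech scores. It assumes the seg_* settings behave like the onset, offset, min_duration_on and min_duration_off hyperparameters of pyannote-style voice activity detection; the names suggest that, but the thread does not confirm how ozen-toolkit uses them. The `binarize` function is written for this example and is not taken from either library.

```python
def binarize(scores, frame_seconds, onset=0.6, offset=0.4,
             min_duration_on=0.0, min_duration_off=0.0):
    """Turn per-frame speech scores into (start, end) regions in seconds."""
    regions = []
    active = False
    start = 0.0
    for index, score in enumerate(scores):
        time = index * frame_seconds
        if not active and score > onset:
            # Speech starts once the score rises above the onset threshold.
            active, start = True, time
        elif active and score < offset:
            # Speech ends only when the score drops below the offset threshold.
            regions.append((start, time))
            active = False
    if active:
        regions.append((start, len(scores) * frame_seconds))

    # Merge regions separated by a pause shorter than min_duration_off.
    merged = []
    for region in regions:
        if merged and region[0] - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], region[1])
        else:
            merged.append(region)

    # Drop regions shorter than min_duration_on.
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# A fast talker: the score dips briefly to 0.5 but never falls below 0.4.
fast_talk = [0.9] * 5 + [0.5] * 2 + [0.9] * 5
print(len(binarize(fast_talk, 0.1)))               # default thresholds
print(len(binarize(fast_talk, 0.1, offset=0.55)))  # stricter offset
```

Under these assumptions, a speaker who never pauses long enough for the score to fall below the offset stays in one region, which would match the single 17-minute segment. Raising seg_offset toward seg_onset may produce more cuts, but that is something to verify by experiment.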

AIhasArrived commented 1 year ago

Also, what I don't understand is that one or 2 sentences go into the "train" folder, and the rest (the whole 17 minutes) goes into one very long sentence inside the "valid" folder? I am so confused.

FurkanGozukara commented 1 year ago

yes it will split the data so that during training it is both trained and tested

it is expected

AIhasArrived commented 1 year ago

Can you try with this video and see what works best for you? https://www.youtube.com/watch?v=SyMhBeV4lg8 I did not know which voice to clone, and I have a friend who watches him, so I want to surprise him by cloning this voice lol. I don't usually watch him, so I am just interested in cloning a voice. I tried changing seg_min_duration from 2.0 to 0.2 and got one extra file (2 sentences in train and 1 long sentence in valid.txt).

Can you try with him and see what works for you please? So I can have peace of mind and know that the problem is not on my side (as I said, I spent the whole day on this and would love to finish it today before midnight).

AIhasArrived commented 1 year ago

The problem is that it put 5% of the audio into train.txt and the remaining 95% of the audio (from the video) into valid.txt. That's strange, no? You can check it with the link I sent in the previous comment.
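One possible reading of such a lopsided result: if segments are assigned to train and valid by count rather than by duration, a single huge segment can carry almost all of the audio into one side. The sketch below illustrates that idea only. The thread does not show how Ozen Toolkit actually splits, so the by-count rule, the file names and the durations here are all assumptions.

```python
def split_by_count(segments, valid_ratio=0.2):
    """Split (name, seconds) segments into train and valid lists by count."""
    # At least one segment goes to valid; the rest stay in train.
    valid_count = max(1, round(len(segments) * valid_ratio))
    return segments[:-valid_count], segments[-valid_count:]


def duration_share(part, whole):
    """Fraction of the total audio duration held by one part of the split."""
    return sum(seconds for _, seconds in part) / sum(seconds for _, seconds in whole)


# Hypothetical 17-minute recording: two short intro clips and one long remainder.
segments = [("intro_1.wav", 25.0), ("intro_2.wav", 26.0), ("rest.wav", 969.0)]
train, valid = split_by_count(segments)
print(f"train: {len(train)} files, valid: {len(valid)} files")
print(f"valid holds {duration_share(valid, segments):.0%} of the audio")
```

If this is what is happening, the ratio itself is not the real problem. The segmentation is, because it leaves only three segments to split.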

AIhasArrived commented 1 year ago

Btw, what training speech file did you use? I would use it to make sure everything is working right for me, then try other types of training files. This would ensure I don't have any unrelated errors.

AIhasArrived commented 1 year ago

Look this is my train.txt file:

Then the rest of the audio in valid.txt:

Rest of the video in one looong sentence..

AIhasArrived commented 1 year ago

Hey @FurkanGozukara, I tried a new video with a guy that speaks LESS RAPIDLY lol, this audiobook: https://www.youtube.com/watch?v=gQZ93iqoxfg And I got 45 wav files finally! Now I am continuing your tutorial :). I have 3 questions that I will leave for you:

1) Do you know what settings would work best with the first video, the one with a guy speaking very fast? (This is the one I shared earlier: https://www.youtube.com/watch?v=SyMhBeV4lg8. I still don't know what exactly to change in the settings to make it work. I tried but did not succeed, maybe you can? It would be interesting to know.)

2) On YouTube you answered one of the comments about other languages, saying it requires time. What did you mean? Is it possible to fine-tune to learn cloning in other languages? How would you do it theoretically? (I still have not finished your video, I am finishing it after this comment.)

3) Is this your latest video / technology you know of concerning voice cloning, or are you aware of other techs?

thanks

FurkanGozukara commented 1 year ago

1 : sadly i don't know. you can try those ozen configs

2 : yes possible but i never tried myself. i saw in some comments in those repos that is possible

3 : yes voice cloning is on my research list. hopefully i will come up with even better tutorial video

AIhasArrived commented 1 year ago

Hello, thanks for your answer. I finished the Whisper video, then continued with the voice cloning video. Everything went well until minute 23:25, where you suggested we get the same Python version as you, which I did. But I used conda to install it instead of "python -m venv venv", so I used this: conda create --name env3109 python=3.10.9

It allowed me to keep my general Python version and add this one. Btw, you should make a tutorial about how to manage multiple different versions of Python, especially how to manage the PATH. Now in my base env I can only use my previous general Python when typing python; I can't use Python 3.10.9 by just typing python. This version only works inside the env (env3109).
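When several Python installs coexist, it helps to confirm which interpreter a given shell actually runs. A minimal sketch, run with whatever `python` resolves to in the current shell; the function name is made up for illustration and the default expected version is simply the 3.10.9 mentioned above.

```python
import sys


def describe_interpreter(expected=(3, 10, 9)):
    """Report which Python is running and whether it matches the expected version."""
    found = tuple(sys.version_info[:3])
    return {
        "executable": sys.executable,  # full path of the interpreter in use
        "version": found,
        "matches": found == expected,
    }


print(describe_interpreter())
```

Running this once in the base env and once inside env3109 shows two different executable paths, which is the expected behaviour of an isolated environment rather than a PATH problem.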

Anyway, then I tried the next line of code you showed:

pip3 install torch==1.13.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

For conda I replaced it with:

1) conda install torch==1.13.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

Errors:

usage: conda-script.py [-h] [-V] command ...
conda-script.py: error: unrecognized arguments: --index-url https://download.pytorch.org/whl/cu11

2) I tried this: conda install torch==1.13.1 torchvision torchaudio

From their website: https://pytorch.org/get-started/previous-versions/ But I got errors as well:

PackagesNotFoundError: The following packages are not available from current channels:

  - torch==1.13.1
  - torchaudio

Current channels:

  - https://repo.anaconda.com/pkgs/main/win-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/win-64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://repo.anaconda.com/pkgs/msys2/win-64
  - https://repo.anaconda.com/pkgs/msys2/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

3) Same for: conda install pytorch torchvision torchaudio pytorch-cuda=1.13.1 -c pytorch -c nvidia

4) So I ended up using a more recent version: conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

I got multiple errors such as:

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: nvidia/win-64::cuda-cupti-11.8.87-0, nvidia/win-64::cuda-nvtx-11.8.86-0
  path: 'build_env_setup.bat'

Then I continued following your instructions: pip3 install git+https://github.com/152334H/BigVGAN.git

This line: git clone https://github.com/152334H/BigVGAN.git does not work in a normal Python env. In conda I needed to break it down into:

cd BigVGAN
pip install -r requirements.txt
pip install .

I finally tried the script:

python tortoise_tts.py --preset fast --ar_checkpoint "maypath/55_gpt.pth" "Welcome to the software engineering courses channel."

It took very long despite my having a good video card, and then I got this:

File "C:\Users\user\miniconda3\envs\env3109\lib\site-packages\voicefixer\tools\pytorch_util.py", line 8, in check_cuda_availability
    raise RuntimeError("Error: You set cuda=True but no cuda device found.")
RuntimeError: Error: You set cuda=True but no cuda device found.
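A "no cuda device found" error like this often comes down to which torch build is installed in the active environment. The following small check is a sketch that relies only on the standard `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()` attributes; the function name and the report layout are invented for illustration.

```python
import importlib.util


def cuda_report():
    """Summarize whether the installed torch build can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch  # imported lazily so the check also runs where torch is absent

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


print(cuda_report())
```

If `built_with_cuda` is None, the environment has a CPU-only build of torch, and no runtime setting will make a GPU appear until a CUDA build is installed.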

This was tiring. I am still on this tutorial and haven't finished it, and I am still having problems and errors.

I think the problem might have been with the torch installation, but your line of code (pip3 install torch==1.13.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117) does not work INSIDE conda. Do you know what the right line of code would be inside conda? As I said, I tried several alternatives, but they did not work, such as:

conda install torch==1.13.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
conda install torch==1.13.1 torchvision torchaudio
conda install pytorch torchvision torchaudio pytorch-cuda=1.13.1 -c pytorch -c nvidia

None worked, so I had to go for a more recent version of torch, which might have caused all these problems. @FurkanGozukara, what do you think? Any quick suggestion in mind?
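Two details may explain the conda errors above: `--index-url` is a pip option, not a conda one, and the conda package is named `pytorch`, not `torch`. Since pip can also be used inside an activated conda environment, one option is to keep the original pip line and bind it to the environment's own interpreter. The sketch below only builds and prints that command; the helper name is made up for illustration.

```python
import sys


def pip_install_command(packages, index_url=None):
    """Build a pip command bound to the currently running interpreter."""
    # Using "python -m pip" ensures pip installs into this exact environment.
    command = [sys.executable, "-m", "pip", "install", *packages]
    if index_url:
        command += ["--index-url", index_url]
    return command


command = pip_install_command(
    ["torch==1.13.1", "torchvision", "torchaudio"],
    index_url="https://download.pytorch.org/whl/cu117",
)
print(" ".join(command))
# To actually install, pass the list to subprocess.run(command, check=True).
```

Running this from inside env3109 prints a command that points at that environment's python.exe, so the packages land there and not in the base install.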

FurkanGozukara commented 1 year ago

i hate conda

i prefer python

you can have multiple pythons i have a video for that

please watch this video for venvs

https://youtu.be/B5U7LJOvH6g

AIhasArrived commented 1 year ago

Thank you! Finally it worked. It was so tiring to be blocked for 2-3 days! I wonder how much time it took you to put this together?! Respect. Also, I appreciate your answers. Thanks again, you are doing good deeds.

FurkanGozukara commented 1 year ago

it took more than a week :D

i hope you consider supporting me on patreon

AIhasArrived commented 1 year ago

OMG! A week, it must have been HELL lol. I actually support you in ways you are not aware of. That's all I can say for now (I talk about you to others). I have another question about the comments under the video. Someone said this: ...

About DLAS being spyware? I don't understand, and it got me worried a bit. Why do you guys think it looks like spyware? I don't get it.

FurkanGozukara commented 1 year ago

i think he means how the gui looks :) you can see the entire code, nothing is spyware

devilismyfriend / ozen-toolkit

Very detailed tutorial up-to-date #8