Python script to convert NNSVS DBs to DiffSinger without needing the NNSVS Python Library.
This is a Python script for converting singing databases with HTS mono labels into the DiffSinger format. This Python script deals with sample segmentation to mostly ensure that samples have a maximum given length and a maximum amount of pauses in between. The recognized silence phonemes by this script are sil
, pau
, and SP
. AP
and br
are special symbols considered as breaths. All silences turn into SP
and all breaths turn into AP
. It also assumes the labels have silences labeled at the start and end.
This Python script only requires six external libraries to work, unlike the existing one which needs the NNSVS library, which might be hard to install for most people.
numpy
, scipy
, soundfile
, parselmouth
, pyworld
, and librosa
libraries through pip like so:
pip install numpy scipy soundfile librosa praat-parselmouth pyworld
cmd
on the address bar in File Explorer. It will open command prompt with the current working directory as the one opened in File Explorer.db_converter.bat
(similar to basic conversion).cmd
on the address bar in File Explorer. It will open command prompt with the current working directory as the one opened in File Explorer.Anything within the square brackets is optional. Read more about them in the help text.
These example commands assume that your terminal's current working directory is where db_converter.py
or db_converter.bat
is in. If you're using the embeddable version, replace python db_converter.py
to db_converter.bat
.
Tip: You can drag and drop a file or a folder over the terminal window to automatically add the path in the arguments.
If you want to use MakeDiffSinger still to do all the extra variance data needed.
Requirements: NNSVS-style Database (.wav and .lab only)
python db_converter.py [-l max_length -s max_silences -w htk|aud] path/to/nnsvs/db
If you want to use a DiffSinger variance model for timing prediction only.
Requirements: NNSVS-style Database (.wav and .lab only), Language Definition
python db_converter.py [-l max_length -s max_silences -w htk|aud] -L path/to/language-def.json path/to/nnsvs/db
If you want to use a DiffSinger variance model for timing and pitch prediciton.
Requirements: NNSVS-style Database (.wav and .lab only), Language Definition
python db_converter.py [-l max_length -s max_silences -w htk|aud] -L path/to/language-def.json -m [-c] path/to/nnsvs/db
If you want to use a DiffSinger variance model for timing and pitch prediciton.
Requirements: NNSVS-style Database (.wav and .lab only), Language Definition
python db_converter.py [-l max_length -s max_silences -w htk|aud] -L path/to/language-def.json -mD [-c] path/to/nnsvs/db
Language definition is a .json
file containing the vowels and the "liquids" of the phoneme system the labels are on. This .json
file will then be passed on through the --language-def
argument. This is used for DiffSinger variance duration support. Vowels are where the splits will be made. Liquids are for special cases where a consonant is wanted to be the first part of the split. The liquids will be pushed if the phoneme before it is a specified consonant in the given array or any consonant if it is set to true
. Here's a quick example .json
file and a sample split.
{
"vowels" : ["a", "i", "u", "e", "o", "N", "A", "I", "U", "E", "O"], // All vowels in the phoneme set
"liquids" : {
"w" : ["k", "g"], // Only pushes timing for these consonants
"y" : true // Pushes timing for all consonants
}
}
ph_seq | SP | k | a | a | k | a | w | a | k | w | a | b | w | a |...
ph_num | 2 | 1 | 2 | 2 | 2 | 4 | ...
nnsvs-db-converter
can estimate MIDI information using the results of the Language Definition for note timing and pitch detection for the pitch information. This is used for DiffSinger variance pitch support. Read more about how to use it in the help text.
As of 01/16/2024, you now have the ability to choose which pitch estimator to use for MIDI estimation. Only parselmouth
and harvest
are available, which is Praat's To Pitch (ac) function and WORLD's Harvest algorithm respectively. A new argument --pitch-extractor
has been added for this.
Note: Harvest is very CPU-intensive and might be a bit more dangerous to use multithreading on compared to parselmouth.
As of 12/09/2023, nnsvs-db-converter
now supports automatic breath detection, using a similar algorithm from MakeDiffSinger's Acoustic Auto Aligner. It is modified to better capture the breaths, although it may mislabel them still.
A new argument --detect-breaths
or -B
is now added for this, with other arguments to further control the detection. Read more about them in the help text.
As of 12/09/2023, nnsvs-db-converter
now supports multiprocessing to split the workload throughout your computer's cores. Python multiprocessing is known for hogging up CPU, so please use with caution.
A new argument --num-processes
or -T
is added for this. This sets the number of processes used. Setting it to 0 uses all of your cores. Be careful when using this option
As of 12/30/2023, nnsvs-db-converter
now adds a way to check for the sampling rates of each audio file. It will automatically change the sampling rate to the one specified if this does not match. By default, it will make sure all audio files are 44100 Hz. This setting can be turned off by passing -r 0
.
nnsvs-db-converter
supports exporting .ds
files and label files. .ds
files can be exported for usage with SlurCutter, and label files can be exported for checking the segmented labels or generally for segmenting an NNSVS database. You can export in Audacity format or HTK format. Read more about how to use this in the help text.
Note: Labels and DiffSinger phoneme transcriptions might be slightly different because of how phoneme replacement is only done for the DiffSinger phoneme transcriptions and not the regular labels.
usage: db_converter.py [-h] [--num-processes int] [--debug] [--max-length float]
[--max-length-relaxation-factor float] [--max-silences int] [--audio-sample-rate int]
[--language-def path] [--estimate-midi] [--use-cents] [--pitch-extractor parselmouth | harvest]
[--time-step float] [--f0-min float] [--f0-max float] [--voicing-threshold-midi float]
[--detect-breaths] [--voicing-threshold-breath float] [--breath-window-size float]
[--breath-min-length float] [--breath-db-threshold float] [--breath-centroid-threshold float]
[--write-ds] [--write-labels htk | aud]
path
Converts a database with mono labels (NNSVS Format) into the DiffSinger format and saves it in a new folder in the
path supplemented.
positional arguments:
path The path of the folder of the database.
options:
-h, --help show this help message and exit
--num-processes int, -T int
Number of processes to run for faster segmentation. Enter 0 to use all cores. (default: 1)
--debug, -d Show debug logs. (default: False)
segmentation options:
Options related to segmentation.
--max-length float, -l float
The maximum length of the samples in seconds. (default: 15)
--max-length-relaxation-factor float, -R float
This length in seconds will be continuously added to the maximum length for segments that are
too long for the maximum length to cut. (default: 0.1)
--max-silences int, -s int
The maximum amount of silences (pau) in the middle of each segment. Set to a big amount to
maximize segment lengths. (default: 0)
--audio-sample-rate int, -r int
The sampling rate in Hz to put the audio files in. If the sampling rates do not match it will
be converted to the specified sampling rate. Enter 0 to ignore sample rates. (default: 44100)
MIDI estimation options:
Options related to MIDI Estimation. MIDI estimation requires a language definition.
--language-def path, -L path
The path of the language definition .json file. If present, phoneme numbers will be added.
(default: None)
--estimate-midi, -m Whether to estimate MIDI or not. Only works if a language definition is added for note
splitting. (default: False)
--use-cents, -c Add cent offsets for MIDI estimation. (default: False)
--pitch-extractor parselmouth | harvest, -p parselmouth | harvest
Pitch extractor used for MIDI estimation. Only parselmouth reads voicing-threshold-midi.
(default: parselmouth)
--time-step float, -t float
The time step used for all frame-by-frame analysis functions. (default: 0.005)
--f0-min float, -f float
The minimum F0 to detect in Hz. Used in MIDI estimation and breath detection. (default: 40)
--f0-max float, -F float
The maximum F0 to detect in Hz. Used in MIDI estimation and breath detection. (default: 1100)
--voicing-threshold-midi float, -V float
The voicing threshold used for MIDI estimation. (default: 0.45)
breath detection options:
Options for breath detection. Enabled with --detect-breaths.
--detect-breaths, -B Detect breaths within all pauses. (default: False)
--voicing-threshold-breath float, -v float
The voicing threshold used for breath detection. (default: 0.6)
--breath-window-size float, -W float
The size of the window in seconds for breath detection. (default: 0.05)
--breath-min-length float, -b float
The minimum length of a breath in seconds. (default: 0.1)
--breath-db-threshold float, -e float
The threshold in the RMS of the signal in dB to detect a breath. (default: -60)
--breath-centroid-threshold float, -C float
The threshold in the spectral centroid of the signal in Hz to detect a breath. (default: 2000)
output options:
Options related to output DiffSinger database.
--write-ds, -D Write .ds files for usage with SlurCutter or for preprocessing. (default: False)
--write-labels htk | aud, -w htk | aud
Write labels if you want to check segmentation labels. "htk" gives HTK style labels, "aud"
gives Audacity style labels. (default: None)