facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Errors running prepare_text.sh (and other preprocessing) from wav2vec-u in fresh environment #3591

Closed cdleong closed 2 years ago

cdleong commented 3 years ago

My Question:

How can I get prepare_text.sh running correctly in a fresh Ubuntu Jupyterlab environment? What needs to be installed, what variables set, etc.?

I've run into various issues attempting to run the script prepare_text.sh, from https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/unsupervised/scripts/prepare_text.sh.

Right now, I'm stuck on preprocess.py: error: unrecognized arguments: --dict-only, but I've run into some other errors that I've had to work around, detailed below.

Full current output:

After getting through all the other issues I detail below, currently this is what I see when I attempt to run the script.

I cloned the https://github.com/pytorch/fairseq.git repo, and navigated to the scripts folder: https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised/scripts before running this.

(wav2vecu_pre) jovyan@user-ofmghcmafhv-jtfbeefyexclusive-0:~/work/fairseq/examples/wav2vec/unsupervised/scripts$ zsh prepare_text.sh sw /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
sw
sw
/home/jovyan/work/WikiDumps/wiki_sw_head.txt
/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
usage: preprocess.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format LOG_FORMAT] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
                     [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                     [--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                     [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
                     [--criterion {masked_lm,nat_loss,sentence_ranking,ctc,composite_loss,cross_entropy,legacy_masked_lm_loss,sentence_prediction,adaptive_loss,label_smoothed_cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                     [--tokenizer {moses,nltk,space}] [--bpe {sentencepiece,bytes,characters,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,bert}]
                     [--optimizer {adam,adamax,adagrad,adafactor,adadelta,lamb,sgd,nag}]
                     [--lr-scheduler {triangular,fixed,reduce_lr_on_plateau,cosine,polynomial_decay,tri_stage,inverse_sqrt}] [--scoring {sacrebleu,bleu,wer,chrf}]
                     [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP] [--destdir DIR] [--thresholdtgt N]
                     [--thresholdsrc N] [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT] [--joined-dictionary]
                     [--only-source] [--padding-factor N] [--workers N]
preprocess.py: error: unrecognized arguments: --dict-only
cut: /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/dict.txt: No such file or directory
fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file
fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file
one is 
sed: can't read /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones.txt: No such file or directory
paste: /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones.txt: No such file or directory
usage: preprocess.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format LOG_FORMAT] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
                     [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                     [--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                     [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
                     [--criterion {masked_lm,nat_loss,sentence_ranking,ctc,composite_loss,cross_entropy,legacy_masked_lm_loss,sentence_prediction,adaptive_loss,label_smoothed_cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                     [--tokenizer {moses,nltk,space}] [--bpe {sentencepiece,bytes,characters,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,bert}]
                     [--optimizer {adam,adamax,adagrad,adafactor,adadelta,lamb,sgd,nag}]
                     [--lr-scheduler {triangular,fixed,reduce_lr_on_plateau,cosine,polynomial_decay,tri_stage,inverse_sqrt}] [--scoring {sacrebleu,bleu,wer,chrf}]
                     [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP] [--destdir DIR] [--thresholdtgt N]
                     [--thresholdsrc N] [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT] [--joined-dictionary]
                     [--only-source] [--padding-factor N] [--workers N]
preprocess.py: error: unrecognized arguments: --dict-only
2021-06-03 16:39:42 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones/lm.phones.filtered.txt', validpref=None, testpref=None, align_suffix=None, destdir='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones/dict.phn.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=8, workers=70)
Traceback (most recent call last):
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 401, in <module>
    cli_main()
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 397, in cli_main
    main(args)
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 98, in main
    src_dict = task.load_dictionary(args.srcdict)
  File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/tasks/fairseq_task.py", line 54, in load_dictionary
    return Dictionary.load(filename)
  File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 214, in load
    d.add_from_file(f)
  File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 225, in add_from_file
    self.add_from_file(fd)
  File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 249, in add_from_file
    raise RuntimeError(
RuntimeError: Duplicate word found when loading Dictionary: '<SIL>'. Duplicate words can overwrite earlier ones by adding the #fairseq:overwrite flag at the end of the corresponding row in the dictionary file. If using the Camembert model, please download an updated copy of the model file.
prepare_text.sh:49: command not found: lmplz
prepare_text.sh:50: command not found: build_binary
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/scripts/examples/speech_recognition/kaldi/kaldi_initializer.py': [Errno 2] No such file or directory
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/scripts/examples/speech_recognition/kaldi/kaldi_initializer.py': [Errno 2] No such file or directory
prepare_text.sh:54: command not found: lmplz
prepare_text.sh:55: command not found: build_binary
prepare_text.sh:56: command not found: lmplz
prepare_text.sh:57: command not found: build_binary
Primary config directory not found.
Check that the config directory '/home/jovyan/work/fairseq/examples/speech_recognition/kaldi/config' exists and readable

Fixed (?) Problem: Can't seem to run it from the same folder as the README (workaround: run from scripts folder)

First, I can't run it from the folder that the README at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised#preparation-of-speech-and-text-data says to run it from. If you try, you get errors such as paths to the other scripts not being found.

zsh scripts/prepare_text.sh sw /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
sw
sw
/home/jovyan/work/WikiDumps/wiki_sw_head.txt
/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/normalize_and_filter_text.py': [Errno 2] No such file or directory

Fixed (?) Problem: "ValueError: lid.187.bin cannot be opened for loading!" (workaround: use lid.176.bin instead)

Solution: download a different language ID model, and edit the code to use it.

https://fasttext.cc/docs/en/language-identification.html has a different model, lid.176.bin

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

and edit this portion of normalize_and_filter_text.py:

    parser.add_argument(
        "--fasttext-model",
        help="path to fasttext model",
        default="lid.176.bin",
    )
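
A quick sanity check that the replacement model actually loads (a sketch, assuming the fasttext pip package; the sample sentence is arbitrary Swahili):

python -c "import fasttext; m = fasttext.load_model('lid.176.bin'); print(m.predict('habari ya leo'))"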

Fixed (?) Problem: dependencies needed (phonemizer, fasttext, fairseq)

The script does not list which dependencies are needed. So far I've determined that phonemizer, fasttext are needed, and I think fairseq too. Any more I'm missing?

Fixed (?) Problem: can't find files in fairseq_cli (solution: you need to set an environment variable, FAIRSEQ_ROOT).

I set this to point to the top level of the cloned repo. Not sure if that's right.

(I cloned the repo to ~/work/fairseq/)

export FAIRSEQ_ROOT=~/work/fairseq/

Fixed (?) Problem: Not sure what language code to use. (guessed sw)

I've got Swahili data. Not sure whether to use sw, swahili, or something else; I assume I should pick from https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md

Code

Here's the command I use to invoke the script. Other than editing the default langid model, I haven't edited anything else in the repo, so it should be the same as https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised/scripts. git log shows c47a9b2eef0f41b0564c8daf52cb82ea97fc6548 as the commit.

zsh prepare_text.sh language /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out

What have you tried?

What's your environment?

I'm in a Jupyterlab in a Docker container, running Ubuntu.

OS is Ubuntu 20.04.2:

cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

pip list:

pip list
Package                Version
---------------------- -------------------
antlr4-python3-runtime 4.8
attrs                  21.2.0
certifi                2021.5.30
cffi                   1.14.5
clldutils              3.9.0
colorlog               5.0.1
csvw                   1.11.0
Cython                 0.29.23
dataclasses            0.6
editdistance           0.5.3
fairseq                0.10.0
fasttext               0.9.2
hydra-core             1.0.6
isodate                0.6.0
joblib                 1.0.1
numpy                  1.20.3
omegaconf              2.0.6
phonemizer             2.2.2
pip                    21.1.2
portalocker            2.0.0
pybind11               2.6.2
pycparser              2.20
python-dateutil        2.8.1
PyYAML                 5.4.1
regex                  2021.4.4
rfc3986                1.5.0
sacrebleu              1.5.1
segments               2.2.0
setuptools             49.6.0.post20210108
six                    1.16.0
tabulate               0.8.9
torch                  1.8.1
tqdm                   4.61.0
typing-extensions      3.10.0.0
uritemplate            3.0.1
wheel                  0.36.2

conda list:

conda list
# packages in environment at /opt/conda/envs/wav2vecu_pre:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
antlr4-python3-runtime    4.8                      pypi_0    pypi
attrs                     21.2.0                   pypi_0    pypi
ca-certificates           2021.5.30            ha878542_0    conda-forge
certifi                   2021.5.30        py39hf3d152e_0    conda-forge
cffi                      1.14.5                   pypi_0    pypi
clldutils                 3.9.0                    pypi_0    pypi
colorlog                  5.0.1                    pypi_0    pypi
csvw                      1.11.0                   pypi_0    pypi
cython                    0.29.23                  pypi_0    pypi
dataclasses               0.6                      pypi_0    pypi
editdistance              0.5.3                    pypi_0    pypi
fairseq                   0.10.0                   pypi_0    pypi
fasttext                  0.9.2                    pypi_0    pypi
hydra-core                1.0.6                    pypi_0    pypi
isodate                   0.6.0                    pypi_0    pypi
joblib                    1.0.1                    pypi_0    pypi
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_19    conda-forge
libgomp                   9.3.0               h2828fa1_19    conda-forge
libstdcxx-ng              9.3.0               h6de172a_19    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
numpy                     1.20.3                   pypi_0    pypi
omegaconf                 2.0.6                    pypi_0    pypi
openssl                   1.1.1k               h7f98852_0    conda-forge
phonemizer                2.2.2                    pypi_0    pypi
pip                       21.1.2             pyhd8ed1ab_0    conda-forge
portalocker               2.0.0                    pypi_0    pypi
pybind11                  2.6.2                    pypi_0    pypi
pycparser                 2.20                     pypi_0    pypi
python                    3.9.4           hffdb5ce_0_cpython    conda-forge
python-dateutil           2.8.1                    pypi_0    pypi
python_abi                3.9                      1_cp39    conda-forge
pyyaml                    5.4.1                    pypi_0    pypi
readline                  8.1                  h46c0cb4_0    conda-forge
regex                     2021.4.4                 pypi_0    pypi
rfc3986                   1.5.0                    pypi_0    pypi
sacrebleu                 1.5.1                    pypi_0    pypi
segments                  2.2.0                    pypi_0    pypi
setuptools                49.6.0           py39hf3d152e_3    conda-forge
six                       1.16.0                   pypi_0    pypi
sqlite                    3.35.5               h74cdb3f_0    conda-forge
tabulate                  0.8.9                    pypi_0    pypi
tk                        8.6.10               h21135ba_1    conda-forge
torch                     1.8.1                    pypi_0    pypi
tqdm                      4.61.0                   pypi_0    pypi
typing-extensions         3.10.0.0                 pypi_0    pypi
tzdata                    2021a                he74cb21_0    conda-forge
uritemplate               3.0.1                    pypi_0    pypi
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge

I also apt-installed phonemizer dependencies:

sudo apt-get install festival espeak-ng mbrola

And finally, here's what I get from apt list | grep installed: apt-list.txt

jimregan commented 3 years ago

You have to install fairseq from git for a version of preprocess.py with --dict-only

Also:

prepare_text.sh:54: command not found: lmplz
prepare_text.sh:55: command not found: build_binary

You're missing kenlm

cdleong commented 3 years ago

OK, that first bit worked fine:

pip uninstall fairseq
# (navigate to the top level of the repo)
pip install --editable ./

But now I'm not sure how to install kenlm. I tried pip install https://github.com/kpu/kenlm/archive/master.zip, but the error persists. I will try cloning and building the repo with make, etc.

jimregan commented 3 years ago

Yeah; you'll need cmake for that. Also:

apt-get -y install libeigen3-dev liblzma-dev zlib1g-dev libbz2-dev

I've been running make, then setup.py, but I'm not sure if that's strictly necessary (maybe just running setup.py is enough)

cdleong commented 3 years ago

Followed instructions at https://github.com/kpu/kenlm/blob/master/BUILDING to install dependencies for kenlm. What they don't mention is that you need to take the resulting binaries from kenlm/build/bin/ and copy them to /usr/bin
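
Roughly this sequence (a sketch based on those BUILDING instructions; adjust the -j value and install location to taste):

git clone https://github.com/kpu/kenlm.git
cd kenlm && mkdir -p build && cd build
cmake ..
make -j 4
# prepare_text.sh calls lmplz and build_binary as bare commands, so put them on the PATH
sudo cp bin/lmplz bin/build_binary /usr/bin/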

cdleong commented 3 years ago

That seems to have fixed those errors; now I'm trying to figure out fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file

Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2021-06-03 18:37:41 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, log_file=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', simul_type=None, scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared/lm.upper.lid.txt', validpref=None, testpref=None, align_suffix=None, destdir='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared', thresholdtgt=0, thresholdsrc=2, tgtdict=None, srcdict=None, nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=1, workers=1, dict_only=True)
fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file
fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file
one is 
cdleong commented 3 years ago

Above error seems to be coming from

sed 's/$/ 1/' $target_dir/words.txt | PHONEMIZER_ESPEAK_PATH=$(which espeak) phonemize -o $target_dir/phones.txt -p ' ' -w '' -l $ph_lg -j 70 --language-switch remove-flags
cdleong commented 3 years ago

If you replace each instance of which espeak with which espeak-ng, that fixes that
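
For example (a one-off sketch, assuming GNU sed; -i.bak keeps a backup of the original script):

sed -i.bak 's/$(which espeak)/$(which espeak-ng)/g' prepare_text.sh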

cdleong commented 3 years ago

Had some errors with kaldi_initializer. Lines 51 and 52 of the script lacked $FAIRSEQ_ROOT/

jimregan commented 3 years ago

Followed instructions at https://github.com/kpu/kenlm/blob/master/BUILDING to install dependencies for kenlm. What they don't mention is that you need to take the resulting binaries from kenlm/build/bin/ and copy them to /usr/bin

Oh yeah. kenlm was intended to be embedded in other projects, so it doesn't have any sort of install mechanism

JeromeNi commented 3 years ago

Sorry to "hijack" this issue, but it seems that the kaldi_initializer shows the following issue for the two lines after building the word lm:

Traceback (most recent call last):
  File "/nobackup/users/junruin2/fairseq//examples/speech_recognition/kaldi/kaldi_initializer.py", line 677, in cli_main
    initalize_kaldi(cfg)
  File "/nobackup/users/junruin2/fairseq//examples/speech_recognition/kaldi/kaldi_initializer.py", line 616, in initalize_kaldi
    cfg.out_labels = cfg.in_labels
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: in_labels
    full_key: in_labels
    reference_type=Optional[Dict[Union[str, Enum], Any]]
    object_type=dict

What argument should be passed to it? My guess was that it should be "phn" for in_labels and "wrd" for out_labels, as the code seems to be building the HCLG graph for later decoding. However, I don't see where kaldi_initializer is later used.

jimregan commented 3 years ago

in_labels is phn for this script, out_labels is copied from in_labels if omitted
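
For reference, an invocation consistent with that might look like this (a sketch; kaldi_root is a placeholder, and the other overrides mirror what appears later in this thread):

python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py kaldi_root=/path/to/kaldi in_labels=phn fst_dir=$target_dir/fst/phn_to_words lm_arpa=$target_dir/kenlm.wrd.o40003.arpa wav2letter_lexicon=$target_dir/lexicon_filtered.lst data_dir=$target_dir/phones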

cdleong commented 3 years ago

I've got most of the errors figured out, only one left:

Traceback (most recent call last):
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 401, in <module>
    cli_main()
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 397, in cli_main
    main(args)
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 287, in main
    make_all(args.source_lang, src_dict)
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 255, in make_all
    make_dataset(vocab, args.trainpref, "train", lang, num_workers=args.workers)
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 251, in make_dataset
    make_binary_dataset(vocab, input_prefix, output_prefix, lang, num_workers)
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 184, in make_binary_dataset
    100 * sum(replaced.values()) / n_seq_tok[1],
ZeroDivisionError: division by zero

Which comes from this line

python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $target_dir/phones/lm.phones.filtered.txt --workers 70 --only-source --destdir $target_dir/phones --srcdict $target_dir/phones/dict.phn.txt

jimregan commented 3 years ago

Try adding --thresholdsrc 2 or something similarly low; I didn't use this for making the phone lm data, but setting the threshold was needed for later calls to preprocess.py
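
i.e., something like this (a sketch: the call from the script with the threshold flag added):

python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $target_dir/phones/lm.phones.filtered.txt --workers 70 --only-source --destdir $target_dir/phones --srcdict $target_dir/phones/dict.phn.txt --thresholdsrc 2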

cdleong commented 3 years ago

I'll give it a try.

I note with interest that some of the files in the output dir are empty, particularly lexicon_filtered.lst

(base) jovyan@user-ofmghcmafhv-jtfbeefyexclusive-0:~/work/WikiDumps$ ll wiki_sw_head_wav2vecu_prepared/
total 72
drwxr-sr-x 3 jovyan users  4096 Jun  3 20:41 ./
drwxr-sr-x 4 jovyan users  4096 Jun  3 20:41 ../
-rw-r--r-- 1 jovyan users  4503 Jun  3 20:41 dict.txt
-rw-r--r-- 1 jovyan users     0 Jun  3 20:41 kenlm.wrd.o40003.arpa
-rw-r--r-- 1 jovyan users     0 Jun  3 20:41 lexicon_filtered.lst
-rw-r--r-- 1 jovyan users 10256 Jun  3 20:41 lexicon.lst
-rw-r--r-- 1 jovyan users 24572 Jun  3 20:41 lm.upper.lid.txt
drwxr-sr-x 2 jovyan users  4096 Jun  3 20:41 phones/
-rw-r--r-- 1 jovyan users  6773 Jun  3 20:41 phones.txt
-rw-r--r-- 1 jovyan users  1317 Jun  3 20:41 preprocess.log
-rw-r--r-- 1 jovyan users  3483 Jun  3 20:41 words.txt
(base) jovyan@user-ofmghcmafhv-jtfbeefyexclusive-0:~/work/WikiDumps$ ll wiki_sw_head_wav2vecu_prepared/phones
total 24
drwxr-sr-x 2 jovyan users 4096 Jun  3 20:42 ./
drwxr-sr-x 3 jovyan users 4096 Jun  3 20:42 ../
-rw-r--r-- 1 jovyan users    8 Jun  3 20:41 dict.phn.txt
-rw-r--r-- 1 jovyan users    8 Jun  3 20:41 dict.txt
-rw-r--r-- 1 jovyan users    0 Jun  3 20:42 lm.phones.filtered.04.arpa
-rw-r--r-- 1 jovyan users    0 Jun  3 20:41 lm.phones.filtered.txt
-rw-r--r-- 1 jovyan users 2763 Jun  3 20:41 preprocess.log
-rw-r--r-- 1 jovyan users    0 Jun  3 20:41 train.bin
-rw-r--r-- 1 jovyan users   26 Jun  3 20:41 train.idx

cdleong commented 3 years ago

Perhaps some input file to that line is empty when it shouldn't be?

jimregan commented 3 years ago

Something went wrong, there should be output in those files

cdleong commented 3 years ago

Specifically, the divide-by-zero is happening on this line, which means something went wrong before then, I think: https://github.com/pytorch/fairseq/blob/c47a9b2eef0f41b0564c8daf52cb82ea97fc6548/examples/wav2vec/unsupervised/scripts/prepare_text.sh#L44

jimregan commented 3 years ago

Specifically, the divide-by-zero is happening on this line, which means something went wrong before then, I think:

Nah, I was getting the same thing and the input was fine. The same divide-by-zero later on was happening because the threshold was set too high; I'm going to assume the same is happening here.

cdleong commented 3 years ago

Added the threshold, still getting divide-by-zero.

I'm still curious why lexicon_filtered.lst is empty. python filter_lexicon.py -d $target_dir/phones/dict.txt < $target_dir/lexicon.lst >! $target_dir/lexicon_filtered.lst is the line that creates lexicon_filtered.lst.

Hmmmmmm...
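
A quick way to inspect the inputs to that line (a sketch, assuming standard coreutils):

wc -l $target_dir/lexicon.lst $target_dir/phones/dict.txt
head -3 $target_dir/lexicon.lst    # each line should be word<TAB>phones, since lexicon.lst comes from paste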

cdleong commented 3 years ago

Specifically I added the threshold to line 44, btw

cdleong commented 3 years ago

I note that https://github.com/pytorch/fairseq/blob/c47a9b2eef0f41b0564c8daf52cb82ea97fc6548/examples/wav2vec/unsupervised/scripts/prepare_text.sh#L38 already has a threshold param, perhaps that's where I was supposed to change it.

cdleong commented 3 years ago

OK, even with threshold 2 on every instance of preprocess.py, I'm still getting the divide-by-zero.

cdleong commented 3 years ago

Also, I missed this, but I'm getting this error as well:

Primary config directory not found.
Check that the config directory '/home/jovyan/work/fairseq/examples/speech_recognition/kaldi/config' exists and readable

JeromeNi commented 3 years ago

I created an empty config directory there to bypass the test and ran with:

lg=$lg python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py kaldi_root=/path/to/kaldi in_labels=phn fst_dir=$target_dir/fst/phn_to_words_sil lm_arpa=$target_dir/kenlm.wrd.o40003.arpa wav2letter_lexicon=$target_dir/lexicon_filtered.lst data_dir=$target_dir/phones "blank_symbol='<SIL>'"

lg=$lg python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py kaldi_root=/path/to/kaldi in_labels=phn fst_dir=$target_dir/fst/phn_to_words lm_arpa=$target_dir/kenlm.wrd.o40003.arpa wav2letter_lexicon=$target_dir/lexicon_filtered.lst data_dir=$target_dir/phones

jimregan commented 3 years ago

I tried actually creating the yaml for it, and it ignores it completely 🤷

alexeib commented 3 years ago

i am working on more comprehensive instructions on how to run the pipeline - should have something by next week - stay tuned. meanwhile i can answer questions here if need be

JeromeNi commented 3 years ago

I think I celebrated too early; when running kaldi_initializer it seems the two lines just quit in the middle, before finishing building the final decoding graph. The text corpus was the LibriSpeech LM corpus (https://www.openslr.org/11; using librispeech-lm-norm.txt.gz as-is before feeding it to prepare_text.sh). For the first line:

[2021-06-03 15:55:15,071][__main__][INFO] - Creating /nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/librispeech_files/unpaired_text/fst/phn_to_words_sil/LG.phn.kenlm.wrd.o40003.fst
[2021-06-03 18:19:26,628][__main__][ERROR] - cmd: [PosixPath('/nobackup/users/junruin2/kaldi/src/fstbin/fstpushspecial')], err: /nobackup/users/junruin2/kaldi/src/fstbin/fstpushspecial
Traceback (most recent call last):
  File "/nobackup/users/junruin2/fairseq//examples/speech_recognition/kaldi/kaldi_initializer.py", line 677, in cli_main
    initalize_kaldi(cfg)
  File "/nobackup/users/junruin2/fairseq//examples/speech_recognition/kaldi/kaldi_initializer.py", line 657, in initalize_kaldi
    kaldi_root, fst_dir, unique_label, lexicon_graph, grammar_graph
  File "/nobackup/users/junruin2/fairseq//examples/speech_recognition/kaldi/kaldi_initializer.py", line 273, in create_LG
    check=True,
  File "/nobackup/users/junruin2/anaconda3/envs/espnet-pt1.7.1/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '[PosixPath('/nobackup/users/junruin2/kaldi/src/fstbin/fstpushspecial')]' died with <Signals.SIGKILL: 9>.

For the second line:

[2021-06-03 21:57:41,011][__main__][INFO] - Creating /nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/librispeech_files/unpaired_text/fst/phn_to_words/HLGa.phn.kenlm.wrd.o40003.fst
[2021-06-03 22:29:25,118][__main__][ERROR] - cmd: [PosixPath('/nobackup/users/junruin2/kaldi/src/fstbin/fsttablecompose'), PosixPath('/nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/librispeech_files/unpaired_text/fst/phn_to_words/H.phn.fst'), PosixPath('/nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/librispeech_files/unpaired_text/fst/phn_to_words/LG.phn.kenlm.wrd.o40003.fst')], err: /nobackup/users/junruin2/kaldi/src/fstbin/fsttablecompose /nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/librispeech_files/unpaired_text/fst/phn_to_words/H.phn.fst /nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/librispeech_files/unpaired_text/fst/phn_to_words/LG.phn.kenlm.wrd.o40003.fst
Traceback (most recent call last):
  File "/nobackup/users/junruin2/fairseq//examples/speech_recognition/kaldi/kaldi_initializer.py", line 677, in cli_main
    initalize_kaldi(cfg)
  File "/nobackup/users/junruin2/fairseq//examples/speech_recognition/kaldi/kaldi_initializer.py", line 660, in initalize_kaldi
    kaldi_root, fst_dir, unique_label, h_graph, lg_graph, disambig_in_units_file_int
  File "/nobackup/users/junruin2/fairseq//examples/speech_recognition/kaldi/kaldi_initializer.py", line 458, in create_HLGa
    check=True,
  File "/nobackup/users/junruin2/anaconda3/envs/espnet-pt1.7.1/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '[PosixPath('/nobackup/users/junruin2/kaldi/src/fstbin/fsttablecompose'), PosixPath('/nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/librispeech_files/unpaired_text/fst/phn_to_words/H.phn.fst'), PosixPath('/nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/librispeech_files/unpaired_text/fst/phn_to_words/LG.phn.kenlm.wrd.o40003.fst')]' died with <Signals.SIGKILL: 9>.

Any idea why that would be the case? Also, at which stage are the FSTs built by kaldi_initializer.py used?

Thanks!

alexeib commented 3 years ago

most likely you ran out of cpu memory. try to prune the lm a bit more before building fst with it

cdleong commented 3 years ago

Here's what I have for dependencies so far, @alexeib does it look right to you?

pip

pip install phonemizer fasttext
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

Apt

# phonemizer dependencies
sudo apt-get install festival espeak-ng mbrola

#kenlm dependencies from official website
sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev 

Other:

Gotta build kenlm as shown in the BUILDING instructions linked above, and copy the built binaries to /usr/bin. You need to apt-install the dependencies first.

JeromeNi commented 3 years ago

Sorry for jumping into this issue again, but I assumed the people involved in the discussion here would be interested in at least replicating some results, so here it goes. Has anyone here successfully trained the GAN model on any of the corpora used by the original paper (https://ai.facebook.com/research/publications/unsupervised-speech-recognition/) and achieved somewhere close to the error rate there? If so, what modifications/hyper-parameters have you modified in the currently published code to achieve so? My attempts have been less than successful (https://github.com/pytorch/fairseq/issues/3581) for the past a week and a half, and I just cannot figure out why...

(Edit 06/08/2021: It works on LibriSpeech-100h! It was my mistake for forgetting to check for .tsv and .phn alignment after removing the silences and re-running wav2vec_manifest.py. The issue is still there for TIMIT however.)

JeromeNi commented 3 years ago

@alexeib Hi, when I tried to use the kaldi decoder for wav2vec_generate.py, I found that for some reason the final HLG graphs were not successfully built when running kaldi_initializer.py in prepare_text.sh. I kept getting the following error, which occurred when executing the very last create_HLG() function within kaldi_initializer.py.

In file included from /home/software/spack/gcc/8.3.0-xdjkb2mmftikxvoeeyaxtxcjtpltcgiz/include/c++/8.3.0/random:51,
                 from /nobackup/users/junruin2/pykaldi-py3.7.9/tools/kaldi/tools/openfst-1.6.7/include/fst/randgen.h:14,
                 from /nobackup/users/junruin2/pykaldi-py3.7.9/tools/kaldi/tools/openfst-1.6.7/include/fst/randequivalent.h:15,
                 from /nobackup/users/junruin2/pykaldi-py3.7.9/tools/kaldi/tools/openfst-1.6.7/include/fst/fstlib.h:61,
                 from /nobackup/users/junruin2/pykaldi-py3.7.9/tools/kaldi/src/fstext/fstext-lib.h:22,
                 from /nobackup/users/junruin2/fairseq/examples/speech_recognition/kaldi/add-self-loop-simple.cc:9:
/home/software/spack/gcc/8.3.0-xdjkb2mmftikxvoeeyaxtxcjtpltcgiz/include/c++/8.3.0/bits/random.tcc:2737:5: note: candidate: 'template<class _IntType, class _CharT, class _Traits> std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::discrete_distribution<_IntType>&)'
     operator<<(std::basic_ostream<_CharT, _Traits>& __os,
     ^~~~
/home/software/spack/gcc/8.3.0-xdjkb2mmftikxvoeeyaxtxcjtpltcgiz/include/c++/8.3.0/bits/random.tcc:2737:5: note: template argument deduction/substitution failed:
/nobackup/users/junruin2/fairseq/examples/speech_recognition/kaldi/add-self-loop-simple.cc:91:52: note: 'kaldi::MessageLogger' is not derived from 'std::basic_ostream<_CharT, _Traits>'
     KALDI_LOG << "Writing FST to " << output << std::endl;

The kaldi version I have is from the latest pykaldi compatible fork: https://github.com/pykaldi/kaldi/tree/pykaldi_02

However, the code seems to be running for now, as long as I remove all the lines that stream to KALDI_LOG with an ostream. I really don't think it will affect anything other than logging, though.

Khanifsaleh commented 3 years ago

i am working on more comprehensive instructions on how to run the pipeline - should have something by next week - stay tuned. meanwhile i can answer questions here if need be

is it finished?

alexeib commented 3 years ago

yes, instructions should be good now.

regarding building the binary for adding self loops - i have plans to rewrite this using pykaldi api instead of c++ but it may take some time. meanwhile, you probably need to build the kaldi toolkit in your pykaldi dir (they have a script for that I believe)

alexeib commented 3 years ago

we will have working timit instructions up next week

for librispeech, you can get it to work with as little as 10h of audio, but depending on what you use for text you may need to adjust the 1k threshold when building phone dict

Enescigdem commented 3 years ago

Hello, I have a problem with prepare_text.sh: the lexicon.lst file is not created by prepare_text.sh, yet the script wants to use it. What am I supposed to do about this? Another question of mine is about language: can I fine-tune for my language? @alexeib

alexeib commented 3 years ago

it should be created by this line in the script:

paste $target_dir/words.txt $target_dir/phones.txt >! $target_dir/lexicon.lst

maybe some intermediate step failed?
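
A quick check (a sketch, assuming standard coreutils): paste pairs the two files line by line, so both inputs must exist and be non-empty:

wc -l $target_dir/words.txt $target_dir/phones.txt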

sorry i don't understand the language question - what exactly do you want to finetune?

Enescigdem commented 3 years ago

I saw that the line should work, but it did not, so I created a lexicon.lst manually. Then filtered_lexicon.lst was absent and it failed again. Preparing the text is challenging for me; I don't know the exact reason for this situation. Some intermediate steps may have failed, but I did not get any error about it. Seeing the outputs of a working preparation run would be great. @alexeib I managed to create a filtered lexicon and now I get a zero-division error.

dairm commented 3 years ago

Hello, I can't run the script prepare_text.sh; I get an error:

[2021-06-22 19:44:15,455][__main__][INFO] - Creating /hdd/conda_kaldi/exp_unsup_asr/prepare_text/fst/phn_to_phn_sil/H.phn.fst
[2021-06-22 19:44:15,556][__main__][INFO] - Creating /hdd/conda_kaldi/exp_unsup_asr/prepare_text/fst/phn_to_phn_sil/L.phn.lm.phones.filtered.06.fst (in units: /hdd/conda_kaldi/exp_unsup_asr/prepare_text/fst/phn_to_phn_sil/kaldi_dict.phn_disambig.txt)
[2021-06-22 19:44:15,686][__main__][INFO] - Creating /hdd/conda_kaldi/exp_unsup_asr/prepare_text/fst/phn_to_phn_sil/LG.phn.lm.phones.filtered.06.fst
[2021-06-22 19:44:16,007][__main__][INFO] - Creating /hdd/conda_kaldi/exp_unsup_asr/prepare_text/fst/phn_to_phn_sil/HLGa.phn.lm.phones.filtered.06.fst
[2021-06-22 19:44:16,500][__main__][INFO] - Creating /hdd/conda_kaldi/exp_unsup_asr/prepare_text/fst/phn_to_phn_sil/HLG.phn.lm.phones.filtered.06.fst
[2021-06-22 19:44:17,414][__main__][ERROR] - cmd: [PosixPath('/hdd/conda_kaldi/rnd_ds/fairseq/examples/speech_recognition/kaldi/add-self-loop-simple'), PosixPath('/hdd/conda_kaldi/exp_unsup_asr/prepare_text/fst/phn_to_phn_sil/HLGa.phn.lm.phones.filtered.06.fst'), PosixPath('/hdd/conda_kaldi/exp_unsup_asr/prepare_text/fst/phn_to_phn_sil/HLG.phn.lm.phones.filtered.06.fst')], err: b''
Traceback (most recent call last):
  File "/hdd/conda_kaldi/rnd_ds/fairseq/examples/speech_recognition/kaldi/kaldi_initializer.py", line 677, in cli_main
    initalize_kaldi(cfg)
  File "/hdd/conda_kaldi/rnd_ds/fairseq/examples/speech_recognition/kaldi/kaldi_initializer.py", line 662, in initalize_kaldi
    hlg_graph = create_HLG(kaldi_root, fst_dir, unique_label, hlga_graph)
  File "/hdd/conda_kaldi/rnd_ds/fairseq/examples/speech_recognition/kaldi/kaldi_initializer.py", line 595, in create_HLG
    subprocess.run(
  File "/hdd/conda_kaldi/rnd_ds/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '[PosixPath('/hdd/conda_kaldi/rnd_ds/fairseq/examples/speech_recognition/kaldi/add-self-loop-simple'), PosixPath('/hdd/conda_kaldi/exp_unsup_asr/prepare_text/fst/phn_to_phn_sil/HLGa.phn.lm.phones.filtered.06.fst'), PosixPath('/hdd/conda_kaldi/exp_unsup_asr/prepare_text/fst/phn_to_phn_sil/HLG.phn.lm.phones.filtered.06.fst')]' died with <Signals.SIGSEGV: 11>.

Do you have any idea why? @alexeib

alexeib commented 3 years ago

you probably ran out of memory because your corpus/num of phonemes is too big. you can change the script to prune the lm or create a 4 or 5gram phone lm instead of 6gram, for use in the WFST
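
For example, building a lower-order, pruned phone LM for the WFST might look like this (a sketch using standard kenlm flags; the exact filenames and line in prepare_text.sh may differ):

lmplz -o 4 --prune 0 0 0 3 --discount_fallback < $target_dir/phones/lm.phones.filtered.txt > $target_dir/phones/lm.phones.filtered.04.arpa
build_binary $target_dir/phones/lm.phones.filtered.04.arpa $target_dir/phones/lm.phones.filtered.04.bin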

alexeib commented 3 years ago

I saw that the line should work, but it did not, so I created a lexicon.lst manually. Then filtered_lexicon.lst was absent and it failed again. Preparing the text is challenging for me; I don't know the exact reason for this situation. Some intermediate steps may have failed, but I did not get any error about it. Seeing the outputs of a working preparation run would be great. @alexeib I managed to create a filtered lexicon and now I get a zero-division error.

maybe your corpus is too small and you need to use a smaller phone cutoff threshold? (the example in the readme uses 1000 which is suitable for medium to large corpuses, but not small)

Enescigdem commented 3 years ago

I tried with even 2 as the threshold, but it still gives an error. What should the size of the corpus be? @alexeib

cdleong commented 3 years ago

@alexeib thank you for updating the instructions! They look much improved!

I've finally gotten around to looking at them again, and I ran into a few things I wanted to suggest changes for.

suggestion: add $FAIRSEQ_ROOT to commands

In https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised#preparation-of-speech-and-text-data, there is this code block:

# create a manifest file for the original set of audio files
python $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py /dir/to/save/audio/files --ext wav --dest /path/to/new/train.tsv --valid-percent 0

python scripts/vads.py -r $RVAD_ROOT < /path/to/train.tsv > train.vads

python scripts/remove_silence.py --tsv /path/to/train.tsv --vads train.vads --out /dir/to/save/audio/files

python $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py /dir/to/save/audio/files --ext wav --dest /path/to/new/train.tsv --valid-percent 0.01

I think that the middle two lines could be updated to include $FAIRSEQ_ROOT as well, like so:

python $FAIRSEQ_ROOT/examples/wav2vec/unsupervised/scripts/vads.py -r $RVAD_ROOT < /path/to/train.tsv > train.vads
python $FAIRSEQ_ROOT/examples/wav2vec/unsupervised/scripts/remove_silence.py --tsv /path/to/train.tsv --vads train.vads --out /dir/to/save/audio/files

suggestion: clarify "/path/to/new/train.tsv"

The --dest arg on this command may be misleading. It actually doesn't want the path to a new file; it wants the path to a directory in which it will create the new train.tsv.

python $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py /dir/to/save/audio/files --ext wav --dest /path/to/new/train.tsv --valid-percent 0

When I gave it the path to the directory ./foo/, it created ./foo/train.tsv in that directory.

On the other hand, when I gave it the path ./bar/train.tsv, it created the directory ./bar/train.tsv/ and put train.tsv inside that, so I ended up with ./bar/train.tsv/train.tsv
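
In other words, a sketch of the behavior (./foo is just a stand-in directory):

python $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py /dir/to/save/audio/files --ext wav --dest ./foo --valid-percent 0
# creates ./foo/train.tsv (and, I believe, valid.tsv when --valid-percent > 0)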

suggestion: add quotes around all the variables.

To avoid issues with spaces in paths, it's often helpful to just wrap all bash variables in quotes:

"$FAIRSEQ_ROOT"

I think these might help the instructions be even easier for people to follow along with. Any thoughts?

alexeib commented 3 years ago

those are all great suggestions. if you would like to submit a PR to improve docs, you are most welcome to do so! otherwise i will keep this in mind when i next touch w2v-u code

cdleong commented 3 years ago

Roger! I'm still going through and taking notes, but I think I might be able to contribute some stuff.

Here's a new note: I just discovered that faiss and npy-append-array are also dependencies for the preprocessing. They're used in prepare_audio.sh.
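
So the pip line grows to something like this (a sketch; pick the faiss package matching your machine):

pip install faiss-cpu npy-append-array
# or faiss-gpu on a CUDA machine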

cdleong commented 3 years ago

Another thing I just ran across: IndexError: list index out of range in wav2vec_cluster_faiss.py

I think it's caused by the fact that I was using https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt, which may simply have 12 layers. I edited the script to print out the length of the relevant variables...

            print(f"res length: " + str(len(res["layer_results"])))
            print(f"self.layer: {self.layer}")

and I get output like this:

res length: 12
self.layer: 14
  0%|                                                                                                                                                                    | 0/1497 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/jovyan/fairseq/examples/wav2vec/unsupervised/scripts/wav2vec_cluster_faiss.py", line 219, in <module>
    main()
  File "/home/jovyan/fairseq/examples/wav2vec/unsupervised/scripts/wav2vec_cluster_faiss.py", line 153, in main
    for f in tqdm.tqdm(iterator, total=num):
  File "/opt/conda/envs/wav2vecu/lib/python3.7/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/jovyan/fairseq/examples/wav2vec/unsupervised/scripts/wav2vec_cluster_faiss.py", line 132, in iterate
    feats = reader.get_feats(fname)
  File "/home/jovyan/fairseq/examples/wav2vec/unsupervised/scripts/wav2vec_cluster_faiss.py", line 110, in get_feats
    res_layer = res["layer_results"][self.layer]
IndexError: list index out of range

Gonna try again with the larger model, https://dl.fbaipublicfiles.com/fairseq/wav2vec/libri960_big.pt
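
For reference, a minimal way to check what a checkpoint carries before picking a layer (a sketch; assumes the file unpickles to a dict, as fairseq checkpoints do, and the path is just an example):

python - <<'EOF'
import torch
cp = torch.load("wav2vec_small.pt", map_location="cpu")  # example path
print(list(cp.keys()))                  # fairseq checkpoints carry 'args' or 'cfg' plus the weights
print(cp.get("cfg") or cp.get("args"))  # the model config includes the encoder layer count
EOF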

cdleong commented 3 years ago

Alas, when I try that, it fails in the extract_features step:

Traceback (most recent call last):
  File "/home/jovyan/fairseq/examples/wav2vec/unsupervised/scripts/wav2vec_extract_features.py", line 119, in <module>
    main()
  File "/home/jovyan/fairseq/examples/wav2vec/unsupervised/scripts/wav2vec_extract_features.py", line 107, in main
    generator, num = get_iterator(args)
  File "/home/jovyan/fairseq/examples/wav2vec/unsupervised/scripts/wav2vec_extract_features.py", line 76, in get_iterator
    reader = Wav2VecFeatureReader(args.checkpoint, args.layer)
  File "/home/jovyan/fairseq/examples/wav2vec/unsupervised/scripts/wav2vec_extract_features.py", line 39, in __init__
    [cp_file]
  File "/home/jovyan/fairseq/fairseq/checkpoint_utils.py", line 446, in load_model_ensemble_and_task
    model = task.build_model(cfg.model)
  File "/home/jovyan/fairseq/fairseq/tasks/audio_pretraining.py", line 294, in build_model
    model = super().build_model(model_cfg)
  File "/home/jovyan/fairseq/fairseq/tasks/fairseq_task.py", line 324, in build_model
    model = models.build_model(cfg, self)
  File "/home/jovyan/fairseq/fairseq/models/__init__.py", line 96, in build_model
    return model.build_model(cfg, task)
  File "/home/jovyan/fairseq/fairseq/models/wav2vec/wav2vec2_asr.py", line 176, in build_model
    w2v_encoder = Wav2VecEncoder(cfg, len(task.target_dictionary))
  File "/home/jovyan/fairseq/fairseq/tasks/audio_pretraining.py", line 267, in target_dictionary
    return self.state.target_dictionary
  File "/home/jovyan/fairseq/fairseq/tasks/fairseq_task.py", line 41, in __getattr__
    self._state[name] = self._factories[name]()
  File "/home/jovyan/fairseq/fairseq/tasks/audio_pretraining.py", line 178, in load_target_dictionary
    return Dictionary.load(dict_path)
  File "/home/jovyan/fairseq/fairseq/data/dictionary.py", line 225, in load
    d.add_from_file(f)
  File "/home/jovyan/fairseq/fairseq/data/dictionary.py", line 238, in add_from_file
    raise fnfe
  File "/home/jovyan/fairseq/fairseq/data/dictionary.py", line 235, in add_from_file
    with open(PathManager.get_local_path(f), "r", encoding="utf-8") as fd:
FileNotFoundError: [Errno 2] No such file or directory: '/checkpoint/abaevski/data/speech/libri/960h/wav2vec/raw/dict.ltr.txt'

cdleong commented 3 years ago

Tried again with xlsr_53_56k.pt, and I don't get the FileNotFoundError! The length of res["layer_results"] for that model is 15, so it definitely seems that using wav2vec small was why I had the IndexError.

cdleong commented 3 years ago

Tried tracing the FileNotFoundError back, and it seems that when loading in https://dl.fbaipublicfiles.com/fairseq/wav2vec/libri960_big.pt, it actually contains the following key/value pair within it:

'data': '/checkpoint/abaevski/data/speech/libri/960h/wav2vec/raw/'

whereas https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt does not.

cdleong commented 3 years ago

I went to load_checkpoint_to_cpu() in checkpoint_utils, and added some print statements to see what's in there, right before the return statement. When I load the XLSR 53 pretrained model and look at state["cfg"]["task"] I see

{'_name': 'audio_pretraining', 'data': '/private/home/aconneau/projects/XLSR/MLS/53bis/', 'labels': None, 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_sample_size': 320000, 'min_sample_size': 32000, 'eval_wer': False, 'eval_wer_config': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': 'hard', 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_wer_tokenizer': None, 'eval_wer_post_process': 'letter', 'autoregressive': False}

whereas for 960h, we get

{'_name': 'audio_pretraining', 'data': '/checkpoint/abaevski/data/speech/libri/960h/wav2vec/raw/', 'labels': 'ltr', 'binarized_dataset': False, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_sample_size': None, 'min_sample_size': None, 'eval_wer': False, 'eval_wer_config': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_wer_tokenizer': None, 'eval_wer_post_process': 'letter', 'autoregressive': False, 'num_batch_buckets': 0, 'precompute_mask_indices': False, 'inferred_w2v_config': None, 'tpu': True}

which look very similar! So why does one succeed, and the other fail?

cdleong commented 3 years ago

Ah, one has "labels", the other does not.
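
If I'm reading the traceback right, when the task config has labels set (here 'ltr'), audio_pretraining tries to load dict.ltr.txt from the stored data path, which only existed on the original training machine. A possible workaround (a sketch; the arg_overrides key is my assumption based on fairseq's override mechanism, not something confirmed in this thread) is to point data at a local directory that actually contains dict.ltr.txt:

python - <<'EOF'
from fairseq import checkpoint_utils
# hypothetical: /dir/with/dicts is a local directory containing dict.ltr.txt
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["libri960_big.pt"],
    arg_overrides={"data": "/dir/with/dicts"},
)
EOF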