thomaschhh opened this issue 8 months ago
fixed
At the end of `build_training_data.sh` you will get the files `test.tsv` and `train.tsv` in your working folder. Then you should copy them (or you can take a fragment with the desired number of lines) into some folder, and that will be your DATASET.
When running `build_training_data.sh`, I run into two problems:
1. https://github.com/bene-ges/nemo_compatible/blob/581142829076ed5bca88209dfcdcfa5778087024/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh#L83 contains a space at the end of the line, which causes:
```
./build_training_data.sh: 84: --output_phrases_name: not found
```
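For context, a trailing space after a line-continuation backslash breaks the continuation: the shell parses `\ ` as an escaped space, so the next line runs as a separate command. A minimal reproduction (a hypothetical two-line script, not the real `build_training_data.sh`):

```shell
# Write a script whose backslash is followed by a space, so the
# continuation is broken and "--flag b" runs as its own command.
demo=$(mktemp)
printf 'echo a \\ \n--flag b\n' > "$demo"
sh "$demo" 2>&1 | grep -q 'not found' && echo "reproduced"
```

Running it prints `reproduced`, mirroring the `--output_phrases_name: not found` failure above.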
2. After fixing this, I get the following error:
```
File "/nemo/collections/nlp/data/spellchecking_asr_customization/utils.py", line 334, in load_index
    with open(input_name, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'index.txt.1'
```
which happens for every `index.txt.{part}` file thereafter. Moreover, `index.txt.0` is already completely empty. An indicator that something might be going wrong is this output:
```
skip: Aegaeon (moon)
skip: Altrincham
skip: Anfield
...
skip: Zalman
skip: Zorbing
1321136
1321136
0 len(custom_phrases) 169653
len(phrases)= 142120 ; len(ngram2phrases)= 100908
len(phrases)= 142120 ; len(ngram2phrases)= 100908
```
but I am not sure about it.
What is the result of this command? Can you show the first lines of the files `${SUBMISSPELLS}` and `${NGRAM_MAPPINGS}` that are passed to it, and their size in lines (`wc -l`)?
`wc -l sub_misspells.txt`: 176560. First lines:

```
cycling cycling 1 1 2
brussels where says 9 10 10
brussels brussels 1 10 6
...
funatics funatics 1 1 1
funazushi four n 1 1 1
```

`wc -l replacement_vocab_filt.txt`: 787237. First lines:

```
c c 77014 125742 93619
c s 9972 125742 180145
c k 6072 125742 31821
...
n a o k <DELETE> e+_+l o c 1 1 1
k a _ y c a+l _ y 1 1 1
n a z u <DELETE> <DELETE> <DELETE> <DELETE> 1 1 27422
```
It's too small - is it only for part of the data? Mine are:
```
6725799 sub_misspells.txt
3604989 replacement_vocab_filt.txt
```
Still, you should get something in `index.txt.0` after `index_phrases.py`... Do you get an empty file?
> Still, you should get something in `index.txt.0` after `index_phrases.py`... Do you get an empty file?

Yes, `index.txt.0` is empty. `index.txt.{1-13}` do not exist.
Was it intentional that your input files are smaller? If so, you can share them with me, and I will check why `index_phrases.py` outputs nothing in `index.txt.0`. The other output files do not exist because your input is small and should fit into the first portion.

If you did not intend them to be small, let's compare previous file sizes to find where the divergence originated.

Also, you can try to run `index_phrases.py` using my `replacement_vocab_filt.txt` from here. Will it output anything?
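If `index_phrases.py` simply chunks its input into pieces of `--input_portion_size` lines (my reading of the comment above, not verified against the code), the expected number of `index.txt.*` files is a ceiling division:

```shell
# ceil(input_lines / portion_size) = number of index.txt.* portions
input_lines=176560      # the smaller sub_misspells.txt from this thread
portion_size=500000     # --input_portion_size
echo $(( (input_lines + portion_size - 1) / portion_size ))   # prints 1: only index.txt.0
```

With the full 6725799-line file the same arithmetic gives 14 portions, `index.txt.0` .. `index.txt.13`, which matches the `index.txt.{1-13}` names mentioned above.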
No, that wasn't intentional. I have been trying to replicate the steps that you mention here https://github.com/bene-ges/nemo_compatible/tree/main/scripts/nlp/en_spellmapper. That means I don't tweak anything to my own data but leave everything as provided in your repo.
I now replaced the `replacement_vocab_filt.txt` that was generated using your code with the one that you were referring to here: https://huggingface.co/bene-ges/spellmapper_asr_customization_en/blob/main/replacement_vocab_filt.txt

However, the problem persists:
> Still, you should get something in `index.txt.0` after `index_phrases.py`... Do you get an empty file?

Yes, `index.txt.0` is empty. `index.txt.{1-13}` do not exist.
Please check the following sizes (number of lines), they should be close to:

- 4523860 `pred_ctc.all.json` (the same size should be in `${ALIGNMENT_DIR}/src`, `${ALIGNMENT_DIR}/dst`, and `${ALIGNMENT_DIR}/align.out`)
- 11194566 `replacement_vocab_full.txt`
What was your last link meant for? It just links to this issue.

If you want, you can share your `sub_misspells.txt`.
I edited my comment, so now you can see the files I wanted to upload.

- `pred_ctc.all.json`: 254673
- `${ALIGNMENT_DIR}/src`: 254277
- `${ALIGNMENT_DIR}/dst`: 254277
- `${ALIGNMENT_DIR}/align.out`: 254277

I also just saw that I only have `pred_ctc.xa{a-e}.json`, which might explain the difference in the number of lines down the line. However, shouldn't `index.txt.0` at least have some content?
I tried to launch with your files and got a non-empty `index.txt.0` (749095 lines). My command was:

```shell
SUBMISSPELLS=sub_misspells.txt
NGRAM_MAPPINGS=replacement_vocab_filt.txt

python ${NEMO_COMPATIBLE_PATH}/scripts/nlp/en_spellmapper/dataset_preparation/index_phrases.py \
  --input_file ${SUBMISSPELLS} \
  --output_file index.txt \
  --ngram_mapping ${NGRAM_MAPPINGS} \
  --min_log_prob -1.0 \
  --max_phrases_per_ngram 400 \
  --max_misspelled_freq 10000 \
  --input_portion_size 500000
```
I cloned the `nemo_compatible` code from the repo (its current state).

Can you also check that this file in nemo matches the one in your local nemo repo? It is the only one that is imported from nemo for this operation.
Ok, that's weird. Maybe I have to rewind a bit. So, after running this in `run_tts_and_asr.sh`:

```shell
for part in "xaa" "xab" "xac" "xad" "xae" "xaf" "xag" "xah" "xai" "xaj" "xak" "xal" "xam" "xan" "xao" "xap" "xaq" "xar" "xas" "xat" "xau" "xav" "xaw" "xax" "xay" "xaz"
do
  python ${NEMO_COMPATIBLE_PATH}/scripts/tts/tts_en_infer_from_cmu_phonemes.py --input_name $part --output_dir tts --output_manifest $part.json --sample_rate 16000
  python ${NEMO_PATH}/examples/asr/transcribe_speech.py \
    pretrained_name="stt_en_conformer_ctc_large" \
    dataset_manifest=${part}.json \
    output_filename=./pred_ctc.$part.json \
    batch_size=256 \
    cuda=0 \
    amp=True
done
```

I only get `pred_ctc.xa{a-e}.json`, and files like e.g. `xa{h-j}.json` are empty. I don't know if that's desired. I think that's also why my `pred_ctc.all.json` (254673 lines) is so small compared to your file.
When looking at `build_training_data.sh`, I start seeing empty files beginning with `idf.txt`. Subsequently, `yago_wiki.txt`, `yago_wiki_sample.phrases`, `yago_wiki_sample.paragraphs`, and `yago_wiki_sample.paragraphs.norm` are empty.
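A quick way to list which of the split parts came out empty is a `-s` check over the manifests (sketched here on stand-in `xa?.json` files in a temp directory):

```shell
tmp=$(mktemp -d) && cd "$tmp"
printf '{"audio_filepath": "a.wav"}\n' > xaa.json   # stand-in non-empty manifest
: > xab.json                                        # stand-in empty manifest
for f in xa?.json; do
  [ -s "$f" ] || echo "empty: $f"   # -s is true only for existing, non-empty files
done                                # prints: empty: xab.json
```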
The file `utils.py` matches the one you mentioned.
What about `tts_input.txt`? Mine was 4522860 lines; it was then split into 26 parts (`split -n 26`) to give the xaa..xaz files.
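That split can be reproduced at small scale; note that GNU `split -n 26` divides by bytes and may cut a line in half, while `split -n l/26` keeps lines intact:

```shell
tmp=$(mktemp -d) && cd "$tmp"
seq 1 260 > tts_input_demo.txt          # stand-in for tts_input.txt
split -n l/26 tts_input_demo.txt        # line-aware split into 26 parts: xaa .. xaz
ls xa? | wc -l                          # prints 26
cat xa? | cmp -s - tts_input_demo.txt && echo "no lines lost"
```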
Same question for `yago.uniq2` and `yago.vocab.to_cmu.output`.
One more thing: when you import something from nemo, it is imported from the installed version, which may point to your cloned nemo repo, but may not. Can you check in your python env, e.g. `python3.9/site-packages/` - does it have a `nemo-toolkit*` directory, and what is inside?
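A quick way to check which installation an import actually resolves to is to print the module's `__file__` (shown with the stdlib `json` module as a stand-in; substitute `nemo` in your environment):

```shell
# Prints the path a module is really imported from; if this points into
# site-packages rather than your cloned repo, the installed package wins.
python3 -c 'import importlib; m = importlib.import_module("json"); print(m.__file__)'
```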
`idf.txt` should not be empty, though this problem is not related to the previously mentioned problems, which were about the processing of wikipedia titles. `idf.txt` is calculated using wikipedia texts, which you had to download and put into `${WIKIPEDIA_FOLDER}`. Do you have files like `part_xaa.tar.gz` in it?
> What about `tts_input.txt`? Mine was 4522860 lines.

1335229 `tts_input.txt`
> One more thing: when you import something from nemo, it is imported from the installed version, which may point to your cloned nemo repo, but may not. Can you check in your python env, e.g. `python3.9/site-packages/` - does it have a `nemo-toolkit*` directory, and what is inside?

```
nemo  nemo_text_processing  nemo_text_processing-0.2.2rc0.dist-info  nemo_toolkit-1.21.0rc0.dist-info
```

I installed the necessary packages using `pip install nemo_toolkit['all']` as described here.
> Same question for `yago.uniq2` and `yago.vocab.to_cmu.output`.

```
1336274 yago.uniq2
 771553 yago.vocab.to_cmu.output
```
Inside `${WIKIPEDIA_FOLDER}` there are only `.txt` files (29612 of them) but no `tar.gz` files.

```
...
├── zirkuh.txt
├── zonguldak.txt
└── zurich.txt
```
1. I fixed a bug in `preprocess_yago.py`, now it should give 5909356 lines, which is a little better than before. Try to rerun, please.
2. When I created my wikipedia folder, I split more than a million .txt files into parts (subfolders) and packed each into tar.gz. For example, I had a subfolder `part_xaa` with 30000 files, and it turned into `part_xaa.tar.gz`, and so on up to `part_xcs.tar.gz`. If you pack your txt files into .tar.gz archives, you should get a non-empty `idf.txt`, but it seems that you have only a small part of the wikipedia data.

I will upload all this wikipedia folder to huggingface for simplicity.
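The packing step described above can be sketched as follows (a hypothetical small-scale layout: 3 files per part instead of 30000, and GNU `tar -T` to read each batch list; names follow the `part_xaa.tar.gz` convention):

```shell
tmp=$(mktemp -d) && cd "$tmp"
for i in 1 2 3 4 5 6; do echo "article $i" > "page$i.txt"; done  # stand-in wiki pages
ls *.txt | sort > all_txt.list
split -l 3 all_txt.list part_x           # batch lists part_xaa, part_xab (-l 30000 for real data)
for list in part_x??; do
  tar -czf "$list.tar.gz" -T "$list"     # pack each batch into part_xaa.tar.gz, ...
done
ls part_x*.tar.gz                        # prints part_xaa.tar.gz and part_xab.tar.gz
```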
Is `site-packages/nemo/collections/nlp/data/spellchecking_asr_customization/utils.py` the same as your `utils.py`?

Here are 107 `*.tar.gz` files that should be in your `${WIKIPEDIA_FOLDER}`.
> I fixed a bug in `preprocess_yago.py`, now it should give 5909356 lines, which is a little better than before. Try to rerun, please.

Thanks, will do :)
> 2. When I created my wikipedia folder, I split more than a million .txt files into parts (subfolders) and packed each into tar.gz. For example, I had a subfolder `part_xaa` with 30000 files, and it turned into `part_xaa.tar.gz`, and so on up to `part_xcs.tar.gz`. If you pack your txt files into .tar.gz archives, you should get a non-empty `idf.txt`, but it seems that you have only a small part of the wikipedia data.

Shouldn't the zipping process be in your code, too? Somewhere around here: https://github.com/bene-ges/nemo_compatible/blob/8ace92c40316b1a249e83852c788e1ca74fb640c/scripts/nlp/en_spellmapper/dataset_preparation/preprocess_yago.sh#L40

> I will upload all this wikipedia folder to huggingface for simplicity.

Thanks!
The file `utils.py` matches the one you mentioned.

> Here are 107 `*.tar.gz` files that should be in your `${WIKIPEDIA_FOLDER}`

With this data it seemed to have worked. Now, `idf.txt` is no longer empty. However, I still think that the creation of the `*.tar.gz` files should be part of the script.
It's not clear where the mentioned folder structure is defined / created. The same goes for `DATASET`: https://github.com/bene-ges/nemo_compatible/blob/de0818f86daaeb55f19d36f7dffcdcd046ac9227/scripts/nlp/en_spellmapper/dataset_preparation/generate_configs.sh#L21