bene-ges / nemo_compatible

useful things that work with NVIDIA NeMo library
Apache License 2.0

Missing directory / folder structure #10

Open thomaschhh opened 8 months ago

thomaschhh commented 8 months ago

It's not clear where the mentioned folder structure is defined / created. The same goes for DATASET.

https://github.com/bene-ges/nemo_compatible/blob/de0818f86daaeb55f19d36f7dffcdcd046ac9227/scripts/nlp/en_spellmapper/dataset_preparation/generate_configs.sh#L21

bene-ges commented 8 months ago

fixed

bene-ges commented 8 months ago

At the end of build_training_data.sh you will get the files test.tsv and train.tsv in your working folder. Then copy them (or a fragment with the desired number of lines) into some folder, and that folder will be your DATASET.
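
For example, a minimal sketch of that last step (the folder name DATASET_DIR and the head line counts are just placeholders):

# run in the working folder that contains train.tsv and test.tsv
DATASET_DIR=my_dataset       # hypothetical folder name; any folder will do
mkdir -p ${DATASET_DIR}
# either copy the full files ...
cp train.tsv test.tsv ${DATASET_DIR}/
# ... or take only a fragment with the desired number of lines, e.g.
# head -n 1000000 train.tsv > ${DATASET_DIR}/train.tsv
# head -n 10000 test.tsv > ${DATASET_DIR}/test.tsv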

thomaschhh commented 8 months ago

When running build_training_data.sh, I run into two problems:

1.

https://github.com/bene-ges/nemo_compatible/blob/581142829076ed5bca88209dfcdcfa5778087024/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh#L83 contains a space at the end of the line (after the continuation backslash), which causes: ./build_training_data.sh: 84: --output_phrases_name: not found (see the sketch at the end of this comment)

After fixing this, I get the following error:

2.

File "/nemo/collections/nlp/data/spellchecking_asr_customization/utils.py", line 334, in load_index
    with open(input_name, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'index.txt.1'

which happens for every subsequent index.txt.{part} file. Moreover, index.txt.0 itself is completely empty. One indication that something might be going wrong is this output:

skip:  Aegaeon (moon)
skip:  Altrincham
skip:  Anfield
...
skip:  Zalman
skip:  Zorbing
1321136
1321136
0 len(custom_phrases) 169653
len(phrases)= 142120 ; len(ngram2phrases)= 100908
len(phrases)= 142120 ; len(ngram2phrases)= 100908

but I am not sure about it.
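
To illustrate problem 1 above: in shell scripts a backslash only continues a line if it is the very last character, so a trailing space after it ends the command early and the next line is run as a separate command. A hypothetical sketch of the failure mode (not the original script):

# broken: there is a space after the backslash, so the continuation stops there
# and the shell tries to run the next line on its own, giving
# "--output_phrases_name: not found"
python some_script.py \ 
  --output_phrases_name phrases.txt

# fixed: the backslash is the last character on the line
python some_script.py \
  --output_phrases_name phrases.txt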

bene-ges commented 8 months ago

What is the result of this command?

Can you show the first lines of the files ${SUBMISSPELLS} and ${NGRAM_MAPPINGS} that are passed to it?

bene-ges commented 8 months ago

and their sizes in lines (wc -l)?

thomaschhh commented 8 months ago

$ wc -l sub_misspells.txt
176560 sub_misspells.txt

sub_misspells.txt:

cycling cycling 1   1   2
brussels    where says  9   10  10
brussels    brussels    1   10  6
...
funatics    funatics    1   1   1
funazushi   four n  1   1   1

$ wc -l replacement_vocab_filt.txt
787237 replacement_vocab_filt.txt

replacement_vocab_filt.txt:

c   c   77014   125742  93619
c   s   9972    125742  180145
c   k   6072    125742  31821
...
n a o k <DELETE> e+_+l o c  1   1   1
k a _ y c a+l _ y   1   1   1
n a z u <DELETE> <DELETE> <DELETE> <DELETE> 1   1   27422

bene-ges commented 8 months ago

It's too small. Is it only for part of the data? Mine are: 6725799 sub_misspells.txt, 3604989 replacement_vocab_filt.txt.

bene-ges commented 8 months ago

Still, you should get something in index.txt.0 after index_phrases.py... do you get an empty file?

thomaschhh commented 8 months ago

Still, you should get something in index.txt.0 after index_phrases.py... do you get an empty file?

Yes, index.txt.0 is empty. index.txt.{1-13} do not exist.

bene-ges commented 8 months ago

Was it intentional that your input files are smaller? If so, you can share them with me and I will check why index_phrases.py outputs nothing to index.txt.0. The other output files do not exist because your input is small and should fit into the first portion.

If you did not intend them to be small, let's compare the previous file sizes to find where the divergence originated.

bene-ges commented 8 months ago

Also, you can try to run index_phrases.py using my replacement_vocab_filt.txt from here. Will it output anything?
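
A minimal sketch of fetching that file, assuming it is the one in the bene-ges/spellmapper_asr_customization_en repo on Hugging Face:

# download the reference ngram mapping vocabulary (URL assumed from the repo name above)
wget https://huggingface.co/bene-ges/spellmapper_asr_customization_en/resolve/main/replacement_vocab_filt.txt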

thomaschhh commented 8 months ago

No, that wasn't intentional. I have been trying to replicate the steps that you mention here: https://github.com/bene-ges/nemo_compatible/tree/main/scripts/nlp/en_spellmapper. That means I don't tweak anything for my own data but leave everything as provided in your repo.

I have now replaced the replacement_vocab_filt.txt that was generated using your code with the one that you were referring to here: https://huggingface.co/bene-ges/spellmapper_asr_customization_en/blob/main/replacement_vocab_filt.txt

However, this still persists:

Still, you should get something in index.txt.0 after index_phrases.py... do you get an empty file?

Yes, index.txt.0 is empty. index.txt.{1-13} do not exist.

replacement_vocab_filt.txt sub_misspells.txt

bene-ges commented 8 months ago

Please check the following sizes (number of lines); they should be close to:

4523860 pred_ctc.all.json (the same size should be in ${ALIGNMENT_DIR}/src, ${ALIGNMENT_DIR}/dst and ${ALIGNMENT_DIR}/align.out)
11194566 replacement_vocab_full.txt
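
Spelled out as commands, the checks would be roughly:

wc -l pred_ctc.all.json            # should be close to 4523860
wc -l ${ALIGNMENT_DIR}/src ${ALIGNMENT_DIR}/dst ${ALIGNMENT_DIR}/align.out   # same size as above
wc -l replacement_vocab_full.txt   # should be close to 11194566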

What was your last link meant for? It just links to this issue. If you want, you can share your sub_misspells.txt.

thomaschhh commented 8 months ago

I edited my comment so now you can see the files I wanted to upload.

pred_ctc.all.json: 254673
${ALIGNMENT_DIR}/src: 254277
${ALIGNMENT_DIR}/dst: 254277
${ALIGNMENT_DIR}/align.out: 254277

I also just saw that I only have pred_ctc.xa{a-e}.json, which might explain the difference in the number of lines later in the pipeline. However, shouldn't index.txt.0 at least have some content?

bene-ges commented 8 months ago

I tried to launch with your files and got a non-empty index.txt.0 (749095 lines). My command was:

SUBMISSPELLS=sub_misspells.txt
NGRAM_MAPPINGS=replacement_vocab_filt.txt

python ${NEMO_COMPATIBLE_PATH}/scripts/nlp/en_spellmapper/dataset_preparation/index_phrases.py \
  --input_file ${SUBMISSPELLS} \
  --output_file index.txt \
  --ngram_mapping ${NGRAM_MAPPINGS} \
  --min_log_prob -1.0 \
  --max_phrases_per_ngram 400 \
  --max_misspelled_freq 10000 \
  --input_portion_size 500000

I cloned the nemo_compatible code from the repo (its current state). Can you also check that this file in NeMo matches the one in your local NeMo repo? It is the only one that is imported from NeMo for this operation.
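
One way to do that check (a sketch; ${NEMO_PATH} is assumed to point at your local NeMo clone):

# which utils.py does the installed nemo package actually import?
INSTALLED_UTILS=$(python -c 'import nemo.collections.nlp.data.spellchecking_asr_customization.utils as u; print(u.__file__)')
echo ${INSTALLED_UTILS}
# compare it with the file in the local NeMo clone
diff ${INSTALLED_UTILS} ${NEMO_PATH}/nemo/collections/nlp/data/spellchecking_asr_customization/utils.py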

thomaschhh commented 8 months ago

OK, that's weird.

Maybe I have to rewind a bit. So, after running this

for part in "xaa" "xab" "xac" "xad" "xae" "xaf" "xag" "xah" "xai" "xaj" "xak" "xal" "xam" "xan" "xao" "xap" "xaq" "xar" "xas" "xat" "xau" "xav" "xaw" "xax" "xay" "xaz"
do
    python ${NEMO_COMPATIBLE_PATH}/scripts/tts/tts_en_infer_from_cmu_phonemes.py --input_name $part --output_dir tts --output_manifest $part.json --sample_rate 16000
    python ${NEMO_PATH}/examples/asr/transcribe_speech.py \
      pretrained_name="stt_en_conformer_ctc_large" \
      dataset_manifest=${part}.json \
      output_filename=./pred_ctc.$part.json \
      batch_size=256 \
      cuda=0 \
      amp=True
done

in run_tts_and_asr.sh, I only get pred_ctc.xa{a-e}.json, and files like e.g. xa{h-j}.json are empty. I don't know if that's expected. I think that's also why pred_ctc.all.json (254673 lines) is so small compared to your file.

When looking at build_training_data.sh, I start seeing empty files, starting with idf.txt. Subsequently, yago_wiki.txt, yago_wiki_sample.phrases, yago_wiki_sample.paragraphs, and yago_wiki_sample.paragraphs.norm are empty.

The file utils.py matches the one you mentioned.

bene-ges commented 8 months ago

What about tts_input.txt? Mine was 4522860 lines.

Then it was split into 26 parts (split -n 26) to give the xaa..xaz files.

Same question for yago.uniq2 and yago.vocab.to_cmu.output.
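
For reference, a sketch of the split step mentioned above and a quick check that no part came out empty:

# split tts_input.txt into 26 parts xaa..xaz (split's default prefix is x)
split -n 26 tts_input.txt
# none of the parts should be empty
wc -l xa?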

bene-ges commented 8 months ago

One more thing: when you import something from nemo, it is imported from the installed version, which may or may not point to your cloned NeMo repo. Can you check in your Python env, e.g. in python3.9/site-packages/: does it have a nemo-toolkit* directory, and what is inside?
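
A sketch of that check (the site-packages path depends on your Python version and environment):

# where is the nemo package actually imported from?
python -c 'import nemo; print(nemo.__file__)'
# which nemo-toolkit distribution is installed?
pip show nemo_toolkit
# or list the site-packages directory directly (the python3.9 path is just an example)
ls ~/.local/lib/python3.9/site-packages/ | grep -i nemo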

bene-ges commented 8 months ago

idf.txt should not be empty, though this problem is not related to the previously mentioned problems, which were about the processing of Wikipedia titles. idf.txt is calculated using Wikipedia texts, which you had to download and put into ${WIKIPEDIA_FOLDER}. Do you have files like part_xaa.tar.gz in it?

thomaschhh commented 8 months ago

What about tts_input.txt? Mine was 4522860 lines.

1335229 tts_input.txt

One more thing: when you import something from nemo, it is imported from the installed version, which may or may not point to your cloned NeMo repo. Can you check in your Python env, e.g. in python3.9/site-packages/: does it have a nemo-toolkit* directory, and what is inside?

nemo
nemo_text_processing
nemo_text_processing-0.2.2rc0.dist-info
nemo_toolkit-1.21.0rc0.dist-info

I installed the necessary packages using pip install nemo_toolkit['all'] as described here.

Same question for yago.uniq2 and yago.vocab.to_cmu.output.

1336274 yago.uniq2
771553 yago.vocab.to_cmu.output

Inside ${WIKIPEDIA_FOLDER} there are only .txt files (29612 of them) but no .tar.gz files.

...
├── zirkuh.txt
├── zonguldak.txt
└── zurich.txt

bene-ges commented 8 months ago

  1. I fixed a bug in preprocess_yago.py; now it should give 5909356 lines, which is a little better than before. Please try to rerun.

  2. When I created my Wikipedia folder, I split more than a million .txt files into parts (subfolders) and packed each into a tar.gz. For example, I had a subfolder part_xaa with 30000 files, and it turned into part_xaa.tar.gz, and so on up to part_xcs.tar.gz.

If you pack your .txt files into .tar.gz archives like that, you should get a non-empty idf.txt, but it seems that you have only a small part of the Wikipedia data.
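
For reference, a rough sketch of such a packing step (the chunk size and part_ naming are taken from the description above; this is a sketch, not the original script):

cd ${WIKIPEDIA_FOLDER}
# group the .txt file names into chunks of 30000 (list_xaa, list_xab, ...)
find . -maxdepth 1 -name "*.txt" -printf "%f\n" | split -l 30000 - list_x
for list in list_x*
do
    part=part_${list#list_}             # e.g. part_xaa
    mkdir -p ${part}
    xargs -a ${list} mv -t ${part}      # move the listed files into the subfolder
    tar -czf ${part}.tar.gz ${part}     # pack the subfolder as part_xaa.tar.gz
done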

I will upload this whole Wikipedia folder to Hugging Face for simplicity.

  3. Is your file in site-packages/nemo/collections/nlp/data/spellchecking_asr_customization/utils.py the same as utils.py?

bene-ges commented 8 months ago

Here are 107 *.tar.gz files that should be in your ${WIKIPEDIA_FOLDER}

thomaschhh commented 8 months ago

  1. I fixed a bug in preprocess_yago.py; now it should give 5909356 lines, which is a little better than before. Please try to rerun.

Thanks, will do :)

2. When I created my Wikipedia folder, I split more than a million .txt files into parts (subfolders) and packed each into a tar.gz. For example, I had a subfolder `part_xaa` with 30000 files, and it turned into `part_xaa.tar.gz`, and so on up to `part_xcs.tar.gz`.

If you pack your .txt files into .tar.gz archives like that, you should get a non-empty idf.txt, but it seems that you have only a small part of the Wikipedia data.

Shouldn't the zipping process be in your code, too? Somewhere around here: https://github.com/bene-ges/nemo_compatible/blob/8ace92c40316b1a249e83852c788e1ca74fb640c/scripts/nlp/en_spellmapper/dataset_preparation/preprocess_yago.sh#L40

I will upload this whole Wikipedia folder to Hugging Face for simplicity.

Thanks!

The file utils.py matches the one you mentioned.

thomaschhh commented 8 months ago

Here are 107 *.tar.gz files that should be in your ${WIKIPEDIA_FOLDER}

With this data it seems to have worked; idf.txt is no longer empty.

However, I still think that the creation of the *.tar.gz files should be part of the script.