bene-ges / nemo_compatible

useful things that work with NVIDIA NeMo library
Apache License 2.0

Missing file 'get_all_titles_from_spoken_wikipedia.py' #9

Open thomaschhh opened 10 months ago

thomaschhh commented 10 months ago

I am currently looking into building the training dataset, but the referenced file is nowhere to be found:

https://github.com/bene-ges/nemo_compatible/blob/27bce6d91a12b74f6e4f84f18998df2d80582470/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh#L23

./build_training_data.sh: 25: /nemo_compatible/scripts/nlp/en_spellmapper/dataset_preparation/NeMo/examples/nlp/spellchecking_asr_customization/evaluation/get_all_titles_from_spoken_wikipedia.py: not found

bene-ges commented 10 months ago

It's in /nemo_compatible/scripts/nlp/en_spellmapper/evaluation/get_all_titles_from_spoken_wikipedia.py. I fixed the comment.
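
For anyone hitting a similar "not found" error after a script has moved, one way to locate it is to search the local checkout by file name. This is a minimal sketch, assuming the repository is cloned into a local directory named nemo_compatible; `find_by_name` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def find_by_name(root, filename):
    """Return every file under `root` named exactly `filename`, sorted by path.

    Hypothetical helper for illustration; returns an empty list when `root`
    is not an existing directory.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob(filename) if p.is_file())


if __name__ == "__main__":
    # "nemo_compatible" is an assumed name for the local clone directory.
    for path in find_by_name("nemo_compatible", "get_all_titles_from_spoken_wikipedia.py"):
        print(path)
```

If the search prints nothing, the file is either absent from the checked-out revision or the assumed clone directory is wrong.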

thomaschhh commented 10 months ago

That's working, thanks.

I am wondering though where the input_folder is supposed to be.

https://github.com/bene-ges/nemo_compatible/blob/45fdcead04a3bd7259e142fbbd4d76836908a45d/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh#L23

bene-ges commented 10 months ago

Oh, it's the spoken wikipedia folder. It should appear after downloading and unzipping spoken_wikipedia; see this code.

thomaschhh commented 10 months ago

Looks like the dataset is no longer available:

WARNING: cannot verify corpora.uni-hamburg.de's certificate, issued by ‘CN=GEANT OV RSA CA 4,O=GEANT Vereniging,C=NL’:
  Issued certificate has expired.
HTTP request sent, awaiting response... 500 Service unavailable (with message)
2023-11-14 11:24:59 ERROR 500: Service unavailable (with message).

bene-ges commented 10 months ago

OK, I added spoken_wiki_titles.txt to the repo; it should be sufficient for training. If you later need the full spoken_wikipedia for evaluation and it is still unavailable, tell me and I will upload my copy to Hugging Face.
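
The fallback described here can be sketched as a small check: use the unzipped corpus folder when it exists, otherwise fall back to the titles file from the repo. This is only an illustration; `pick_titles_source` and the paths passed to it are hypothetical, and the actual logic in build_training_data.sh may differ.

```python
from pathlib import Path


def pick_titles_source(corpus_dir, titles_file):
    """Return ("corpus", path) if the corpus folder exists, else ("titles", path).

    Hypothetical helper for illustration. Raises FileNotFoundError when
    neither the corpus folder nor the titles file is available.
    """
    corpus_dir, titles_file = Path(corpus_dir), Path(titles_file)
    if corpus_dir.is_dir():
        return "corpus", corpus_dir
    if titles_file.is_file():
        return "titles", titles_file
    raise FileNotFoundError(
        f"neither {corpus_dir} nor {titles_file} exists; "
        "download the corpus or fetch the titles file from the repo"
    )
```

For example, `pick_titles_source("spoken_wikipedia", "spoken_wiki_titles.txt")` would return the titles file while the corpus download is unavailable, assuming that file sits in the working directory.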