Missing file GIZA++.A3.final #6

thomaschhh commented 11 months ago

https://github.com/bene-ges/nemo_compatible/blob/194af660d9b6d3d578884048d40b524775fd10e8/scripts/nlp/en_spellmapper/dataset_preparation/prepare_corpora_after_alignment.py#L167

When running get_ngram_mappings.sh, I get: No such file or directory: ../align/GIZA++.A3.final. I can't find the code that should create this file. I guess the same would happen for GIZA++reverse.A3.final on line 168.

bene-ges commented 11 months ago

Hi @thomaschhh, did you run all the previous steps in get_ngram_mappings.sh? What is your folder structure at this moment? Do you have an align directory? Check Giza++.log in it: does it report any errors?

bene-ges commented 11 months ago

The missing files should be created by the GIZA++ alignment tool, which is an external C++ binary.

Commands to install GIZA++ if you don't have it:

git clone https://github.com/moses-smt/giza-pp.git giza-pp
cd giza-pp
make
cd ..
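
Before rerunning the later steps, it may help to confirm that the alignment outputs actually exist. The following is a minimal sketch and not part of the repository: the helper name missing_alignment_outputs is hypothetical, and it assumes that the two file names mentioned in this thread sit directly inside the align directory.

```python
from pathlib import Path

# File names taken from this thread; their location in align/ is an assumption.
EXPECTED_OUTPUTS = ("GIZA++.A3.final", "GIZA++reverse.A3.final")


def missing_alignment_outputs(align_dir):
    """Return the expected GIZA++ outputs that are absent or empty."""
    align_path = Path(align_dir)
    missing = []
    for name in EXPECTED_OUTPUTS:
        candidate = align_path / name
        # Treat a zero-byte file the same as a missing one.
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "../align" is the relative path that appears in the error message.
    print(missing_alignment_outputs("../align"))
```

An empty list would suggest that both alignment files were produced. Any name in the list points back to the GIZA++ step and its log.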

thomaschhh commented 11 months ago

I think these are the relevant paths:

SpellMapper/nemo_compatible/scripts/nlp/en_spellmapper/dataset_preparation/get_ngram_mappings.sh
SpellMapper/align (this is also where GIZA++.log is situated)
~/Desktop/giza-pp

What surprises me is that pred_ctc.all.json and giza_input.txt are empty after running line 15 of get_ngram_mappings.sh.

bene-ges commented 11 months ago

@thomaschhh, pred_ctc.all.json should have been generated at the end of the previous step, run_tts_and_asr.sh. It is a big file with records like this:

{"audio_filepath": "tts_resample/82.wav", "text": "aabroo", "g2p": "AA1,B,R,UW2", "pred_text": "a brew"}
{"audio_filepath": "tts_resample/83.wav", "text": "aabsal", "g2p": "B,S,AA1,L", "pred_text": "it's all"}
{"audio_filepath": "tts_resample/84.wav", "text": "aabshar", "g2p": "AA1,B,SH,AA2,R", "pred_text": "of shore"}
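
A quick way to tell whether a manifest like this is usable is to count the records that carry all four fields shown above. This is only a sketch under that assumption. The function name count_valid_records is hypothetical, and the set of required keys is inferred from the example records rather than from the pipeline code.

```python
import json

# Keys seen in the example records in this thread.
REQUIRED_KEYS = {"audio_filepath", "text", "g2p", "pred_text"}


def count_valid_records(lines):
    """Count JSON-lines records that contain every required key."""
    valid = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if isinstance(record, dict) and REQUIRED_KEYS.issubset(record):
            valid += 1
    return valid


if __name__ == "__main__":
    sample = [
        '{"audio_filepath": "tts_resample/82.wav", "text": "aabroo", '
        '"g2p": "AA1,B,R,UW2", "pred_text": "a brew"}',
        "",
    ]
    print(count_valid_records(sample))
```

Passing an open file handle for pred_ctc.all.json instead of the sample list works the same way. A count of zero would match the empty-file symptom described in this thread.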

thomaschhh commented 11 months ago

As proposed in issue #5, I commented out this line: https://github.com/bene-ges/nemo_compatible/blob/6c120745e8d42d406d4c19b14baecddf97500b92/scripts/nlp/en_spellmapper/dataset_preparation/run_tts_and_asr.sh#L18

I ran into a couple of errors:

anaconda3/envs/spellMapper/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal

nemo_compatible/scripts/tts/tts_en_infer_from_cmu_phonemes.py", line 59, in <module>
    raw, inp = line.split("\t")
ValueError: not enough values to unpack (expected 2, got 1)

cat: 'pred_ctc.x*.json': No such file or directory

Based on this line https://github.com/bene-ges/nemo_compatible/blob/2b1ca5934d57256006a0a9f66c467587ba07df05/scripts/nlp/en_spellmapper/dataset_preparation/run_tts_and_asr.sh#L15 I get e.g. a file that's called xaa which contains:

cycling S,AY1,K,AH0,L,IH0,NG
brussels    B,R,AH1,S,AH0,L,Z
edmonton    EH1,D,M,AH0,N,T,AH0,N
...

and a file called xaa.json which contains the structure that you described above:

{"audio_filepath": "tts/0.wav", "text": "cycling", "g2p": "S,AY1,K,AH0,L,IH0,NG"}
{"audio_filepath": "tts/1.wav", "text": "brussels", "g2p": "B,R,AH1,S,AH0,L,Z"}
{"audio_filepath": "tts/2.wav", "text": "edmonton", "g2p": "EH1,D,M,AH0,N,T,AH0,N"}
...

However, the second file already looks odd. xab:

,AH0,L,EY1
arques  AA1,R,K,S
arquettes en val    AA0,R,K,EH1,T,S, ,EH1,N, ,V,AE1,L
...

xab.json (empty):

bene-ges commented 11 months ago

@thomaschhh, concerning your second error: xaa.json is OK. The problem with xab is that, for some reason, the input file contained a line missing its first field. I pushed a fix to skip such cases.
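
To illustrate the kind of guard this implies, here is a minimal sketch and not the code that was actually pushed. It assumes the input is tab-separated text and phonemes, as in the xaa sample, and the function name parse_phoneme_lines is hypothetical.

```python
def parse_phoneme_lines(lines):
    """Yield (text, phonemes) pairs, skipping lines without both tab-separated fields."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[0] or not parts[1]:
            # Malformed line, e.g. ",AH0,L,EY1" with no first field.
            # Without this continue, the unpacking below would fail.
            continue
        raw, inp = parts
        yield raw, inp.split(",")


if __name__ == "__main__":
    sample = [",AH0,L,EY1\n", "arques\tAA1,R,K,S\n"]
    for text, phonemes in parse_phoneme_lines(sample):
        print(text, phonemes)
```

The first sample line has no tab, so splitting it yields a single field. That is the same condition behind the "expected 2, got 1" error reported earlier.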

Concerning your first error with CUDA: does it occur in the call to transcribe_speech? Try replacing cuda=1 with cuda=0 in its parameters, or with whatever your device ID is.
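
An "invalid device ordinal" error generally means the requested GPU index is not below the number of visible devices. The sketch below shows one way to pick a usable index. The helper resolve_cuda_id is hypothetical, and returning -1 to mean "no GPU" is a convention of this sketch only, not a documented option of any script here. torch is imported only to query the real device count.

```python
def resolve_cuda_id(requested, device_count):
    """Return a usable CUDA device index, or -1 if no GPU is visible."""
    if device_count <= 0:
        return -1  # this sketch's own convention for "no GPU available"
    if 0 <= requested < device_count:
        return requested
    return 0  # fall back to the first GPU when the requested ordinal is invalid


if __name__ == "__main__":
    try:
        import torch  # only needed to query the real device count

        count = torch.cuda.device_count()
    except ImportError:
        count = 0
    print(resolve_cuda_id(1, count))
```

On a machine with a single GPU, requesting index 1 falls back to 0, which corresponds to the cuda=0 suggestion above.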

thomaschhh commented 11 months ago

Those changes seem to have worked, thanks.

Nevertheless, I now get these errors:

    if os.stat(cfg.dataset_manifest).st_size == 0:
FileNotFoundError: [Errno 2] No such file or directory: 'xaj_decoded.json'

or

/nemo_compatible/scripts/tts/tts_en_infer_from_cmu_phonemes.py", line 65, in <module>
    parsed = text_tokenizer.encode_from_g2p(inp.split(","))
AttributeError: '_io.TextIOWrapper' object has no attribute 'split'

for every xa{}.json file

bene-ges commented 11 months ago

Another fix, sorry (I forgot a continue statement).

thomaschhh commented 11 months ago

> another fix, sorry (I forgot continue)

I can't see the change you implemented. Did you push it? 🙃

bene-ges commented 11 months ago

Now pushed)) (I didn't notice that the previous push had been rejected because I first needed to pull your changes.)

thomaschhh commented 11 months ago

Traceback (most recent call last):
  File "/nemo_compatible/scripts/nlp/en_spellmapper/dataset_preparation/NeMo/examples/asr/transcribe_speech.py", line 299, in main
    filepaths, partial_audio = prepare_audio_data(cfg)
  File "/anaconda3/envs/spellMapper/lib/python3.9/site-packages/nemo/collections/asr/parts/utils/transcribe_utils.py", line 219, in prepare_audio_data
    if os.stat(cfg.dataset_manifest).st_size == 0:
FileNotFoundError: [Errno 2] No such file or directory: 'xaa_decoded.json'

It seems like all the xa{}_decoded.json files are still missing. Isn't this related to #5?

bene-ges commented 11 months ago

Yes, fixed. Try again.
