AI4Bharat / IndicXlit

Transliteration models for 21 Indic languages
https://ai4bharat.iitm.ac.in/transliteration
MIT License

lm/train.arpa.en model empty #29

Open Aarif1430 opened 11 months ago

Aarif1430 commented 11 months ago

Hi all, can you help with data mining from the Samanantar dataset? I am following exactly the steps described in the data_mining/transliteration_mining_samanantar folder and everything works, except that train.arpa.en is never populated, which breaks the next step, mosesdecoder/scripts/Transliteration/train-transliteration-module.pl. Any help with this is much appreciated. I am attaching some images of my progress so you can understand better. As shown below, all the required files seem to be generated, but train.arpa.en is empty (I have also added the empty spaces to the file, as mentioned).

[Screenshot 2023-12-13 at 17:49:16]
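For reference, train.arpa.en is the KenLM-format language model that the KENLM line in moses.ini points at (order=3). Below is a minimal sketch of how such a file can be built and sanity-checked, assuming KenLM's lmplz from the mosesdecoder build and a character-level corpus at lm/train.en; both paths and the choice of lmplz are assumptions, so adjust them to your layout (the mining scripts may invoke a different LM tool).

import os
import subprocess

# Assumed paths -- adjust to your layout (directory names follow the moses.ini above).
LMPLZ = "/workspaces/BashantaraAI/mosesdecoder/bin/lmplz"   # assumption: KenLM built with Moses
CORPUS = "/workspaces/BashantaraAI/en-mr/lm/train.en"       # assumption: space-separated character corpus
ARPA = "/workspaces/BashantaraAI/en-mr/lm/train.arpa.en"

def build_lm(order: int = 3) -> None:
    """Build a KenLM ARPA model of the given order (matches order=3 in moses.ini)."""
    with open(CORPUS, "rb") as src, open(ARPA, "wb") as dst:
        # --discount_fallback helps on tiny or character-level corpora where
        # modified Kneser-Ney discount estimation otherwise fails.
        subprocess.run([LMPLZ, "-o", str(order), "--discount_fallback"],
                       stdin=src, stdout=dst, check=True)

def check_lm() -> None:
    """Fail loudly if the ARPA file is empty or lacks the \\data\\ header."""
    size = os.path.getsize(ARPA)
    print(f"{ARPA}: {size} bytes")
    if size == 0:
        raise SystemExit("ARPA file is empty -- the LM step failed upstream.")
    with open(ARPA, encoding="utf-8", errors="replace") as f:
        head = f.read(200)
    if "\\data\\" not in head:
        raise SystemExit("File does not start with an ARPA \\data\\ header.")

if __name__ == "__main__":
    build_lm()
    check_lm()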

moses.ini

#########################
### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0

# mapping steps
[mapping]
0 T 0
1 T 1

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/workspaces/BashantaraAI/en-mr/align_data/model/phrase-table.gz input-factor=0 output-factor=0
PhraseDictionaryMemory name=TranslationModel1 table-limit=100 num-features=4 path=/workspaces/BashantaraAI/en-mr/align_data/model/phrase-table.gz input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/workspaces/BashantaraAI/en-mr/align_data/model/reordering-table.wbe-msd-bidirectional-fe.gz
Distortion
KENLM name=LM0 factor=0 path=/workspaces/BashantaraAI/en-mr/lm/train.arpa.en order=3 oov-feature=1

# dense weights for feature functions
[weight]
# The default weights are NOT optimized for translation quality. You MUST tune the weights.
# Documentation for tuning is here: http://www.statmt.org/moses/?n=FactoredTraining.Tuning 
UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
TranslationModel1= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5 -100
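Before running the transliteration module, it can also help to confirm that every model file referenced in this moses.ini actually exists and is non-empty. A small sketch, assuming the moses.ini lives under the model directory shown in the paths above (adjust as needed):

import os
import re

# Assumption: moses.ini location in this setup; adjust to where your config lives.
MOSES_INI = "/workspaces/BashantaraAI/en-mr/align_data/model/moses.ini"

def check_model_files(ini_path: str) -> None:
    """Check every path= referenced in the config for existence and non-zero size."""
    with open(ini_path, encoding="utf-8") as f:
        for line in f:
            m = re.search(r"path=(\S+)", line)
            if not m:
                continue
            path = m.group(1)
            if not os.path.exists(path):
                print(f"MISSING  {path}")
            elif os.path.getsize(path) == 0:
                print(f"EMPTY    {path}")   # e.g. lm/train.arpa.en in this report
            else:
                print(f"OK       {path} ({os.path.getsize(path)} bytes)")

if __name__ == "__main__":
    check_model_files(MOSES_INI)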

Error in mosesdecoder/scripts/Transliteration/train-transliteration-module.pl:

Error: 'sVok[z]<sVok[z+1]' ::: in Source /workspaces/BashantaraAI/mgiza/mgizapp/src/mkcls/KategProblemTest.cpp:159
ERROR: Execution of: /workspaces/BashantaraAI/mgiza/mgizapp/bin/mkcls -c50 -n2 -p/workspaces/BashantaraAI/en-mr/translit_out/training/corpus.mr -V/workspaces/BashantaraAI/en-mr/translit_out/training/prepared/mr.vcb.classes opt
  died with signal 11, with coredump
.....

cat: /workspaces/BashantaraAI/en-mr/translit_out/training/prepared/mr-en-int-train.snt: No such file or directory
ERROR: Failed to get number of lines in /workspaces/BashantaraAI/en-mr/translit_out/training/prepared/mr-en-int-train.snt at mosesdecoder/scripts/training/train-model.perl line 1132.

.....
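One plausible reading of the log above: mkcls crashed (signal 11) because its input corpus was empty or degenerate, which would also explain why mr-en-int-train.snt was never written in the following step. This is a hypothesis, not a confirmed diagnosis. A hedged pre-flight check of the training corpus, using the corpus.mr path from the log (the corpus.en name is an assumption based on the usual Moses training layout):

import os

# corpus.mr is taken from the error log above; corpus.en is an assumed sibling file.
TRAIN_DIR = "/workspaces/BashantaraAI/en-mr/translit_out/training"
FILES = [os.path.join(TRAIN_DIR, "corpus.mr"),
         os.path.join(TRAIN_DIR, "corpus.en")]

def count_lines(path: str) -> int:
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def preflight() -> None:
    """Report missing/empty corpus sides and mismatched line counts before alignment."""
    counts = {}
    for path in FILES:
        if not os.path.exists(path):
            print(f"MISSING: {path}")
            continue
        if os.path.getsize(path) == 0:
            print(f"EMPTY:   {path}  (a likely trigger for the mkcls crash)")
            continue
        counts[path] = count_lines(path)
        print(f"OK:      {path}  {counts[path]} lines")
    if len(set(counts.values())) > 1:
        print("WARNING: source/target line counts differ -- alignment will fail.")

if __name__ == "__main__":
    preflight()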