danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

Wrong parsing of .counts files #100

Closed micr0cuts closed 5 years ago

micr0cuts commented 5 years ago

https://github.com/danpovey/pocolm/blob/5328e4317acb7dea96fba5c38b048100abaec5d1/scripts/get_unigram_weights.py#L40

The .counts files have the format <count> <word>, so the loop in get_unigram_weights.py should look like this from line 40:

for line in f:
    # Each line is "<count> <word>": the count comes first, then the word.
    fields = line.split()
    word_to_count[fields[1]] += int(fields[0])

With the current code, the only "words" matched across the dev and train sets are the unigram counts themselves, in string form!
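To make the effect concrete, here is a minimal, self-contained sketch. It is not the actual pocolm code: the helper name read_counts, the skipping of malformed lines, and the toy train/dev data are assumptions made for illustration. It parses "<count> <word>" lines as proposed, then shows what the train/dev overlap looks like if the first field is (hypothetically) used as the key instead.

```python
from collections import defaultdict


def read_counts(lines):
    """Accumulate per-word counts from "<count> <word>" lines.

    Hypothetical helper for illustration; not the function in pocolm.
    """
    word_to_count = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption of this sketch: skip blank/malformed lines
        word_to_count[fields[1]] += int(fields[0])
    return word_to_count


# Toy data standing in for the contents of train and dev .counts files.
train = ["10 the", "4 cat", "1 sat"]
dev = ["3 the", "1 cat", "1 mat"]

train_counts = read_counts(train)
dev_counts = read_counts(dev)

# Keyed on the word field, the overlap is the shared vocabulary.
shared = sorted(set(train_counts) & set(dev_counts))
print(shared)  # ['cat', 'the']

# Keyed on the first field (a hypothetical mis-parse), the only "words"
# that match across the two sets are count strings.
mis_shared = sorted({l.split()[0] for l in train} & {l.split()[0] for l in dev})
print(mis_shared)  # ['1']
```

In the mis-parsed case the overlap depends only on which count values happen to appear in both files, which matches the symptom described above.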

danpovey commented 5 years ago

Do you know how to make a pull request? Might be a python3 issue.

danpovey commented 5 years ago

Oh I see, it's not a python3 issue. This should only affect metaparameter initialization, but it's still a bug. I'll fix it.

danpovey commented 5 years ago

I resolved this via push (should have done it via PR, but anyway)... so it's resolved.