Closed GoogleCodeExporter closed 9 years ago
Hi, Joseph, thanks for your question.
The format should be as in the example directory:
test/edu/berkeley/nlp/lm/io/googledir
Specifically, the directory should look like this:
# 1gms/vocab_cs.gz [here, vocab_cs.gz should have the unigram frequencies
sorted in decreasing order of frequency]
# 2gms/2gm-0001.gz 2gm-0002.gz …
# 3gms/3gm-0001.gz 3gm-0002.gz …
# ...
Given your directory structure, you will need to create an [n]gms directory for
each n=1..5, and then copy or soft-link all files of each order into the
corresponding [n]gms directory. You might also need to create vocab_cs.gz by
sorting the unigram file, though that file already ships with at least the
English distribution (in 1gms).
I have added additional documentation about this to the example script for the
next release.
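For anyone setting this up by hand, the steps above could be sketched roughly as follows. This is only an illustration, not part of the library: it assumes the n-gram files currently sit in one flat directory and are named like 2gm-0001.gz, and that the unigram file has one "word<TAB>count" entry per line. The function names link_ngram_files and write_sorted_vocab are made up for this example.

```python
import gzip
import re
from pathlib import Path

# Assumed naming scheme (e.g. "2gm-0001.gz"); adjust to your own files.
NGRAM_FILE = re.compile(r"^(\d)gm-\d+\.gz$")


def link_ngram_files(src_dir, dest_dir):
    """Soft-link every n-gram file in a flat src_dir into dest_dir/[n]gms/."""
    src, dest = Path(src_dir), Path(dest_dir)
    for path in sorted(src.iterdir()):
        match = NGRAM_FILE.match(path.name)
        if match is None:
            continue  # skip anything that does not follow the assumed naming
        order_dir = dest / f"{match.group(1)}gms"
        order_dir.mkdir(parents=True, exist_ok=True)
        link = order_dir / path.name
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(path.resolve())


def write_sorted_vocab(unigram_gz, vocab_cs_gz):
    """Write the unigrams sorted by decreasing count.

    Assumes each input line is "word<TAB>count" (an assumption about your data).
    """
    with gzip.open(unigram_gz, "rt", encoding="utf-8") as f:
        entries = [line.rstrip("\n").rsplit("\t", 1) for line in f if line.strip()]
    entries.sort(key=lambda entry: int(entry[1]), reverse=True)
    Path(vocab_cs_gz).parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(vocab_cs_gz, "wt", encoding="utf-8") as f:
        for word, count in entries:
            f.write(f"{word}\t{count}\n")
```

Calling link_ngram_files("my_flat_dir", "googledir") and then write_sorted_vocab on your unigram file with the target "googledir/1gms/vocab_cs.gz" should give a layout like the one listed above. The paths here are placeholders. The sort loads the whole unigram file into memory, which is fine for a vocabulary file but would not be a good idea for the higher-order files.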
Original comment by adpa...@gmail.com
on 20 Nov 2011 at 5:00
Thanks, that worked great.
Original comment by tur...@gmail.com
on 24 Nov 2011 at 2:56
Original issue reported on code.google.com by
tur...@gmail.com
on 19 Nov 2011 at 11:55