emawind84 / mitlm

Automatically exported from code.google.com/p/mitlm
BSD 3-Clause "New" or "Revised" License

unicode input to mitlm #16

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Dear Sir,

I would like to make a request.

Could you modify mitlm to accept Unicode files as input?

Original issue reported on code.google.com by gsrvijay...@gmail.com on 22 Jan 2010 at 6:07

GoogleCodeExporter commented 8 years ago
Hi, you could probably work around this by mapping every Unicode word in your corpus to an ID and running mitlm on the mapped corpus. The resulting language model will contain only IDs, so you will also need to map them back to the Unicode strings.

If your Unicode characters can be converted to some 8-bit charset, try iconv. For Czech I use:
'iconv -f utf8 -t iso8859-2 < corpora_unicode.txt > corpora_iso.txt'

Good luck,
Miso

Original comment by michal.f...@gmail.com on 23 Feb 2010 at 9:22
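
For anyone trying the mapping suggestion above, here is a minimal sketch in Python. It is not part of mitlm; the ID scheme (`w0`, `w1`, ...) and the function names are assumptions made for illustration. It replaces every whitespace-separated token with an ASCII ID, and the same helper with the inverted table maps the IDs back while leaving unknown tokens (numbers, `<s>`, and so on) untouched.

```python
import re

def build_vocab(lines):
    """Assign an ASCII ID ('w0', 'w1', ...) to each distinct token, in order of first use."""
    word_to_id = {}
    for line in lines:
        for word in line.split():
            if word not in word_to_id:
                word_to_id[word] = "w%d" % len(word_to_id)
    return word_to_id

def map_tokens(line, table):
    """Replace each token found in `table`; other tokens and all whitespace are kept as-is."""
    return re.sub(r"\S+", lambda m: table.get(m.group(0), m.group(0)), line)

# Tiny in-memory corpus standing in for a real file (hypothetical data).
corpus = ["dobrý den", "dobrý večer"]

word_to_id = build_vocab(corpus)
id_to_word = {v: k for k, v in word_to_id.items()}

mapped = [map_tokens(line, word_to_id) for line in corpus]    # feed this to mitlm
restored = [map_tokens(line, id_to_word) for line in mapped]  # apply to the model output
```

The inverse step here is only shown on plain text lines; applying it to a real model file assumes that none of the IDs collide with other strings in that file.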

GoogleCodeExporter commented 8 years ago
UTF-8 without BOM seems to work fine under Windows (0.4) and under Ubuntu (r50). Interestingly, it doesn't work under WINE (0.4).

This is the beauty of UTF-8.

Note: I haven't checked UTF-8 with a BOM. It would be nice if mitlm ignored the BOM, if it doesn't already.

Original comment by adubin...@almson.net on 2 Nov 2012 at 7:58
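
Since BOM handling in mitlm is unverified in this thread, one cautious option is to strip the BOM before running it. The following is a minimal sketch in Python, assuming the input is UTF-8; the function name `strip_bom` is made up for this example.

```python
import codecs

def strip_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# In-memory example; for a real corpus, read and write the file in binary mode.
with_bom = codecs.BOM_UTF8 + "dobrý den\n".encode("utf-8")
clean = strip_bom(with_bom)
```

Python's built-in `utf-8-sig` codec does the same thing when decoding text, if working with strings rather than bytes is more convenient.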

GoogleCodeExporter commented 8 years ago
Issue 21 has been merged into this issue.

Original comment by giuliop...@gmail.com on 3 Feb 2013 at 6:27

GoogleCodeExporter commented 8 years ago
I'm going to open a separate ticket for this, but in case anyone else is looking at this issue for a solution to MITLM seemingly taking a dislike to certain characters in its input: I found that when using count files, a # character at the start of a token will cause it to crash with:

estimate-ngram: src/NgramModel.cpp:800: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)

I'm guessing it interprets # as a comment marker when it occurs at the start of a line in the counts file. That is not very helpful, especially since estimate-ngram -wc will itself write out lines beginning with # if a token beginning with # (like a hashtag) occurs in the source text.

Original comment by matt...@swiftkey.com on 10 Feb 2015 at 6:31
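
If the guess above is right, a possible workaround is to escape a leading # in every token before the text reaches mitlm and to restore it afterwards. This is an untested sketch in Python, not a confirmed fix; the placeholder `__HASH__` is a hypothetical marker and must not already occur in the corpus.

```python
import re

PLACEHOLDER = "__HASH__"  # hypothetical marker; choose one that is absent from your data

def escape_hash(line):
    """Replace a '#' at the start of any token with the placeholder."""
    return re.sub(r"(?<!\S)#", PLACEHOLDER, line)

def unescape_hash(line):
    """Undo escape_hash: turn a token-initial placeholder back into '#'."""
    return re.sub(r"(?<!\S)" + re.escape(PLACEHOLDER), "#", line)

original = "#mitlm is #1 a#b"
escaped = escape_hash(original)     # no token starts with '#' any more
restored = unescape_hash(escaped)   # same as the original line
```

Only a # at the start of a token is touched, since a token in the middle of the source text can still end up at the start of a line in the counts file.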