eddieantonio / mitlm

Automatically exported from code.google.com/p/mitlm
http://code.google.com/p/mitlm
BSD 3-Clause "New" or "Revised" License

maximum line length #41

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Run estimate-ngram on an input text file with lines longer than 4096 
characters, where the 4096th character is in the middle of a word.
2. Check the LM file for partial words created by splitting the word above.

What is the expected output? What do you see instead?

In a very long line containing e.g. the word "defect", where "c" is the 4096th 
character, the non-words "def" and "ect" appear in the LM.
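
To illustrate the symptom, here is a minimal Python sketch of a reader that only sees a fixed number of characters at a time and tokenizes each piece on its own. It is not mitlm's code: the function name `tokenize_in_chunks` and the small limit of 15 (standing in for 4096) are assumptions made only for this example.

```python
def tokenize_in_chunks(line, limit):
    """Split `line` into words while only looking at `limit` characters at a time.

    Hypothetical helper: it mimics a reader with a fixed-size line buffer,
    where each buffer-full is tokenized independently of the next one.
    """
    words = []
    for start in range(0, len(line), limit):
        chunk = line[start:start + limit]
        # A word that straddles the chunk boundary is cut into two tokens.
        words.extend(chunk.split())
    return words


# "defect" occupies characters 13-18, so a limit of 15 cuts it after "def".
line = "a a a a a a defect found"
print(tokenize_in_chunks(line, 15))
# ['a', 'a', 'a', 'a', 'a', 'a', 'def', 'ect', 'found']

# With a limit larger than the line, the word stays intact.
print(tokenize_in_chunks(line, 64))
# ['a', 'a', 'a', 'a', 'a', 'a', 'defect', 'found']
```

Any word that happens to sit on the boundary is affected the same way, so the non-words show up in the vocabulary with no warning.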

What version of the product are you using? On what operating system?

0.4.1 on Ubuntu 12.04

Please provide any additional information below.

Original issue reported on code.google.com by cbba...@gmail.com on 18 Jul 2014 at 7:28

GoogleCodeExporter commented 8 years ago
I'm analyzing medical dictations where the whole dictation is on one text line.

4096 is too short.  I changed it to 65536 and the problem went away.
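
Whether a given limit is large enough depends on the corpus, so it may help to check the input before training. The following is a rough sketch, not part of mitlm: the helper name `lines_over_limit` is hypothetical, and it assumes the limit is counted in UTF-8 bytes excluding the newline, which may differ slightly from how the tool counts.

```python
def lines_over_limit(lines, limit=65536):
    """Return (line_number, byte_length) for every line longer than `limit` bytes.

    Assumption: the length is measured in UTF-8 bytes without the trailing newline.
    """
    too_long = []
    for number, line in enumerate(lines, start=1):
        length = len(line.rstrip("\n").encode("utf-8"))
        if length > limit:
            too_long.append((number, length))
    return too_long


sample = ["short line\n", "x" * 70000 + "\n", "y" * 4097 + "\n"]
print(lines_over_limit(sample))        # [(2, 70000)]
print(lines_over_limit(sample, 4096))  # [(2, 70000), (3, 4097)]
```

For a real corpus, an open file object can be passed in place of the list, e.g. `lines_over_limit(open("corpus.txt", encoding="utf-8"))`, where the file name is a placeholder.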

Original comment by cbba...@gmail.com on 18 Jul 2014 at 7:46

GoogleCodeExporter commented 8 years ago
Correction: I obviously meant "f" not "c" in the original description.

Original comment by cbba...@gmail.com on 18 Jul 2014 at 7:49