gchrupala / morfette

Supervised learning of morphology
BSD 2-Clause "Simplified" License

morfette's model loading way too slow #16

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. train any model with morfette
2. predict anything
3. loading time can take up to 2 or 3 minutes

The problem is that whenever I have to launch morfette on a bunch of files (say
10,000 for the French Gigaword), the model-loading overhead is way too costly.
Grzegorz, is there a way to serialize the models somehow? I saw that someone
submitted a bug report a year ago on that topic, and it's true that it would
be great to have an option to load a fast version of the model... I'm sure that
even a gzipped one (using the zlib library, which exists for Haskell) would speed
everything up, unless there's a race condition somewhere...

By the way, very good score on Italian :) (around 95% on lemmas on the Italian
dependency bank (3,500 sentences) plus one huge lexicon, 400,000 entries)

Original issue reported on code.google.com by djame.seddah@gmail.com on 20 Oct 2011 at 12:24

GoogleCodeExporter commented 9 years ago
The models are serialized using the binary package
<http://hackage.haskell.org/package/binary>, which can be hard to get to work
without causing a stack overflow. Currently there are a bunch of extra traversals
of the data structures in src/GramLab/Perceptron/Model.hs to make sure there
are no lazy thunks hanging around. These days there are some alternative
serialization packages; maybe switching to one of them would improve things.
Feel free to try that, I'd love to get faster loading times.
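
For what it's worth, here is a minimal sketch (not morfette's actual code; `saveModel` and `loadModel` are hypothetical names) of the decode-and-force pattern that keeps lazy thunks from surviving deserialization, using binary together with deepseq:

```haskell
import qualified Data.Binary as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Map.Strict as M
import Control.DeepSeq (force)
import Control.Exception (evaluate)

-- Serialize a model-like Map to disk via its sorted association list.
saveModel :: FilePath -> M.Map String Double -> IO ()
saveModel path m = BL.writeFile path (B.encode (M.toAscList m))

-- Read it back and force the whole structure to normal form,
-- so no lazy thunks are left hanging after the decode.
loadModel :: FilePath -> IO (M.Map String Double)
loadModel path = do
  bytes <- BL.readFile path
  evaluate (force (M.fromAscList (B.decode bytes)))
```

The `force`/`evaluate` pair does in one place what the extra hand-written traversals do in Model.hs; whether it is faster in practice would need measuring.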

Other than that, 3 minutes doesn't seem like much overhead if you're tagging a
reasonably large file. Just choose the right granularity, so that the overhead
doesn't dominate the tagging time, no?

Original comment by pitekus on 20 Oct 2011 at 9:33

GoogleCodeExporter commented 9 years ago
Hey,
3 minutes is not that much in absolute terms; it's just that I'm running
everything on a task-farming setup with thousands of pipelines like

cat foo | morfette predict model | reinsert_tokens | parse | whatever

and the morfette model loading adds a lot of time to our cluster time
management.

Original comment by djame.seddah@gmail.com on 20 Oct 2011 at 9:39

GoogleCodeExporter commented 9 years ago
If you make foo a large enough file, this still won't be a problem, no? Is there
some reason why you need to process the data in tiny chunks?

Original comment by pitekus on 20 Oct 2011 at 9:54

GoogleCodeExporter commented 9 years ago
I think I found the bug :)
The file I was trying to morfetize looked similar to the treetagger format (one
word per line, etc.), but it didn't contain any blank lines to mark the ends of
sentences, so morfette was trying to build a single sentence of 170,000 tokens
and work with it :) So morfette looked stuck in the model-loading phase, but
actually it wasn't.
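
For reference, the blank-line sentence splitting that format relies on can be sketched like this (a toy function, not morfette's actual reader):

```haskell
-- Split one-token-per-line input into sentences at blank lines,
-- mirroring the treetagger-style format morfette expects.
-- Without blank lines, the whole file collapses into one "sentence".
splitSentences :: [String] -> [[String]]
splitSentences = filter (not . null) . foldr step [[]]
  where
    step line acc@(cur:rest)
      | null (words line) = [] : acc          -- blank line: start a new sentence
      | otherwise         = (line : cur) : rest -- token line: extend current sentence
    step _ [] = []  -- unreachable: the accumulator always starts non-empty
```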

Moral of the story: maybe print a line after the model is loaded (plus the time
it took to load), something like

Model loaded in XX seconds, processing STDIN
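
A sketch of what that could look like (a hypothetical helper, not in morfette), timing an IO action and reporting on stderr so it doesn't pollute the tagged output:

```haskell
import Data.Time.Clock (getCurrentTime, diffUTCTime)
import System.IO (hPutStrLn, stderr)

-- Run an IO action and report the elapsed time on stderr,
-- e.g. wrapped around model loading before reading STDIN.
timed :: String -> IO a -> IO a
timed label act = do
  t0 <- getCurrentTime
  x  <- act
  t1 <- getCurrentTime
  hPutStrLn stderr $
    label ++ " in " ++ show (diffUTCTime t1 t0) ++ ", processing STDIN"
  return x
```

Usage would be something like `model <- timed "Model loaded" (loadModel path)`, where `loadModel` stands in for whatever morfette's actual loading function is.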

Original comment by djame.seddah@gmail.com on 20 Oct 2011 at 10:03

GoogleCodeExporter commented 9 years ago
Ah I see. Good then.

Original comment by pitekus on 21 Oct 2011 at 8:31