The models are serialized using the binary package
<http://hackage.haskell.org/package/binary>, which can be hard to get to work
without causing a stack overflow. Currently there are a bunch of extra traversals
of the data structures in src/GramLab/Perceptron/Model.hs to make sure there
are no lazy thunks hanging around. These days there are some alternative
serialization packages; maybe switching to one of them would improve things.
Feel free to try that, I'd love to get faster loading times.
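For illustration, a minimal sketch of this kind of thunk forcing, assuming the
model type has Binary and NFData instances (saveModel and loadModel are
hypothetical names, not Morfette's actual API):

    import Control.DeepSeq (NFData, force)
    import Control.Exception (evaluate)
    import Data.Binary (Binary, decode, encode)
    import qualified Data.ByteString.Lazy as BL

    -- Deep-evaluate the model before encoding, so serialization
    -- never walks a structure full of unevaluated thunks.
    saveModel :: (Binary a, NFData a) => FilePath -> a -> IO ()
    saveModel path model = BL.writeFile path (encode (force model))

    -- Force the decoded structure at load time instead of leaving
    -- thunks to be evaluated lazily at first use.
    loadModel :: (Binary a, NFData a) => FilePath -> IO a
    loadModel path = do
      bytes <- BL.readFile path
      evaluate (force (decode bytes))

Here force (from Control.DeepSeq) deep-evaluates the structure, and evaluate
ties that evaluation to load time rather than to first use.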
Other than that, 3 minutes doesn't seem like much overhead if you're tagging a
reasonably large file. Just choose the right granularity, so that the overhead
doesn't dominate the tagging time, no?
Original comment by pitekus on 20 Oct 2011 at 9:33
Hey,
3 min isn't that much in absolute terms; it's just that I'm running everything
on a task-farming setup with thousands of jobs like

    cat foo | morfette predict model | reinsert_tokens | parse | whatever

and the morfette model loading adds a lot of time to our cluster time
management...
Original comment by djame.seddah@gmail.com on 20 Oct 2011 at 9:39
If you make foo a large enough file, this still won't be a problem, no? Is there
some reason why you need to process the data in tiny chunks?
Original comment by pitekus on 20 Oct 2011 at 9:54
I think I found the bug :)
The file I was trying to morfetize looked similar to the TreeTagger format (one
word per line, etc.), but it didn't contain any blank lines to mark the ends of
sentences, so morfette was trying to build a single sentence of 170,000 tokens
and work with it :) So morfette looked stuck in the model-loading phase, but
actually it wasn't.
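For illustration, a hypothetical input in the expected format: one token per
line, with a blank line marking the sentence boundary, which is what was
missing from my file:

    The
    cat
    sleeps
    .

    It
    purrs
    .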
Moral: maybe print a line after the model is loaded (plus the time it took to
load), something like

    Model loaded in XX seconds, processing STDIN
Original comment by djame.seddah@gmail.com on 20 Oct 2011 at 10:03
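For illustration, a minimal sketch of such a startup message in Haskell;
loadModel here is a hypothetical stand-in, not Morfette's actual loader, and
"model" is a placeholder path:

    import Control.Exception (evaluate)
    import Data.Time.Clock (diffUTCTime, getCurrentTime)
    import System.IO (hPutStrLn, stderr)

    -- Hypothetical stand-in for the real model loader: read the
    -- whole file and force it, so the timing below is meaningful.
    loadModel :: FilePath -> IO String
    loadModel path = do
      contents <- readFile path
      _ <- evaluate (length contents)
      return contents

    main :: IO ()
    main = do
      t0 <- getCurrentTime
      _model <- loadModel "model"
      t1 <- getCurrentTime
      -- Report on stderr so it doesn't mix with tagged output on stdout.
      hPutStrLn stderr ("Model loaded in " ++ show (diffUTCTime t1 t0)
                        ++ ", processing STDIN")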
Ah I see. Good then.
Original comment by pitekus on 21 Oct 2011 at 8:31
Original issue reported on code.google.com by djame.seddah@gmail.com on 20 Oct 2011 at 12:24