sushain97 opened 5 years ago
Yeah, opening the FST is quite slow: just try echoing some line a hundred times into lt-proc in the shell vs. all in one go (for some megabytes-large FST).
That was actually the reason why null-flush was such a huge improvement over the previous approaches when building web APIs on top of Apertium: loading the dictionaries only once can save tens or hundreds of milliseconds on each translation.
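To make the idea concrete, here's a minimal sketch of the null-flush pattern: a long-lived worker process reads requests terminated by a NUL byte and flushes a NUL-terminated reply for each one, so its (expensive-to-load) data stays in memory across requests. The child process here is a hypothetical stand-in for lt-proc that just uppercases its input; the real tool's behavior with `-z` is analogous.

```python
import subprocess
import sys

# Hypothetical stand-in for a null-flush-capable lt-proc: it reads
# input up to a NUL byte, "processes" it (here: uppercases), writes a
# NUL-terminated reply, and flushes -- then waits for the next request.
CHILD = r"""
import sys
buf = b""
while True:
    ch = sys.stdin.buffer.read(1)
    if not ch:
        break
    if ch == b"\0":
        sys.stdout.buffer.write(buf.upper() + b"\0")
        sys.stdout.buffer.flush()
        buf = b""
    else:
        buf += ch
"""

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def translate(text: str) -> str:
    # One request/response round trip; the child stays alive between
    # calls, so its (imagined) dictionaries are loaded only once.
    proc.stdin.write(text.encode() + b"\0")
    proc.stdin.flush()
    reply = b""
    while (ch := proc.stdout.read(1)) not in (b"\0", b""):
        reply += ch
    return reply.decode()

print(translate("cats"))  # many calls, one process start-up
print(translate("dogs"))
```

The saving is exactly the one described above: process start-up and dictionary loading happen once instead of once per request.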
Sounds good. After @Vaydheesh wraps all the stuff in the path of a translation (#49), we'll circle back to this and #50. It might even be possible to transition away from null-flush in APy after this works?
If APy starts using apertium-python (which I think is the plan), I'd say probably... But I'm not sure whether it may have unknown/undesired side effects (a.k.a. bugs), like an FST staying in the wrong state after a translation is done. Null-flush was taking care of those things, so maybe the same underlying code can be used by apertium-python, while APy needs to know nothing about it.
That's what Pasquale Minervini's old C++ apertium-service (http://wiki.apertium.org/wiki/Apertium-service) did. But then you lose out on the automatic parallelism that Unix pipelines give (making you basically reimplement `|` and the buffering involved there), and you have to ensure the CLI and library interfaces are always the same (no missing or slightly different options).
-- Kevin Brubeck Unhammer unhammer@mm.st
But then you lose out on the automatic parallelism that Unix pipelines give (making you basically reimplement `|` and the buffering involved there)
Ah, good point. Right now the implementation in apertium-python will block at each step of the pipeline.
However, on a relatively loaded server, do we expect much benefit for streaming the output through instead of buffering it? Presumably the other cores could be occupied with another translation rather than streaming the output through each step.
Guess we won't know until we try :)
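For what it's worth, getting shell-style concurrency back in Python is not much code: connect each stage's stdout to the next stage's stdin via OS pipes, and the kernel handles the buffering between processes. The stage commands below are stand-ins (lowercasing and a toy tag rewrite), not real Apertium tools, but the wiring would be the same with lt-proc and friends.

```python
import subprocess
import sys

# Two stand-in pipeline stages (hypothetical; a real Apertium pipeline
# would run lt-proc, apertium-transfer, etc. here).
stage_lower = [sys.executable, "-c",
               "import sys; sys.stdout.write(sys.stdin.read().lower())"]
stage_tag = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().replace('cats', 'cat<pl>'))"]

def run_pipeline(text: str, stages) -> str:
    """Run `stages` as concurrent processes chained by OS pipes,
    like `cmd1 | cmd2 | ...` in a shell."""
    procs = []
    for cmd in stages:
        if procs:
            p = subprocess.Popen(cmd, stdin=procs[-1].stdout,
                                 stdout=subprocess.PIPE)
            # Close our copy of the previous stdout so EOF propagates.
            procs[-1].stdout.close()
        else:
            p = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                                 stdout=subprocess.PIPE)
        procs.append(p)
    procs[0].stdin.write(text.encode())
    procs[0].stdin.close()
    out = procs[-1].stdout.read().decode()
    for p in procs:
        p.wait()
    return out

print(run_pipeline("Cats and Dogs", [stage_lower, stage_tag]))
```

Whether the extra concurrency pays off on an already-loaded server is the open question above; this just shows the mechanism is cheap to try.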
@Vaydheesh are we done here?
Yes. I've updated all of the wrappers.
@Vaydheesh wait. You never made the changes to ensure that if I run e.g. `a = apertium.Analyzer('en'); a.analyze('cats')`, the dictionary is not loaded twice. As I mentioned in my review, you need to keep e.g. the lttoolbox.FST object around. Otherwise, there's been no optimization at all.
Oops, I forgot that part. I'll figure out a way to store the objects to speed up the process.
While our wrappers avoid the extra process overhead, I think they can still be improved.
e.g. we only need to run https://github.com/apertium/lttoolbox/blob/master/python/lttoolbox.i#L23-L24 one time for any given Analyzer object. On another note, we should also try to do something smarter instead of re-opening files everywhere (but this might be harder to get working).
@unhammer, is doing the first thing I mentioned (i.e. only loading the dictionary once) meaningful in terms of improving performance? Sounds to me like it would be.