apertium / apertium-python

Now you can even use Apertium from Python
GNU General Public License v3.0

Optimize wrappers #52

Open sushain97 opened 5 years ago

sushain97 commented 5 years ago

While our wrappers avoid the extra process overhead, I think they can still be improved.

e.g., we only need to run https://github.com/apertium/lttoolbox/blob/master/python/lttoolbox.i#L23-L24 once for any given Analyzer object

On another note, we should also try to do something smarter instead of re-opening files everywhere (but this might be harder to get working).

@unhammer, is doing the first thing I mentioned (i.e. only loading the dictionary once) meaningful in terms of improving performance? Sounds to me like it would be.

unhammer commented 5 years ago

Yeah, opening the FST is quite slow - just try echoing some line into lt-proc a hundred times in the shell vs. sending all of them in one go (with a megabytes-large FST)
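The per-invocation startup cost can be demonstrated without lt-proc itself. The sketch below uses the Python interpreter as a stand-in child process (an assumption, since lt-proc and a large FST may not be available here); the principle is the same: spawning one process per line pays the startup cost every time, while one process handling all lines pays it once.

```python
# Illustration of per-process startup overhead (the cost you pay when
# re-running lt-proc for every input line). The Python interpreter is a
# stand-in for lt-proc; its startup plays the role of FST loading.
import subprocess
import sys
import time

ECHO = "import sys; sys.stdout.write(sys.stdin.read())"
lines = ["kick the bucket"] * 10  # pretend these are 10 inputs to lt-proc

# One process per line: startup cost is paid 10 times.
start = time.perf_counter()
for line in lines:
    subprocess.run([sys.executable, "-c", ECHO],
                   input=line, capture_output=True, text=True)
per_line = time.perf_counter() - start

# All lines through a single process: startup cost is paid once.
start = time.perf_counter()
subprocess.run([sys.executable, "-c", ECHO],
               input="\n".join(lines), capture_output=True, text=True)
batched = time.perf_counter() - start

print(f"per-line: {per_line:.2f}s  batched: {batched:.2f}s")
```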

xavivars commented 5 years ago

That was actually the reason why null-flush was such a huge improvement over the previous approaches when building web APIs on top of Apertium: loading the dictionaries only once can save tens or hundreds of milliseconds per translation
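For readers unfamiliar with null-flush: the idea is to keep one long-running pipeline process alive and delimit each request and response with a NUL byte, so dictionary loading happens only once. A minimal sketch, with a trivial Python child standing in for an Apertium tool run in null-flush mode (`.upper()` is a placeholder for the real processing, not anything Apertium does):

```python
# Sketch of the null-flush protocol: one persistent child process, requests
# and responses delimited by "\0". The child below is a stand-in for e.g.
# lt-proc run with null-flush enabled.
import subprocess
import sys

CHILD = r'''
import sys
buf = []
while True:
    ch = sys.stdin.read(1)
    if ch == "":
        break  # EOF: parent closed the pipe
    if ch == "\0":
        # End of one request: emit the response, NUL-terminated, and flush
        # so the parent can read it without the process exiting.
        sys.stdout.write("".join(buf).upper() + "\0")
        sys.stdout.flush()
        buf = []
    else:
        buf.append(ch)
'''

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                        text=True)

def request(text):
    """Send one NUL-terminated request and read up to the NUL terminator."""
    proc.stdin.write(text + "\0")
    proc.stdin.flush()
    out = []
    while True:
        ch = proc.stdout.read(1)
        if ch == "\0" or ch == "":
            break
        out.append(ch)
    return "".join(out)

first = request("cats")   # both requests reuse the same child process,
second = request("dogs")  # so its startup cost is paid only once
print(first, second)
proc.stdin.close()
proc.wait()
```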

sushain97 commented 5 years ago

Sounds good. After @Vaydheesh wraps everything in the path of a translation (#49), we'll circle back to this and #50. It might even be possible to transition away from null-flush in APy after this works?

xavivars commented 5 years ago

If APy starts using apertium-python (which I think is the plan), I'd say probably... but I'm not sure whether it might have unknown/undesired side effects (a.k.a. bugs), like the FST staying in the wrong state after a translation finishes. Null-flush was taking care of those things, so maybe the same underlying code can be used by apertium-python while APy needs to know nothing about it.

unhammer commented 5 years ago

That's what Pasquale Minervini's old C++ apertium-service (http://wiki.apertium.org/wiki/Apertium-service) did. But then you lose out on the automatic parallelism that Unix pipelines give (you basically have to reimplement | and the buffering involved there), and you have to ensure the CLI and library interfaces are always the same (no missing or slightly different options).


sushain97 commented 5 years ago

> But then you lose out on the automatic parallelism that Unix pipelines gives (making you have to basically reimplement | and the buffering involved there)

Ah, good point. Right now the implementation in apertium-python will block at each step of the pipeline.

However, on a relatively loaded server, do we expect much benefit from streaming the output through instead of buffering it? Presumably the other cores could be busy with another translation rather than streaming the output through each step.
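The two strategies being compared can be sketched concretely. Below, two trivial subprocess stages stand in for pipeline steps like lt-proc and apertium-transfer (an assumption; the real modes are different): the "blocking" version runs each stage to completion and buffers its full output, while the "streaming" version connects stage 1's stdout directly to stage 2's stdin, like a shell `stage1 | stage2`, so the kernel pipe buffers between them and both stages can in principle run concurrently.

```python
# Blocking (buffer each stage's output) vs. streaming (connect the stages
# with a pipe). Each stage is a stand-in that uppercases its input.
import subprocess
import sys

STAGE = "import sys; sys.stdout.write(sys.stdin.read().upper())"
text = "a pipeline of stages\n"

# Blocking: run stage 1 fully, then feed its buffered output to stage 2
# (what the apertium-python implementation currently does per the comment
# above).
out1 = subprocess.run([sys.executable, "-c", STAGE], input=text,
                      capture_output=True, text=True).stdout
blocking = subprocess.run([sys.executable, "-c", STAGE], input=out1,
                          capture_output=True, text=True).stdout

# Streaming: wire stage 1's stdout straight into stage 2's stdin.
p1 = subprocess.Popen([sys.executable, "-c", STAGE],
                      stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
p2 = subprocess.Popen([sys.executable, "-c", STAGE],
                      stdin=p1.stdout, stdout=subprocess.PIPE, text=True)
p1.stdin.write(text)
p1.stdin.close()
p1.stdout.close()  # drop our copy so p2 sees EOF when p1 exits
streaming = p2.communicate()[0]
p1.wait()

print(blocking == streaming)  # same result either way; only timing differs
```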

unhammer commented 5 years ago

Guess we won't know until we try :)

sushain97 commented 5 years ago

@Vaydheesh are we done here?

singh-lokendra commented 5 years ago

Yes, I've updated all of the wrappers.

sushain97 commented 5 years ago

@Vaydheesh wait. You never made the changes to ensure that if I run e.g. `a = apertium.Analyzer('en'); a.analyze('cats')`, the dictionary is not loaded twice. As I mentioned in my review, you need to keep e.g. the `lttoolbox.FST` object around. Otherwise, there's been no optimization at all.
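One way to structure the requested fix is to memoize the expensive load step per dictionary path. The sketch below is hypothetical, not the actual apertium-python code: `load_fst`, the dictionary path, and the `Analyzer` body are illustrative stand-ins, with a dict playing the role of the `lttoolbox.FST` object and a counter instrumenting how often the load actually runs.

```python
# Hypothetical sketch of caching the loaded FST so that repeated Analyzer
# construction / analyze() calls do not reload the dictionary.
import functools

LOAD_COUNT = 0  # instrumentation: how many times the "slow" load ran

@functools.lru_cache(maxsize=None)
def load_fst(dictionary_path):
    """Stand-in for lttoolbox's slow FST construction; cached per path."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"path": dictionary_path}  # pretend this is an lttoolbox.FST

class Analyzer:
    def __init__(self, lang):
        # Illustrative path layout, not necessarily the real one; the point
        # is that load_fst() only does real work the first time per path.
        self.fst = load_fst(f"apertium-{lang}/{lang}.automorf.bin")

    def analyze(self, text):
        # Real code would call into the wrapped lttoolbox here; we just
        # show that the cached FST object is reused across calls.
        return f"analyzed {text!r} with {self.fst['path']}"

a = Analyzer("en")
a.analyze("cats")
a.analyze("dogs")
b = Analyzer("en")  # a second Analyzer for the same language
print(LOAD_COUNT)   # the dictionary was only loaded once
```

`functools.lru_cache` is just one option; storing the object as an attribute of a module-level registry, or on the `Analyzer` class keyed by path, would work equally well.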

singh-lokendra commented 5 years ago

Oops, I forgot that part. I'll figure out a way to store the objects to speed up the process.