Esukhia / PyTib


does do_compound work / general performance improvement? #8

Open mikkokotila opened 6 years ago

mikkokotila commented 6 years ago

I did some testing, diffing the output before and after do_compound, and it seems that at least with my test data (some snippets of the Rinchen Terdzo) there was no substantial difference. do_compound takes almost half of the processing time, though. So my question is whether do_compound has been tested properly, in the sense of quantifying how it changes the output. If the change is marginal but the performance cost is significant, it might be something to think about.

If the current implementation does yield significant functional value, then I think it would be important to go through the code and see how it can be optimized. I found some very quick wins, such as not calling len() inside list comprehensions. I have not got very far with the code yet, but in general, where a list comprehension is used to compare two lists of strings, a set intersection is usually much faster, as sketched below.
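For what it's worth, here is a small Python sketch of the two patterns I mean. The variable names are made up for illustration and not taken from the pytib code base:

```python
# Illustrative sketch only: the names below are hypothetical, not pytib's.

segmented_a = ["bka'", "drin", "che", "ba"]
segmented_b = ["bka'", "drin", "chen", "po"]

# Slower pattern: the membership test inside the comprehension is O(n*m).
common_slow = [w for w in segmented_a if w in segmented_b]

# Faster pattern: build sets once, then intersect in roughly O(n + m).
# (Note that the set version loses ordering and duplicates.)
common_fast = set(segmented_a) & set(segmented_b)

# Similarly, a len() call on a fixed object inside the condition is
# re-evaluated on every iteration; hoisting it out avoids that.
longest_entry = "bka' drin"
entry_len = len(longest_entry)  # computed once
matches = [w for w in segmented_a if len(w) == entry_len]
```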

drupchen commented 6 years ago

First of all, thank you for showing so much interest!

There surely is plenty of space for improvements.

I thought of the segmentation in two steps:

  1. segment into the smallest possible units
  2. apply rules to adjust the result

The way I reduce the segmentation to minimal units is by using minimal words as the lexical resource in uncompound_lexicon.txt, combined with a maximal matching algorithm.
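To make step 1 concrete, here is a hedged Python sketch of a greedy maximal-matching pass over a list of syllables. The lexicon entries and the Wylie transliteration are purely illustrative; pytib's real data lives in uncompound_lexicon.txt and works on Tibetan script:

```python
# Hedged sketch of maximal matching; the entries below are placeholders,
# not the real contents of uncompound_lexicon.txt.

LEXICON = {"bka' drin", "zhes", "bya", "ba", "che"}
MAX_SYLS = max(len(entry.split()) for entry in LEXICON)

def max_match(syllables):
    """Greedily take the longest run of syllables found in the lexicon."""
    tokens, i = [], 0
    while i < len(syllables):
        for j in range(min(len(syllables), i + MAX_SYLS), i, -1):
            candidate = " ".join(syllables[i:j])
            if candidate in LEXICON or j == i + 1:
                # fall back to the single syllable when nothing matches
                tokens.append(candidate)
                i = j
                break
    return tokens

# max_match(["bka'", "drin", "che", "ba"]) -> ["bka' drin", "che", "ba"]
```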

The most obvious example that comes to mind is the "pa" and "ba" particles. There are many cases where these should not be separated the way basis_segmentation separates them; think of phrases such as "zhes bya ba", or any verb followed by pa/ba. The performance problem surely comes from the fact that I loop over the segmented text as many times as there are rules. There must be a better way of doing it.
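One possible direction, sketched below under the assumption that each rule only looks at an adjacent pair of tokens, is to register all the rules and apply them in a single pass over the token list instead of one full pass per rule. The function names and the pa/ba condition (in Wylie, for readability) are illustrative, not pytib's actual code:

```python
# Illustrative single-pass rule application; rule names and the Wylie
# forms "pa"/"ba" are placeholders, not pytib's implementation.

def merge_pa_ba(prev, cur):
    """Re-attach a pa/ba particle to the preceding token, else return None."""
    return prev + " " + cur if cur in ("pa", "ba") else None

RULES = [merge_pa_ba]  # further (prev, cur) rules could be appended here

def apply_rules(tokens):
    out = []
    for tok in tokens:
        merged = None
        if out:
            for rule in RULES:
                merged = rule(out[-1], tok)
                if merged is not None:
                    out[-1] = merged
                    break
        if merged is None:
            out.append(tok)
    return out

# apply_rules(["zhes", "bya", "ba"]) -> ["zhes", "bya ba"]
```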

I think achieving satisfactory results in a single pass is impossible owing to the ambiguities we have to resolve, so we are bound to have at least two passes, and most probably more than two once we introduce things like shallow syntactic parsing and POS tagging to resolve as many syntactic ambiguities as possible in the segmentation.

But if you are ready to invest time into pytib, or even into creating something else that addresses the segmentation problem (I saw your "bokepy" repo), then based on my experience with the Lucene Analyzers I would suggest not taking a plain string as input, but a stream of characters. The idea would be to chop the stream into tokens and then have several modules modify the tokens thus produced. That seems to be in tune with how spaCy feeds data to all of the tools it implements, and it also corresponds to Lucene's tokenizers and token filters.
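A very rough Python sketch of that idea, using generators to stand in for Lucene-style token streams. Everything here (the tokenizer, the filter, the Wylie particle set) is a hypothetical illustration, not an existing pytib or Lucene API:

```python
from typing import Iterable, Iterator

TSHEG = "\u0f0b"          # Tibetan syllable delimiter
PARTICLES = {"pa", "ba"}  # Wylie placeholders, for readability only

def tokenize(chars: Iterable[str]) -> Iterator[str]:
    """Chop a character stream into syllable tokens at each tsheg."""
    buf = []
    for ch in chars:
        if ch == TSHEG:
            if buf:
                yield "".join(buf)
                buf = []
        else:
            buf.append(ch)
    if buf:
        yield "".join(buf)

def particle_filter(tokens: Iterator[str]) -> Iterator[str]:
    """Example token filter: glue a pa/ba particle onto the previous token."""
    prev = None
    for tok in tokens:
        if prev is not None and tok in PARTICLES:
            prev = prev + TSHEG + tok
        else:
            if prev is not None:
                yield prev
            prev = tok
    if prev is not None:
        yield prev

# Stages compose the way Lucene token filters do, e.g.:
# stream = particle_filter(tokenize(open("input.txt", encoding="utf-8").read()))
```

Because each stage is a lazy generator, further filters (normalization, compound merging, POS hints) could be slotted in without touching the tokenizer.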

An approach like this would make things modular and easy to improve and maintain, yet it would surely require rewriting everything, as pytib can't be adapted to operate on a stream of characters.

I think it would be more beneficial to build something new on sound foundations than to keep improving code that is not clean. If you would like to undertake such a project, I will be happy to share the experience I gained from building pytib, though I won't be able to commit a lot of time. My brother and I can also give you feedback as people fluent in Tibetan, if that is needed.