Esukhia / PyTib

Is there a manual / docs available or perhaps some use cases? #7

Open mikkokotila opened 6 years ago

mikkokotila commented 6 years ago

I checked through the repo, and it seems that there is no documentation. Did I miss something? Perhaps you could provide some simple use case examples in the README to give an idea of the usage paradigm you have implemented.

drupchen commented 6 years ago

There is no real documentation, as the project was not deemed mature enough and I didn't have enough time to write it. See here for simple use cases. More generally, I see two main use cases for the project: either segmenting a Tibetan text into words, allowing further NLP processing, or checking the spelling of a text by not introducing spaces between words but marking the syllables that are not found in the lexical resources. Of course, there will be cases where the words are simply missing from the lexicon, but the idea is then to add those entries to the lexical resources.

Basically, what pytib does is word segmentation: it applies a maximal matching algorithm in a first pass (basis_segmentation), then applies some custom rules to adjust that first segmentation (do_compound).
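For readers unfamiliar with maximal matching, here is a minimal greedy sketch of the idea (not pytib's actual code; the toy lexicon and the `*` marker for unknown syllables are purely illustrative):

```python
# Minimal greedy maximal-matching sketch; not pytib's implementation.
# The toy lexicon and the '*' marker for unknown syllables are illustrative only.

def max_match(syllables, lexicon, max_len=4):
    """At each position, match the longest sequence of syllables found in the lexicon."""
    words = []
    i = 0
    while i < len(syllables):
        match_end = None
        # try the longest candidate first, then shrink
        for j in range(min(len(syllables), i + max_len), i, -1):
            candidate = "་".join(syllables[i:j])
            if candidate in lexicon:
                match_end = j
                break
        if match_end is not None:
            words.append("་".join(syllables[i:match_end]))
            i = match_end
        else:
            # unknown syllable: keep it, marked, and move on
            words.append(syllables[i] + "*")
            i += 1
    return words

lexicon = {"བཀྲ་ཤིས", "བདེ་ལེགས"}                         # toy entries
print(max_match(["བཀྲ", "ཤིས", "བདེ", "ལེགས"], lexicon))  # ['བཀྲ་ཤིས', 'བདེ་ལེགས']
```

The second pass (do_compound) then adjusts the output of a pass like this one with custom rules.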

All the other classes are in support of the segmentation. Actually, the whole project is based on a model of the Tibetan syllable that comes from Élie Roux (see this comment). It makes it possible to correctly undo the affixation of case particles, and thus to find the unaffixed form of a word. Concretely, this model lives in this file.
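As a rough illustration of what undoing affixation means (this is not Élie Roux's model, only a toy; the particle list is an assumption and deliberately naive):

```python
# Toy illustration of stripping an affixed particle from a single syllable.
# The particle list below is an assumption and non-exhaustive; a naive suffix
# check like this cannot tell an affixed ས from a ས that belongs to the stem
# (e.g. ཆོས), which is exactly what a proper syllable model has to resolve.

AFFIXED_PARTICLES = ["འི", "འོ", "འང", "འམ", "ས", "ར"]

def unaffix(syllable):
    """Return (stem, particle) if an affixed particle is found, else (syllable, None)."""
    for particle in AFFIXED_PARTICLES:
        if syllable.endswith(particle) and len(syllable) > len(particle):
            return syllable[:-len(particle)], particle
    return syllable, None

print(unaffix("མིས"))   # ('མི', 'ས') : agentive ས stripped from the stem
```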

I could go on about the ideas underlying the project, but I don't know what you are interested in: simply using it, or understanding how it works under the hood.

mikkokotila commented 6 years ago

Thanks! :) I checked how the Tibetan Editor uses it and did something to that effect. I will follow the test script tomorrow for some more tests. I used the Rinchen Terdzo as raw text, and after the initial run I guessed that you are using MaxMatch (as short words are clearly prioritized over long ones). It seems that where you've really done tremendous work is on the second part, i.e. the rules and all the reference material related to that.

In terms of tokenization, my interest is in moving more towards mathematical approaches, as is the case with word embeddings and one-hot encoding, as opposed to corpus-based approaches. But at least for now, I'd be keen to see how what you have can be improved. Also, I'm generally interested in tokenization across all lexical dimensions, everything from the character to the sentence. I think that words and phrases will be optimally solved with the same solution.

In terms of MaxMatch, I think this could be worth exploring as an option:

https://github.com/alexvking/wordsegmentation

I'll try to put some time into testing it and getting to some reasonable result in Tibetan.

drupchen commented 6 years ago

Have a look at this tagged corpus and the approach they take with CRF.

You might find a great amount of tagged data with reasonable segmentation and POS tagging here: https://zenodo.org/record/823707#.WkwGt3Xia00. The raw data is here: https://zenodo.org/record/821218#.WkwG83Xia00

This might be used as a minimum threshold of quality for any new segmenting/tagging approach.
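For anyone wanting to try the CRF route on that data, here is a minimal sketch using the third-party sklearn-crfsuite package; the feature template and the B/I/E boundary labels are my assumptions for illustration, not the actual tagging scheme of that corpus:

```python
# Minimal CRF word-segmentation sketch (third-party package: sklearn-crfsuite).
# Features and B/I/E boundary labels are illustrative assumptions; a real setup
# would derive both from the tagged corpus linked above.
import sklearn_crfsuite

def syllable_features(syllables, i):
    """Simple contextual features for the i-th syllable."""
    return {
        "syl": syllables[i],
        "len": len(syllables[i]),
        "prev": syllables[i - 1] if i > 0 else "<BOS>",
        "next": syllables[i + 1] if i < len(syllables) - 1 else "<EOS>",
    }

def sent_to_features(syllables):
    return [syllable_features(syllables, i) for i in range(len(syllables))]

# toy training data: B = word-initial syllable, I = word-internal, E = word-final
train_sents = [["བཀྲ", "ཤིས", "བདེ", "ལེགས"]]
train_labels = [["B", "E", "B", "E"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit([sent_to_features(s) for s in train_sents], train_labels)
print(crf.predict([sent_to_features(["བཀྲ", "ཤིས"])]))
```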

mikkokotila commented 6 years ago

Sweet... just finished the run on the 64 volumes of the Rinchen Terdzo using Segment(). It took about 2 hours on a regular laptop. By the way, do you have a "stopwords" list, i.e. a list of words that do not add value to frequency tables? Sorry... this is going pretty far off topic.
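(For context on the stopwords question, the filtering itself is trivial once a list exists; here is a stdlib-only sketch with placeholder entries rather than a curated Tibetan list:)

```python
# Word-frequency table with stopword filtering (stdlib only).
# The stopword entries are placeholders, not a curated Tibetan list.
from collections import Counter

STOPWORDS = {"དང་", "ནི་", "ཀྱི་"}   # placeholder entries

def frequency_table(tokens):
    """Count tokens, skipping stopwords."""
    return Counter(t for t in tokens if t not in STOPWORDS)

print(frequency_table(["བཀྲ་ཤིས་", "དང་", "བཀྲ་ཤིས་"]).most_common(5))
```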

drupchen commented 6 years ago

As for the stopwords, I haven't looked into it. It looks like a pretty difficult problem unless we reach the level of parsing the syntax.

Actually, what takes so much time is the implementation of looking up potential words in the lexical resources: https://github.com/Esukhia/PyTib/blob/master/pytib/common.py#L285 (used in Segment)

Something that might drastically improve performance would be using something like a trie instead of a regular list with the bisect module.
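A minimal sketch of what such a trie over syllables could look like (not pytib code, just to illustrate the lookup it would replace):

```python
# Minimal syllable-level trie sketch as an alternative to list + bisect lookups.
# Not pytib code; illustrative only.

class Trie:
    def __init__(self):
        self.children = {}
        self.is_word = False

    def add(self, syllables):
        node = self
        for syl in syllables:
            node = node.children.setdefault(syl, Trie())
        node.is_word = True

    def longest_match(self, syllables, start=0):
        """End index of the longest lexicon word starting at `start`, or None."""
        node, end = self, None
        for i in range(start, len(syllables)):
            node = node.children.get(syllables[i])
            if node is None:
                break
            if node.is_word:
                end = i + 1
        return end

lexicon = Trie()
lexicon.add(["བཀྲ", "ཤིས"])
print(lexicon.longest_match(["བཀྲ", "ཤིས", "བདེ"]))   # 2
```

Each lookup then costs one dictionary access per syllable instead of a bisect over the whole word list, and the maximal match falls out of the traversal for free.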

Frankly speaking, I wouldn't rely too much on the quality of pytib for the whole Rinchen Terdzod, even if using the Tsikchen as the main lexical resource ensures coverage of most of the content of the Rinchen Terdzod.

As a matter of fact, since the approach in pytib is lexicon-based, all the cases that are not hard-coded won't be correctly processed. The opposite, rule-based approach would suffer from the opposite flaw: fewer cases of completely wrong processing, yet less accurate results.

The approach of Nathan Hill seems to perform better overall at the scale of the Rinchen Terdzod.

mikkokotila commented 6 years ago

I agree. But it's nice to see that in one afternoon I can get from nothing to a result, thanks to your work in this field :)

Beyond just word tokenization, I think some of the interesting things to look at are:

Ideally a few projects would pick the same one, and then start contributing to it within the Tibetan context.

I think the most interesting is the stuff that is not invented yet and not talked about yet. Up until the past two years or so, NLP has been an incredibly stagnant field in comparison to, say, imaging; the recent progress is mostly to do with vectorization / embeddings.

Is there a news group / forum / board where the community discusses / shares on Tibetan computational linguistics?

drupchen commented 6 years ago

spacy is an option we have been considering for a long time. We will probably adopt it, integrating pytib as the tokenizer for the moment.

But unless we have a proper tokenizer, the whole pipeline of NLP treatments is impossible, so I would rather concentrate on a proper tokenizer in the first place. The main problem for any ML-based approach is the need for a significant training corpus. Sanskrit NLP suffers from the same problem: ML seems to be the only viable solution, yet there is no sufficiently comprehensive training data.

Our main idea has been that segmentation is the basis of everything else, and the foundation of segmentation is the correct analysis of syllables in order to correctly undo case particle affixation. I feel the results from the syllable model we use are satisfactory, yet the segmentation needs to be greatly improved. From a linguistic perspective, the problem with Tibetan is that a proper segmentation requires syntactic disambiguation, yet syntactic parsing can only operate on a correctly segmented text... From this point of view, phrases and words need completely different processing. So any idea or new approach solving some of these problems will be warmly welcomed!

I don't know of a place to discuss Tibetan NLP...

Also note that the pytib version living in tibetaneditor has modifications and options that I have not (yet) integrated in this repo.

Parts of this article I wrote might give you an overall understanding of pytib.

eroux commented 6 years ago

@mikkokotila you may be interested in what has been done in lucene-bo: although it's Lucene-specific, it uses a Trie to perform maxmatch pretty efficiently and has a list of stopwords. Glad to see you're interested in Tibetan NLP!

ngawangtrinley commented 6 years ago

@mikkokotila Nice to see an ML guy that's also into Tibetan NLP. Out of curiosity, what are your areas of work/interest or what is it you're trying to achieve with Tibetan texts?

mikkokotila commented 6 years ago

Thanks @eroux, I will surely look into it :)

@ngawangtrinley :) Generally, as a computational linguistics researcher my interest has been in language agnosticism and in moving completely away from approaches where human conventions (either rules or actual language, such as a corpus) are used, towards, let's say, pure mathematical abstractions where human conventions play a role, but not specifically in the context of language. I don't think this is just a linguistics problem; right now it's always people building models. I think we're on the verge of an era where models build models. Then we get surprising results, results that are beyond what we can tell the machine to do. I'll try to publish a short post with some thoughts soon.

Strictly in the Tibetan context, I think it would be really important if there were a workbench of some sort that would meet a wide range of needs, from translators and language learners to scholars and researchers, and then to people like us who are also focused on the technology-building side. Something that allows rapid prototyping and a very high-level starting point for new initiatives, where everything that has been done so far by people like yourselves comes together in one suite that is extremely user-friendly, embraces to a reasonable extent all that is good about open source, adheres to strict coding and contribution guidelines, etc. Set the stage for the next decade (or few), so to speak. I think that is missing. I think that is quite interesting.

mikkokotila commented 6 years ago

@drupchen I think that ambiguity needs to be embraced. Not sure what that means as of yet, though. If you look at 10 different dictionaries for a given word, there is a great degree of variation in how that word is explained. Language is necessarily underpinned by ambiguity, which goes far beyond syntactic dimensions. Think about the word "water"... just by yourself, write down 3 words to describe it. Then go and ask 100 people to do the same, collect all their inputs and summarize. How many words do you think you have? Not 300, but probably nowhere near 100 either. Language lacks both criteria of objectivity: participants do not share precisely the same understanding of what is being transmitted (in contrast to arithmetic, where 1 is always 1 and will still be so thousands of years from now), and the instruments are not precise (in contrast to a thermometer, where if we measure the same thing we get the same result), and so forth. The issue with the NLP field seems to be that it assumes language is somehow objective, like arithmetic or chemistry for example, which leads to a lot of really rigid approaches that become incredibly painful to execute.

I think that over the past few years I have started to get a better idea of the problem (of NLP), but I'm sorry that I don't have much in terms of answers yet.

As for the practical matters at hand, I agree with you: a tokenizer is a fundamental feature in any scenario (at least for the foreseeable future), so solving that "once and for all" should definitely be a priority.

drupchen commented 6 years ago

@mikkokotila What you say about language reminds me of this approach. Not only is it a linguistic theory, but Rastier was thinking mainly about the internet and computers while constructing his entirely dynamic model of language. In short, everything is a question of "features" (to keep the ML vocabulary) in Rastier's approach, and there are no clear-cut levels of analysis as found in traditional linguistics.

I think that one of the reasons why this system has a hard time getting out of the academic research context and into something actually useful in real life is that there are way too many parameters and interactions at play in the model. Maybe ML could be applied to it with some success??? (Rastier is quite arduous to read, but believe me, it is worth the trouble.)

mikkokotila commented 6 years ago

@drupchen thanks, that seems interesting! :)

I was thinking that because language lacks objectivity, but we do want objectivity at least in tokenization/segmentation, the "language" used to create the rules first needs to be objective. Which goes back to the point I made about mathematical rules as opposed to linguistic ones. Maybe the most obvious example to make this point clear is the length of the syllable. Clearly, that's not going to say much on its own, but it's one solid objective signal; one version is the number of bytes, and another is the number of graphemes. Again, these two are not going to say much by themselves, but they are a start towards an objective signal taxonomy, in the sense of an example of what I'm looking for.

Coming from the field of practical application of the signals intelligence method to a wide range of problems, it has become obvious that practical problems are generally practically solvable with this method, but you need a lot of signals. With a lot of signals, very small test sets can do the job. I mean very small, maybe hundreds of entities... in this case sentences or other text fragments that are already segmented. Then an exhaustive global grid search needs to be performed where each signal is used as a single feature in the model. I think the relevant parameter space of the hyperparameter optimization problem is, in this case, very large, but maybe just weeks to months of computing. There are more sophisticated ways to do the same job, but they don't quite come with the analytical benefits of a global grid search within wide boundaries of ~20 model parameters. Also, a range of models should be tested, from fastest to slowest, because sometimes an LSTM does not work any better than an MLP does... it depends totally on the problem and the signals you have.

It seems that the most obvious signal is "the first character of a word", i.e. the truth set will have true or false for every byte and every grapheme. Then the second, third and so forth, and the same for the last.
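To make the "objective signals" idea a bit more concrete, here is a stdlib-only sketch of turning a tiny hand-segmented fragment into per-character signals and "first character of a word" labels; the feature choices and the toy fragment are my assumptions, not a worked-out taxonomy:

```python
# Sketch: per-character signals + boundary labels from a hand-segmented fragment.
# Feature choices (byte length, code-point position) are illustrative only;
# grapheme-cluster counts would need e.g. the third-party `regex` module's \X.

def char_signals(text):
    """Per-character signals computed without any linguistic resources."""
    return [
        {
            "char": ch,
            "byte_len": len(ch.encode("utf-8")),   # length in bytes
            "pos": i,                              # absolute code-point position
        }
        for i, ch in enumerate(text)
    ]

def boundary_labels(words):
    """True for the first character of each word, False otherwise."""
    labels = []
    for word in words:
        labels.extend([True] + [False] * (len(word) - 1))
    return labels

segmented = ["བཀྲ་ཤིས་", "བདེ་ལེགས།"]      # toy hand-segmented fragment
X = char_signals("".join(segmented))
y = boundary_labels(segmented)
print(list(zip([s["char"] for s in X], y))[:6])
```

From (X, y) pairs like these, any of the models mentioned above (MLP, LSTM, ...) could be trained, and the signal set grown through the grid search described.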

Yes, there is a need for a training set, but with the right model it can be so small that for a big translation project or similar it can easily be created at the beginning of each project. There is never a need for any generic corpus. Also, because the texts are static (the Rinchen Terdzo will always be the Rinchen Terdzo, Ati-yoga Ati-yoga, Maha-yoga Maha-yoga, and so forth... we are only dealing with ancient texts), there is no need for retraining, which seems to be one of the key issues in deep learning.

I have no idea if this will work, but it sounds whacky enough that it just might.