OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

General Roadmap Discussion #6

Closed mikkokotila closed 5 years ago

mikkokotila commented 6 years ago

I now have a somewhat better understanding of the paradigm you are working with. It seems that tokenization performance is now much better and the code is much cleaner. Excellent work!

Regarding the preprocessing, I did a simple test with a single, short, made-up chunk of text:

'འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་'

Here are some results:

%timeit -n1 pre_processed = PyBoTextChunks(text)
4.51 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit tokens = tok.tokenize(pre_processed)
124 µs ± 5.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit tagged = ['"{}"'.format(w.content) for w in tokens]
5.2 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

As is apparent, pre-processing currently takes most of the time. This seems OK, as spaCy, for example, takes several times longer to create a spaCy document (which I understand involves doing many more things).

Because each of these steps scales linearly (i.e. problems will arise with bigger sets of text), I was wondering if you had looked into Cython. That's what a lot of these tools end up using, as it gives pure-C speeds in many cases without changing the code much. Yesterday I tried the Tokenizer by simply compiling it with Cython without changing the code at all, and it became 20-25% faster. Changing just one line of code (declaring the idx variable as an int for Cython) gave another 5% increase in performance. Where the real power lies, though, is in actually building for Cython, where it's not uncommon to get 10-50x performance gains. You guys seem to have a good skill level in programming, so it would be very smooth for you to move to Cython at this early stage (later it would be more painful, of course). Another, simpler performance improvement would be to use NumPy arrays instead of lists.
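For reference, compiling an existing module as-is can be done with a minimal setup script along the lines of the sketch below (the tokenizer.pyx file name is hypothetical, it's just the existing .py file renamed; this assumes Cython is installed):

    # setup.py: minimal sketch for compiling an existing pure-Python module with Cython.
    # Build in place with:  python setup.py build_ext --inplace
    from setuptools import setup
    from Cython.Build import cythonize

    setup(
        ext_modules=cythonize(
            "tokenizer.pyx",  # hypothetical: the existing tokenizer module renamed to .pyx
            compiler_directives={"language_level": "3"},
        ),
    )

Typed declarations such as cdef int idx can then be added inside the .pyx file one at a time, so the code stays readable.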

What do you think?

drupchen commented 6 years ago

I would love to see a fork of pybo with all the above mentioned optimizations, but I am afraid I won't have much time to delve into the intricacies of optimization.

Thank you for your compliment about the programming skills, but as a linguist, I prefer to stay on the side of trying to process languages the best possible way instead of being a real programmer who tries to make things as machine-friendly as possible.

My requirement is that my brother and I continue to have complete control over what is happening in the code, in order to be able to maintain it and add new functionality in the long run.

But it looks like it will be pretty simple for you to make those optimizations, so please, go ahead! We will definitely use it instead of the original pybo!

If, as you say, it doesn't change much in the implementation and only a couple of lines are changed, I will include the optimizations in the main repo and continue from there. It is true that my programming skills have improved since the time I wrote pytib, but I don't feel at ease at all when looking at the Cython-oriented code of spaCy, for example.

drupchen commented 6 years ago

Another thing to take into consideration is that the implemented functionalities won't change much from what you already see. Soon the code will be stabilized, and we won't be drastically modifying it. The next logical step to improve on what we do would be using ML models for the same job (like spaCy or other NLP libraries), so pybo won't be needed anymore...

We will want to modify the segmentation provided by pybo, but we plan to operate on the resulting Token objects using matchers that will trigger a split or a merge operation instead of modifying BoTokenizer.

mikkokotila commented 6 years ago

OK, got it. If you can share more about your ML aspirations, I may be able to contribute towards that end. As a starting point, can you elaborate on "use the spaCy API to make the conversion" (which I read in the PyPI package description)?

Regarding a simple Cython implementation, that should be pretty straightforward to do in a way where you guys continue to have full control, i.e. there is not (much) added complexity in the existing code.

drupchen commented 6 years ago

I said there wouldn't be any drastic modifications, and indeed there weren't many as far as the tokenizer is concerned, but there are lots of new features required by our first use-case to make pybo useful: the tibetaneditor.

You must have a much better idea of what can and should be done with ML. I don't have any clear idea of what should be done, except that one day we should no longer be limited by the current rule-based approach, which does not deal well with the ambiguities at every one of the different levels.

As far as using the spaCy API, I was trying to get my head around how to go from my list of Token objects to a Doc in spaCy. I had found that one could manually fill the vocabulary container and that it was possible to plug an external tokenizer into the NLP pipeline(s) that spaCy allows you to build. I believe you are already a user of spaCy, so that could be something you might want/be able to do?
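Something like the sketch below is what I had found in the spaCy docs: a custom tokenizer is just a callable that returns a Doc built from pre-tokenized words (the pybo_tokenize function here is a hypothetical stand-in for pybo's output, and I haven't actually tried this):

    import spacy
    from spacy.tokens import Doc

    def pybo_tokenize(text):
        # hypothetical stand-in: this would return the surface forms produced by pybo
        return [w for w in text.split("་") if w]

    class PyboTokenizer:
        def __init__(self, vocab):
            self.vocab = vocab

        def __call__(self, text):
            words = pybo_tokenize(text)
            # spaces says whether each token is followed by whitespace
            return Doc(self.vocab, words=words, spaces=[False] * len(words))

    nlp = spacy.blank("xx")                  # a blank multi-language pipeline
    nlp.tokenizer = PyboTokenizer(nlp.vocab)
    doc = nlp("འཇམ་དཔལ་གཞོན་ནུ")
    print([t.text for t in doc])

Constructing the Doc this way also fills the vocabulary container with the new strings, as far as I understand.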

Sounds great that implementing Cython should be straightforward!

mikkokotila commented 6 years ago

I think in terms of next steps, the question would be: what do you think will be most useful for the community at the moment? spaCy involves a lot of pain, is not very fast, and is considered by some to be "outdated". I think the best way would be to identify a clear roadmap in terms of the community's needs (translators most importantly, IMO) and then build based on that. Rather than thinking about features or capabilities based on the ideas of the past two decades in computational linguistics, we could just think about the exact needs of the end users. I think Tibet Editor is a great example, where improvements in pybo immediately benefit a large number of practitioners. So maybe the first question is even: "what do you guys need in Tibet Editor that you can't do right now with pybo?" The other question I would have is more generally about translators and researchers: which features could they not live without, if they had them?

drupchen commented 6 years ago

We need to think about it. I am sure @ngawangtrinley will have a long wish list and an idea of what would be the most helpful at this point.

ngawangtrinley commented 6 years ago

@mikkokotila thanks for raising the existential question for pybo. Esukhia's primary goal is to improve the quality of both translations and translators, and the tools we're hoping to create, pybo included, definitely go in that direction. We're looking at 4 broad areas: 1. source texts, 2. translators' training, 3. tools for translators, 4. buddhist encyclopedia 2.0.

  1. source texts have to be i. digitized, ii. proofread, iii. linked with the originals (images), iv. linked to other relevant texts like commentaries etc:

    1. OCR like Namsel or google, online buddhist text input tool such as ?
    2. spellchecker (pybo + spellchecker + symspell + bopho)
    3. tbrc.org and others, nalanda.works
    4. BDRC's BUDA project
  2. for the most part, training translators is teaching languages; to do it effectively, teaching material has to be graded, which in turn is done with things such as headword lists and grading tools. After compiling a corpus, we have to i. clean up, ii. tokenize and annotate data, iii. create headwords, iv. simplify text.

    1. pybo + some normalization tool
    2. pybo + tibetan editor
    3. scripts that need further development
    4. tibetan-editor + syntax simplifier
  3. we're preparing the ground for CAT tools such as smartcat.io, and tools to do text analysis like sketchengine or linguee. The foundation of all this is TMs (bitext, aligned text). To get there we have to i. find sentences, ii. align texts at sentence and word level (tib-tib and tib-translation), iii. extract recurring patterns, iv. extract term pairs.

    1. rule induction with orange?
    2. giza++? seq2seq? pytorch?
    3. ngrams, skipgrams, flexgrams with colibri
    4. biterm extraction with sketchengine
  4. translators have to be more efficient at looking for information in buddhist texts; we're hoping to virtually replace dictionaries by using the Tibetan buddhist body of literature itself as an encyclopedia. You should be able to ask for things like: all the commentaries on the 4th verse of the 9th chapter of the Bodhicharyavatara; or any debate about the definition of the word "buddha", by direct disciples of the 15th karmapa, in the context of madhyamika. This roughly involves getting i. text metadata, ii. text content annotation, iii. user-created metadata, iv. connecting all of it and making it available.

    1. bibliographical info from tbrc.org or cbeta, text classification to add topics
    2. semi-automated annotation with NER, plagiarism detector to find and source citations
    3. equivalent of paper3 maybe at the end of BUDA?
    4. linked data with the BUDA project, word2vec to build synonyms/related words/wordnet to boost search relevance

The reason we're looking at spaCy is mainly that we're hoping to replace pybo's rule-based tagger with the perceptron or a better tagger, and to use Prodigy. We need Prodigy to create more training data for both POS and the parser; and of course for NER, we want to get started with tagging things like definitions, explanations, refutations, etc. The semi-automated loop Prodigy offers looks like the best way to do this stuff atm. I'm curious about your comment on spaCy; are you referring to machine learning?

mikkokotila commented 6 years ago

How wonderful. Thank you for sharing. I will digest and come back. By the way, I'm trying colibri; it looks very interesting. Having a robust, high-performance way of doing all kinds of grams is a powerful enabler for many other aspirations. I will report back once I know more. It's getting late here, but tomorrow I will probably try to use the pybo-tokenized Bodhisattvacharyavatara as a corpus and see where this can go.

Another quick comment: for building synonyms, I might have the right kind of contributors to work on something novel. I did some experiments and it looked good enough to continue. Something based more on mathematical abstraction and less on meaning, or preferably not on meaning at all.

Think of a pipeline in terms of: tokenize with pybo > get grams from colibri > a novel approach that focuses on unlocking very subtle properties in the way grams relate to each other. For example, in signals intelligence, higher-order derivatives are used to unlock information closer to randomness. Information that is very subtle, and which somehow handles a lot of the ambiguity that is apparent (and seems like a problem) further away from randomness (in more human-made order), can yield incredibly useful results.
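As a trivial illustration of the first two stages (tokenize, then count grams), ignoring colibri for a moment and using plain Python (the token list is a made-up placeholder for pybo output):

    from collections import Counter

    def ngrams(tokens, n):
        # all contiguous n-grams from a list of token strings
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # made-up placeholder for pybo output
    tokens = ["འཇམ་དཔལ་", "གཞོན་ནུ", "ར་", "གྱུར་པ་", "ལ་", "ཕྱག་"]

    bigram_counts = Counter(ngrams(tokens, 2))
    print(bigram_counts.most_common(3))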

In short, I don't think anything out there is useful out-of-the-box for Tibetan. word2vec is not that great for English either; look at sense2vec from the same people, which works very well for English. This is a very interesting problem; I think if the scope is right, the right contributors will appear.

Thanks again for a very clear outline. Will come back to this later. I'll create a separate issue for grams once I have something to share on colibri.

ngawangtrinley commented 6 years ago

@mikkokotila @eroux A good current example of the kind of problems we need to solve is assigning topics to the roughly 400,000 text titles we have at tbrc.org. We have titles for all of them, and for some of them we have topic labels, other types of metadata, and sometimes even the full text. It should be possible to use some kind of classifier to do this; what would you advise?

I guess my real question is: do you have a solution for that with autonomio? ;)

mikkokotila commented 6 years ago

@ngawangtrinley Yes, this is a classic topic modeling problem. Where can I access the said 400,000 titles? I'll do some tests to see whether the title alone will be enough or not. It would be best if it could be done that way, as the title will always be there for every text.

mikkokotila commented 6 years ago

@ngawangtrinley Specifically regarding autonomio and deep learning, it might be worthwhile to take the topic labels you already have (assuming they were assigned manually) and see what kind of model could be created using them as a training set. That seems like a more interesting approach. Generally speaking, deep learning models perform very well on text classification tasks when the training set is of good quality. So maybe ignore the previous message about topic modeling and we'll try a few NN approaches first.

I think a good starting point would be to take these labels > https://www.tbrc.org/#!subjects

Basically I'd try to create a binary classification model for each label separately, and then have a managerial overlay that handles disputes in some meaningful way. Or maybe train a model just for the dispute handling...
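Just to illustrate the pattern (not the actual models, which would be neural networks; here I'm using a simple scikit-learn classifier on made-up toy data to keep the sketch short):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # made-up toy data: transliterated titles and their topic labels
    titles = ["bka' 'gyur dkar chag", "sngags kyi rgyud",
              "sman dpyad gso rig", "rgyud sde spyi'i rnam bzhag"]
    labels = ["T1", "T3", "T5", "T3"]

    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = vectorizer.fit_transform(titles)

    # one binary classifier per label
    models = {}
    for label in set(labels):
        y = np.array([1 if l == label else 0 for l in labels])
        models[label] = LogisticRegression().fit(X, y)

    # the "managerial overlay": pick the label whose model is most confident
    def classify(title):
        x = vectorizer.transform([title])
        scores = {label: m.predict_proba(x)[0, 1] for label, m in models.items()}
        return max(scores, key=scores.get)

    print(classify("rgyud kyi dkar chag"))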

eroux commented 6 years ago

That sounds very exciting! It's going to take some time to extract the titles -> topics correspondence, but I'll try to do that in the next few days.

ngawangtrinley commented 6 years ago

@mikkokotila, @eroux is the lead developer of BUDA; he's working on creating linked data for all (yes, we mean it!) buddhist literature, starting from Tibetan (the tbrc.org dataset), Chinese (CBETA and other sources), Pali, and Sanskrit. This more or less corresponds to Esukhia's 4th dev goal, so we're all trying to push the field in that direction. My guess is that a lot of the tasks in the project are similar to common tasks in the normal world, so we should be able to improve the work with ML. We're all quite new to the field, so it's mainly guesswork for now, and any input or suggestions are welcome!

eroux commented 6 years ago

here's a zip of all the titles, with the bcp47 string of the language at the end: http://eroux.fr/tmptitles.txt.zip

mikkokotila commented 6 years ago

How wonderful! Thank you very much @eroux. The first question is about the classification task itself. Consider the frequency table of the BCP 47 strings:

345045 bo-x-ewts
23613 sa-x-ndia
10896 pi-x-iast
4258 zh-hans
2614 zh-latn-pinyin-x-ndia
2526 en-x-mixed
2499 en
 170 zh-x-wade
 125 sa-alalc97
 124 bo-alalc97
  95 mn-alalc97
  89 zh-latn-pinyin
  76 zh-hant
  36 sa-x-ewts
  17 sa-Deva
  11 bo-x-ndia

Note that some are intentionally missing. It seems that this is all language related? Is there a topical tag of some sort, as that would be needed for training and validation? Or did I miss something? If a few thousand had a topical tag, that would be great.

Very nice catalogue, by the way (~400k titles).

mikkokotila commented 6 years ago

ah my bad, I think I found it...

it's these guys:

1717 dkar chag
 349 mixed texts
 295 dang po/_gleng gzhi'i le'u/
 156 sngon brjod/
 136 dpe skrun gsal bshad/
 135 unidentified
 134 contents
 125 par byang smon tshig
 125 gleng gzhi/

eroux commented 6 years ago

The lang tags are just the language tag of each string; our strings are RDF, so they have lang tags, that's all. The topics will be a bit more difficult to extract, but I'll give it a try.

eroux commented 6 years ago

Well, it turns out it was actually fairly easy... http://eroux.fr/titlestopics.zip is composed of two files: one is the property :workIsAbout, which is supposed to indicate the subject, and the other is :workGenre, which is supposed to indicate the genre. Now, the data has been input by Tibetan people, and it seems this distinction is completely foreign to Tibetan culture and cannot really be expressed in the Tibetan language, so it's kind of a strange mixture... When a topic starts with P it's a person, and when it starts with G it's a place; otherwise it's a regular topic. I'm not sure great things will come out of this not-so-great data, but maybe the quantity of data will make it work?

mikkokotila commented 6 years ago

OK great, thanks. Is there a lookup table for the tags? Or something that helps explain how I could group them into high-level categories myself.

eroux commented 6 years ago

We have a taxonomy of the topics, I uploaded it on http://eroux.fr/topicstaxonomy.zip

mikkokotila commented 6 years ago

OK great, this will be a very good experiment. It will be interesting to see what happens.

eroux commented 6 years ago

great! tell us how it goes!

mikkokotila commented 6 years ago

To start, I've structured this as a simple category labeling experiment, where the dataset is limited to those observations (titles) whose category starts with T and has a single digit. This leaves us with around 12k observations, so it is a good starting point to explore the potential. It leaves out some complexities that would actually be there, most importantly observations that fall outside of these classes.
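For reference, the selection itself is just a filter over the topic codes, roughly like this (the records here are made up; the real data comes from the titles/topics files):

    import re

    # made-up records standing in for (title, topic code) pairs from the real files
    records = [
        ("bka' 'gyur dkar chag", "T1"),
        ("rgyal rabs gsal ba'i me long", "T12"),   # excluded: two digits
        ("sman dpyad gso rig", "T5"),
        ("mi la ras pa'i rnam thar", "P123"),      # excluded: a person, not a T topic
    ]

    single_digit_t = re.compile(r"^T\d$")
    subset = [(title, code) for title, code in records if single_digit_t.match(code)]
    print(subset)   # keeps only the T1 and T5 rows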

To perform the experiment, I've built a simple neural network using the Keras Embedding layer (which basically does a form of vectorization), and then more or less a plain-vanilla MLP with categorical_crossentropy as the loss and stochastic gradient descent as the optimizer. So far so good...
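For illustration, the network is along the lines of the sketch below (the vocabulary size and layer sizes are placeholders, not the values I'm actually using, and I'm averaging the embeddings here just to keep the sketch short):

    from keras.models import Sequential
    from keras.layers import Embedding, GlobalAveragePooling1D, Dense

    vocab_size = 20000   # placeholder: size of the title vocabulary
    n_classes = 8        # the single-digit T categories kept in the experiment

    model = Sequential([
        Embedding(vocab_size, 32),        # learned word-level vectorization
        GlobalAveragePooling1D(),         # collapse the word vectors into one title vector
        Dense(64, activation="relu"),     # plain-vanilla MLP layer
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])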

This is after 500 epochs. The plot below makes this look promising, as usually what you get is validation going all over the place, over-training, a plateau, etc. But here we have an almost identical trend to the training (sorry, the plot title is confusing). Generally, when you see people boast high scores, those scores do not correlate with actual prediction capability.

I'm letting it run for 2,000 epochs now, so we'll find out where the above trend takes us. I'll also try a convolutional version of the same network configuration. Usually these things take time, maybe a long time, but the beginning is promising :)

eroux commented 6 years ago

What do you mean by "the category starts with T and has a single digit"? Do you mean just T1 to T9? Otherwise this looks very promising indeed!

mikkokotila commented 6 years ago

@eroux ah sorry, I forgot this. So basically yes, for the initial experiment I'm just taking those that start with T (a few start with P) and have a single digit only. But T2 is not very common, so I dropped it. So the initial problem is classifying into 8 different classes; then we'll see the results and think more. It took about a week to put together some old ideas and write the code I needed to do this systematically, but that's done as of pretty much just a few moments ago... so tomorrow I'll start running that code and we'll see where we get. I'll keep you posted, and we can discuss the most useful next steps then.

mikkokotila commented 6 years ago

As far as I can see, we're getting a very good result just using Keras embeddings, so I'm going to continue with that. I tried gensim vectors as well, but I need to think more about the sent2vec implementation I created. There are some papers on how to use word2vec vectors to create sentence vectors, and it seems that the "best approach" depends on the prediction problem.
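One simple baseline from those papers is just averaging the word vectors, along these lines (toy corpus, not the real data; whether averaging is actually the best approach is exactly what I need to think more about):

    import numpy as np
    from gensim.models import Word2Vec

    # toy corpus of tokenized titles, just to make the sketch runnable
    corpus = [["bka'", "'gyur", "dkar", "chag"], ["sman", "dpyad", "gso", "rig"]]
    w2v = Word2Vec(sentences=corpus, vector_size=50, min_count=1)

    def sent2vec(tokens, model):
        # average the vectors of the tokens that are in the vocabulary
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    print(sent2vec(["dkar", "chag"], w2v).shape)   # (50,)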

The dataset I'm working with is 13k items (out of the 76k total in the file), so each permutation takes some time, as I'm running this on a GPU that cost less than $250. Within the parameter space I'm looking at now, the scan seems to consistently yield >90% validation accuracy, and I'm also checking F1, which is > 0.9, which is nice indeed. At this point I'd say that this is doable.

Next step is to bring this initial experiment to some kind of closure, write a small report about it, share it with you, and then we can discuss the next steps. But I want to run a hyperparameter scan for about 1 week first. I should be able to go through a few thousand permutations in that time, and learn more about what kind of things work with this type of prediction challenge (using transliterated Tibetan).

The nice thing here is that, as you might know, Keras "embeddings" are pure geometry; there is nothing that remotely resembles linguistics. It's just pure math, and pretty simple math for that matter. There are two basic approaches, character level and word level; I'm doing the word level. For the word level there is the option of using grams, which I'm not using. Beyond the Embedding layer I'm still not using anything else. I think later we can try Conv1D, SimpleRNN and maybe even LSTM, given that there are only so many titles that ever need to be classified and new ones are not coming (so retraining should not be an issue). LSTM especially adds a lot to the time.

It may be good to connect through a Skype call or similar and discuss more. It would be helpful for me to understand more about your project.

drupchen commented 5 years ago

Closing this issue as the discussion is inactive.