JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Add statistical tokenization algorithms #44

Closed Ayushk4 closed 4 years ago

Ayushk4 commented 4 years ago

BERT and related models have been using statistical tokenization algorithms, which handle out-of-vocabulary words well with ML models. High-speed implementations of BPE, WordPiece, etc. would be good additions to the package.

Some nice work has been going on here.

Thanks to @ninjin for the suggestion.

oxinabox commented 4 years ago

This would be great, yes.

tejasvaidhyadev commented 4 years ago

Any suggestions on how to implement this in pure Julia? I tried wrapping sentencepiece, which works fine, but I think that degrades performance. @Ayushk4 and @oxinabox

tejasvaidhyadev commented 4 years ago

I think we can start with the unigram language model: a subword segmentation algorithm based on a unigram LM, which is capable of outputting multiple subword segmentations with probabilities (from the paper by Kudo). We can use all the tokenizers from WordTokenizers as pre-tokenizers. The unigram language model can be used to solve the out-of-vocab problem and also to select the most probable segmentation among multiple encodings. BytePairEncoding.jl is already implemented in Julia.
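Kudo's unigram model scores each candidate segmentation by the product of its subword probabilities, and the best segmentation can be recovered with Viterbi decoding over the word. Below is a minimal sketch in plain Julia, assuming a precomputed `logprobs` vocabulary of subword log-probabilities; the function name and arguments are illustrative, not an existing WordTokenizers API.

```julia
# Illustrative sketch only: Viterbi decoding of the best subword segmentation
# under a unigram language model, given precomputed log-probabilities.
function viterbi_segment(word::AbstractString, logprobs::Dict{String,Float64};
                         max_piece_len::Int = 20)
    chars = collect(word)
    n = length(chars)
    best = fill(-Inf, n + 1)       # best[i+1]: best log-prob of segmenting chars[1:i]
    best[1] = 0.0
    backptr = zeros(Int, n + 1)    # start index of the last piece on the best path

    for i in 1:n, j in max(1, i - max_piece_len + 1):i
        piece = String(chars[j:i])
        score = best[j] + get(logprobs, piece, -Inf)
        if score > best[i + 1]
            best[i + 1] = score
            backptr[i + 1] = j
        end
    end

    best[n + 1] == -Inf && return [String(word)]  # no segmentation found; keep the word whole

    pieces = String[]                             # walk back-pointers to recover the pieces
    i = n + 1
    while i > 1
        j = backptr[i]
        pushfirst!(pieces, String(chars[j:i - 1]))
        i = j
    end
    return pieces
end

# Toy example:
# logprobs = Dict("un" => -2.0, "related" => -3.0, "re" => -2.5, "lated" => -4.0)
# viterbi_segment("unrelated", logprobs)   # => ["un", "related"]
```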

tejasvaidhyadev commented 4 years ago

Hi @Ayushk4 and @oxinabox, as mentioned above, I am planning to add a unigram language model based subword segmentation algorithm.

APIs

oxinabox commented 4 years ago

@Ayushk4 are you able to consider and discuss this?

Ayushk4 commented 4 years ago

@oxinabox Yeah, sure, I can discuss this.

However, I will have to revisit statistical tokenizers before I can comment. I am a bit busy over the weekend, so @tejasvaidhyadev, it could take me a couple of days to get back to you on this.

Ayushk4 commented 4 years ago

Hi, sorry for the delay.

We are going to need byte pair encoding as a part of this. There is a package, BytePairEncoding.jl, for this, but it has WordTokenizers.jl as a dependency.

If we use that as a dependency, would this circular dependency be a problem?

If we choose to implement our own BPE, we can also experiment with our fast TokenBuffer API and see if we get better performance. Otherwise we can use the BytePairEncoding.jl package for BPE.

@oxinabox what would you suggest in this case?

oxinabox commented 4 years ago

Circular dependencies are indeed a problem.

Good question.

We could always make a new package, StatisticalTokenizers.jl, depending on both WordTokenizers.jl and BytePairEncoding.jl, at least to experiment.


How hard would implementing BPE be? I don't think it would be that hard, but I am not sure.

Ayushk4 commented 4 years ago

I don't think implementing BPE will be hard.
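For a sense of the effort: the core of BPE training is just repeatedly counting adjacent symbol pairs and merging the most frequent one. A rough sketch in Julia (illustrative only; `learn_bpe` is a made-up name, and this is not how BytePairEncoding.jl is implemented):

```julia
# Illustrative sketch only: learn BPE merges from a word-frequency table.
function learn_bpe(word_freqs::Dict{String,Int}, num_merges::Int)
    # Represent each word as a vector of symbols, initially single characters.
    corpus = Dict([string(c) for c in word] => f for (word, f) in word_freqs)
    merges = Tuple{String,String}[]

    for _ in 1:num_merges
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Dict{Tuple{String,String},Int}()
        for (syms, f) in corpus, k in 1:length(syms) - 1
            pair = (syms[k], syms[k + 1])
            pair_counts[pair] = get(pair_counts, pair, 0) + f
        end
        isempty(pair_counts) && break

        # Find the most frequent pair.
        best = first(pair_counts)[1]
        for (p, c) in pair_counts
            c > pair_counts[best] && (best = p)
        end
        push!(merges, best)

        # Apply the merge throughout the corpus.
        new_corpus = Dict{Vector{String},Int}()
        for (syms, f) in corpus
            merged = String[]
            k = 1
            while k <= length(syms)
                if k < length(syms) && (syms[k], syms[k + 1]) == best
                    push!(merged, syms[k] * syms[k + 1])
                    k += 2
                else
                    push!(merged, syms[k])
                    k += 1
                end
            end
            new_corpus[merged] = get(new_corpus, merged, 0) + f
        end
        corpus = new_corpus
    end
    return merges
end

# learn_bpe(Dict("lower" => 5, "newest" => 6, "widest" => 3), 10)
```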

tejasvaidhyadev commented 4 years ago

Hi @oxinabox, I have completed the SentencePiece processor in Julia. Here is my blog post with the implementation details. For now, it only supports the SentencePiece processor.

oxinabox commented 4 years ago

You should talk to @Ayushk4

Ayushk4 commented 4 years ago

@tejasvaidhyadev, I am looking at the code for SentencePiece. I will get back to you in a few hours.

Ayushk4 commented 4 years ago

@tejasvaidhyadev What is the difference between the 2 files here? https://gist.github.com/tejasvaidhyadev/21a092ff3fe1f2c146a60af44b9519c1

tejasvaidhyadev commented 4 years ago

> @tejasvaidhyadev What is the difference between the 2 files here? https://gist.github.com/tejasvaidhyadev/21a092ff3fe1f2c146a60af44b9519c1

Both are the same; ignore the other one. I will add more docstrings for a better explanation.

Ayushk4 commented 4 years ago

Okay. The code looks fine to me; I will go into detail once you send the PR. A couple of things regarding the API:

  1. Let's keep the SentencePiece tokenizer agnostic of the vocab file. This will allow the same API to be used for ALBERT as well as for other models' and other languages' tokenizers.
  2. Since you will be using the ALBERT SentencePiece model for your work, you can use DataDeps for downloading the vocab file. The API would then be something like tokenizer = SentencePiece(ALBERT), similar to Embeddings.jl as you suggested. We can also have another API, tokenizer = SentencePiece(vocab_path::String) (a rough sketch follows this list).
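A rough sketch of what those two constructors could look like (purely illustrative; the `SentencePiece` type, the `ALBERT` constant, the vocab file format, and the DataDep name below are assumptions, not decisions from this thread):

```julia
using DataDeps

# Illustrative sketch of the two proposed constructors. The type name, field
# layout, ALBERT constant, and the "ALBERT sentencepiece" DataDep name are all
# placeholders, not an agreed-upon API.
struct SentencePiece
    vocab::Dict{String,Float64}   # subword piece => log-probability
end

# Build the tokenizer from a local vocab file with "<piece>\t<logprob>" lines.
function SentencePiece(vocab_path::AbstractString)
    vocab = Dict{String,Float64}()
    for line in eachline(vocab_path)
        piece, logprob = split(line, '\t')
        vocab[String(piece)] = parse(Float64, logprob)
    end
    return SentencePiece(vocab)
end

# Convenience constructor for a pretrained model, resolving the vocab file
# through DataDeps (as Embeddings.jl does for pretrained embeddings).
const ALBERT = :albert_base_v1
function SentencePiece(model::Symbol)
    path = joinpath(datadep"ALBERT sentencepiece", "$(model).vocab")
    return SentencePiece(path)
end

# tokenizer = SentencePiece(ALBERT)
# tokenizer = SentencePiece("path/to/albert.vocab")
```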

I think using TokenBuffer will not be possible here, since decoding is done with the Viterbi algorithm. Try to keep type conversions in the forward algorithm minimal, since it will be O(n²).

Since this will not be using TokenBuffer, and we can expect more statistical algorithms to be added to this package, I think we can put the code for SentencePiece inside a new folder, src/statistical.

@oxinabox what do you suggest on this?

oxinabox commented 4 years ago

I think that all sounds good to me. Yes.

aviks commented 4 years ago

Can we close this, given that #51 is merged?

Ayushk4 commented 4 years ago

There are other statistical tokenization algorithms, like WordPiece and BPE, that are yet to be added.

oxinabox commented 4 years ago

Right, but shouldn't we open separate issues for those as needed? There will always be some statistical tokenization algorithm we don't have.

Ayushk4 commented 4 years ago

Okay.