This would be great, yes.
Any suggestions on how to implement this in pure Julia? I tried wrapping SentencePiece, which works fine, but I think the wrapper degrades its performance. @Ayushk4 and @oxinabox
I think we can start with the unigram language model: a subword segmentation algorithm based on a unigram LM, which is capable of outputting multiple subword segmentations with their probabilities (from the paper by Kudo). We can use all the tokenizers from WordTokenizers as pre-tokenizers. The unigram language model can be used to solve the out-of-vocabulary problem, and also to select the most probable segmentation among multiple encodings. BytePairEncoding.jl is already implemented in Julia.
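Roughly, segmentation under a unigram LM picks the piece sequence with the highest total log-probability. Here is a minimal sketch of that dynamic program; the toy vocabulary, the function name `best_segmentation`, and the ASCII-only indexing are all simplifications for illustration:

```julia
# Toy vocabulary of subword pieces with log-probabilities (made up).
logprobs = Dict("h" => -4.0, "e" => -4.0, "l" => -4.0, "o" => -4.0,
                "he" => -2.5, "ll" => -2.5, "hell" => -2.0, "hello" => -1.5)

# best[i+1] holds (score, pieces) for the best segmentation of word[1:i].
function best_segmentation(word::String, logprobs::Dict{String,Float64})
    n = length(word)                  # assumes ASCII for simplicity
    best = Vector{Union{Nothing,Tuple{Float64,Vector{String}}}}(nothing, n + 1)
    best[1] = (0.0, String[])
    for i in 1:n, j in 1:i
        piece = word[j:i]
        haskey(logprobs, piece) || continue
        best[j] === nothing && continue
        score = best[j][1] + logprobs[piece]
        if best[i + 1] === nothing || score > best[i + 1][1]
            best[i + 1] = (score, push!(copy(best[j][2]), piece))
        end
    end
    return best[n + 1]   # nothing if the word cannot be segmented
end

best_segmentation("hello", logprobs)  # -> (-1.5, ["hello"])
```

The same table also yields multiple segmentations with their probabilities if each cell keeps more than one hypothesis, which is what the unigram model needs for sampling among encodings.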
Hi @Ayushk4 and @oxinabox, as mentioned above, I am planning to add a unigram-language-model-based subword segmentation algorithm.
APIs
@Ayushk4 are you able to consider and discuss this?
@oxinabox Yeah, sure. I can discuss this.
However, I will have to revisit statistical tokenizers before I can comment. I am a bit busy over the weekend, so @tejasvaidhyadev, it could take me a couple of days to get back to you on this.
Hi, sorry for the delay.
We are going to need byte pair encoding as part of this. There is a package, BytePairEncoding.jl, for this, but it has WordTokenizers.jl as a dependency.
If we use that as a dependency, would this circular dependency be a problem?
If we choose to implement our own BPE, we can also experiment with our fast TokenBuffer API and see whether we get better performance. Otherwise we can use the BytePairEncoding.jl package for BPE.
@oxinabox what would you suggest in this case?
Circular dependencies are indeed a problem.
Good question.
We could always make a new package, StatisticalTokenizers.jl, depending on both WordTokenizers and BytePairEncoding.jl, at least to experiment.
How hard would implementing BPE be? I don't think it would be that hard, but I am not sure.
I don't think implementing BPE will be hard.
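For reference, the core merge-learning loop is small. A rough sketch, with a made-up toy corpus and hypothetical function names (not a proposed design):

```julia
# Toy corpus: each word is a sequence of symbols, with a frequency.
vocab = Dict(["l","o","w"] => 5, ["l","o","w","e","r"] => 2,
             ["n","e","w"] => 6, ["n","e","w","e","r"] => 3)

# Count adjacent symbol pairs across the whole corpus.
function pair_counts(vocab)
    counts = Dict{Tuple{String,String},Int}()
    for (syms, freq) in vocab, i in 1:length(syms)-1
        p = (syms[i], syms[i+1])
        counts[p] = get(counts, p, 0) + freq
    end
    return counts
end

# Rewrite every word with one pair merged into a single symbol.
function apply_merge(vocab, pair)
    merged = Dict{Vector{String},Int}()
    for (syms, freq) in vocab
        out, i = String[], 1
        while i <= length(syms)
            if i < length(syms) && (syms[i], syms[i+1]) == pair
                push!(out, syms[i] * syms[i+1]); i += 2
            else
                push!(out, syms[i]); i += 1
            end
        end
        merged[out] = get(merged, out, 0) + freq
    end
    return merged
end

# Greedily learn merges: always merge the currently most frequent pair.
function learn_bpe(vocab, nmerges)
    merges = Tuple{String,String}[]
    for _ in 1:nmerges
        counts = pair_counts(vocab)
        isempty(counts) && break
        best = argmax(counts)     # key with the max value (Julia >= 1.7)
        push!(merges, best)
        vocab = apply_merge(vocab, best)
    end
    return merges
end

learn_bpe(vocab, 3)   # e.g. [("n","e"), ("ne","w"), ("l","o")]; ties are broken arbitrarily
```

The slow part in practice is not this training loop but applying the learned merges quickly at tokenization time, which is where TokenBuffer-style experiments could help.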
Hi @oxinabox, I have completed the SentencePiece processor in Julia. Here is my blog post with the implementation details. For now, it only supports the SentencePiece processor.
You should talk to @Ayushk4
@tejasvaidhyadev, I am looking at the code for SentencePiece. I will get back to you in a few hours.
@tejasvaidhyadev What is the difference between the 2 files here? https://gist.github.com/tejasvaidhyadev/21a092ff3fe1f2c146a60af44b9519c1
Both are the same; ignore the other one. I will add more docstrings for a better explanation.
Okay. The code looks okay to me; I will go into detail once you send the PR. A couple of things regarding the API: use DataDeps for downloading the vocab file, so the API will be something like
`tokenizer = SentencePiece(ALBERT)`
This can be similar to Embeddings.jl, as you suggested.
We can also have another API:
`tokenizer = SentencePiece(vocab_path::String)`
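Concretely, something like this hypothetical sketch (the struct layout, the vocab file format, and the DataDep name are all assumptions for illustration, not a settled design):

```julia
using DataDeps

struct SentencePiece
    vocab::Dict{String,Float64}   # subword piece => log-probability
end

# Constructor from a local vocab file, assuming one "piece<TAB>logprob"
# pair per line (the real SentencePiece format may differ).
function SentencePiece(vocab_path::String)
    vocab = Dict{String,Float64}()
    for line in eachline(vocab_path)
        piece, lp = split(line, '\t')
        vocab[String(piece)] = parse(Float64, lp)
    end
    return SentencePiece(vocab)
end

# The model-name variant would resolve a registered DataDep to a path
# and delegate; "ALBERT" and "spiece.vocab" are hypothetical names here.
albert_sentencepiece() =
    SentencePiece(joinpath(datadep"ALBERT", "spiece.vocab"))
```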
I think using TokenBuffer will not be possible here, since the decoding is done by the Viterbi algorithm. Try to keep type conversions in forward_algo minimal, since forward_algo will be O(n^2).
Since this will not be using TokenBuffer, and we can expect more statistical algorithms to be added to this package, I think we can put the code for SentencePiece inside a new folder, src/statistical.
@oxinabox what do you suggest on this?
I think that all sounds good to me. Yes.
Close this given #51 is merged?
There are other statistical tokenization algorithms, like WordPiece and BPE, that are yet to be added.
Right, but we should open separate issues for those as needed? Since there will always be some statistical tokenization algorithm we don't have.
Okay.
BERT and related models have been using statistical tokenization algorithms, which handle out-of-vocabulary words well with ML models. High-speed implementations of BPE, WordPiece, etc. would be good additions to the package.
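For example, WordPiece inference is greedy longest-match-first over the vocabulary. A rough sketch (the function name and toy vocab are made up; the `##` continuation marker follows the BERT convention):

```julia
# Greedily take the longest vocabulary piece at each position;
# fall back to the unknown token if no piece matches.
function wordpiece(word::String, vocab::Set{String}; unk="[UNK]")
    pieces = String[]
    start = 1
    while start <= lastindex(word)
        stop = lastindex(word)
        piece = nothing
        while stop >= start                      # longest match first
            cand = word[start:stop]
            start > 1 && (cand = "##" * cand)    # mark non-initial pieces
            if cand in vocab
                piece = cand
                break
            end
            stop = prevind(word, stop)
        end
        piece === nothing && return [unk]        # nothing matched
        push!(pieces, piece)
        start = nextind(word, stop)
    end
    return pieces
end

vocab = Set(["un", "##aff", "##able", "aff"])
wordpiece("unaffable", vocab)   # -> ["un", "##aff", "##able"]
```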
Some nice work has been going on here.
Thanks to @ninjin for the suggestion.