chengchingwen opened 2 years ago
The new tokenizer API (using TextEncodeBase) is basically finished and included in the 0.1.16 release, though the gpt part is skipped for now. The next step is fixing the huggingface download issue with HuggingFaceApi.jl. Rewriting the attention layer might be breaking, so that will probably be the last thing to do.
Some other issues that might also need to be tracked:
@chengchingwen Peter, what is the approximate timeframe for implementing the model transfer from Huggingface?
@MNLubov Are you looking for a specific model from HuggingFace? I'm trying to fix the huggingface module this month, so if everything goes well, it would be workable again before August.
Just to clarify, even if that huggingface module is fixed, it's still possible that we don't have the implementation for that model type (by model type, I mean something like `bert`, `gpt2`, `t5`, etc.). So if you are looking for a model type that we don't have, please open another issue (and the timeline for that would be unknown for now).
@chengchingwen Thanks for the clarification. Currently I am testing different sentence-transformers from Huggingface to find the one most suitable for my purposes. As a temporary solution, I use PyCall (see the sketch below). As far as I understand, you now have `bert`, `gpt`, and `roberta` implementations.
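Roughly, that PyCall workaround looks like this (assuming the Python `sentence-transformers` package is installed in the environment PyCall uses; the checkpoint name is just an example):

```julia
using PyCall

# drive the Python sentence-transformers package from Julia as a stopgap
st = pyimport("sentence_transformers")
model = st.SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

sentences = ["This is an example sentence.", "Each sentence is converted to a vector."]
embeddings = model.encode(sentences)  # one embedding row per sentence
size(embeddings)                      # e.g. (2, 384) for MiniLM-L12
```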
@MNLubov Yes. I haven't investigated the sentence-transformers implementation, but it seems it can also be done with the normal huggingface interface. For example, https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 is a `bert` model, so it should be workable by following the huggingface transformer usage in the readme once we fix the module.
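To illustrate, a rough sketch of what that might look like after the fix (the exact `hgf"..."` items and the `encode` call are assumptions based on the readme-style interface, so names may differ between releases):

```julia
using Transformers
using Transformers.HuggingFace  # assumes the huggingface module once the download fix lands
using TextEncodeBase            # provides the generic encode interface

# hypothetical sketch: the checkpoint itself is a plain bert model,
# so the generic bert loaders should apply
textenc = hgf"sentence-transformers/all-MiniLM-L12-v2:tokenizer"
model   = hgf"sentence-transformers/all-MiniLM-L12-v2:model"

# tokenize and index a sentence as bert input
sample = encode(textenc, "This is an example sentence.")
```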
Here are some things I'm going to rewrite for the new release:

- `Basic.Vocabulary` with `TextEncodeBase.Vocab`.

Feel free to add comments.
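For reference, a minimal sketch of the `TextEncodeBase.Vocab` interface that would take over from `Basic.Vocabulary` (assuming the `Vocab`/`lookup` API exported by TextEncodeBase.jl):

```julia
using TextEncodeBase

# build a vocabulary with an explicit unknown token
vocab = Vocab(["[UNK]", "the", "cat", "sat"], "[UNK]")

lookup(vocab, "cat")  # word -> index
lookup(vocab, 2)      # index -> word
lookup(vocab, "dog")  # unknown word -> index of "[UNK]"
```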