bblock/doc.py
Use the sentence-transformers library to load a pre-trained Turkish model for bert_embeddings.
The encode method of the SentenceTransformer object handles tokenization with the pre-trained model's own tokenizer.
The BertTokenizer-related utilities of sadedegel are no longer needed to generate bert_embeddings.
Add bert_document_embedding, which generates a single embedding vector for the whole document, also via sentence-transformers.
bert_embeddings and bert_document_embedding are independent of the tokenizer configured on the Doc object, because tokenization is handled by the new dependency. This opens the way to using other pre-trained models from the HuggingFace Hub in the future.
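As a rough illustration of the approach described above (not the actual code in bblock/doc.py), the two embedding paths could look like the sketch below. The helper names load_model, sentence_embeddings and document_embedding, as well as the placeholder model id, are assumptions made for this example; only SentenceTransformer and its encode method come from the library.

```python
from typing import Any, Sequence

# Assumption: placeholder model id. The PR text does not name the Turkish model.
MODEL_NAME = "your-org/turkish-sentence-model"


def load_model(name: str = MODEL_NAME) -> Any:
    """Load a sentence-transformers model (imported lazily, only when needed)."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def sentence_embeddings(model: Any, sentences: Sequence[str]) -> Any:
    """Return one vector per sentence.

    encode() tokenizes with the model's own tokenizer, so the tokenizer
    configured on the Doc object plays no role here.
    """
    return model.encode(list(sentences))


def document_embedding(model: Any, raw_text: str) -> Any:
    """Return a single vector for the whole document text."""
    return model.encode([raw_text])[0]
```

Both helpers accept any object exposing an encode method, so a different HuggingFace Hub model could be swapped in by changing only the name passed to load_model.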
extra.requirements.txt
Remove torch and transformers, since both are installed as dependencies of sentence-transformers.
setup.py
Add sentence-transformers to the dictionary passed to extras_require. Remove torch and transformers.
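For orientation, the resulting extras mapping might resemble the fragment below. This is a sketch only: the key name "bert" and the variable name EXTRAS_REQUIRE are assumptions, since the PR text does not show the actual dictionary.

```python
# Sketch of the mapping that setup.py would pass as
# setup(..., extras_require=EXTRAS_REQUIRE).
# Assumption: the extra is called "bert"; the PR text does not state the key.
EXTRAS_REQUIRE = {
    # torch and transformers are no longer listed explicitly because
    # sentence-transformers installs them as its own dependencies.
    "bert": ["sentence-transformers"],
}
```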
test_building_blocks.py
Add a test case for long sequences (longer than 512 tokens).
Remove the NotImplementedError handling from the test case, since embedding generation is now decoupled from the configured tokenizer.
Future
Implement a get_pretrained_embeddings method that wraps any transformers-based Turkish pre-trained model (readily available or custom).
Use that method inside an sklearn-compatible PreTrainedVectorizer class.
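One possible shape for this future work is sketched below. Nothing here exists in the PR: the names get_pretrained_embeddings and PreTrainedVectorizer are taken from the plan above, but their signatures, the encoder parameter, and the import fallback are hypothetical choices made for illustration.

```python
try:
    from sklearn.base import BaseEstimator, TransformerMixin
except ImportError:
    # Fallback stand-ins so the sketch still imports without scikit-learn.
    class BaseEstimator:
        pass

    class TransformerMixin:
        pass


def get_pretrained_embeddings(texts, encoder):
    """Encode texts with any object exposing encode(list_of_str).

    Hypothetical signature: a SentenceTransformer instance would satisfy it.
    """
    return encoder.encode(list(texts))


class PreTrainedVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical sklearn-style wrapper around a pre-trained encoder."""

    def __init__(self, encoder=None):
        # Store the parameter unchanged, as sklearn estimators are expected to.
        self.encoder = encoder

    def fit(self, X, y=None):
        # A pre-trained encoder needs no fitting; return self for chaining.
        return self

    def transform(self, X):
        return get_pretrained_embeddings(X, self.encoder)
```

Keeping fit as a no-op would let such a class sit in an sklearn Pipeline in front of a classifier, while the choice of pre-trained model stays a constructor argument.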