Feature/bert sentence embedding

bblock/doc.py

Change the implementation of bert_embeddings.
Use the sentence-transformers library to load a pre-trained model for Turkish.
The SentenceTransformer object and its encode method handle tokenization with the pre-trained model's own tokenizer.
The BertTokenizer-related utilities of sadedegel are no longer needed for bert_embeddings generation.
Add bert_document_embedding to generate a single embedding vector for the whole document, also using sentence-transformers.
bert_embeddings and bert_document_embedding are independent of the tokenizer configured on the Doc object; tokenization is handled by the new dependency. This will make it possible to use other pre-trained models from the HuggingFace Hub in the future.
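
As a rough illustration of the change described above, the sketch below shows how both embeddings could be produced through an encode method. It is not the code in bblock/doc.py: the load_encoder helper and the function signatures are hypothetical, the model name is left to the caller, and encoding the space-joined sentences as one text to get the document vector is an assumption.

```python
from typing import Protocol, Sequence


class Encoder(Protocol):
    """Anything exposing a sentence-transformers style ``encode`` method."""

    def encode(self, sentences, **kwargs): ...


def load_encoder(model_name: str) -> Encoder:
    """Hypothetical helper: load a pre-trained model by name.

    Requires the sentence-transformers package; it is imported lazily so
    the rest of this sketch works without it.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def bert_embeddings(sentences: Sequence[str], encoder: Encoder):
    """Return one embedding per sentence.

    The encoder tokenizes each sentence with the model's own tokenizer,
    so no tokenizer configured elsewhere is consulted.
    """
    return encoder.encode(list(sentences), show_progress_bar=False)


def bert_document_embedding(sentences: Sequence[str], encoder: Encoder):
    """Return a single embedding for the whole document.

    Assumption: the document is encoded as one text made of its sentences
    joined by spaces; the actual implementation may differ.
    """
    return encoder.encode([" ".join(sentences)], show_progress_bar=False)
```

With the real library, an encoder obtained from load_encoder returns a NumPy array by default, with one row per input text.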

extra.requirements.txt

Remove torch and transformers, as they are installed as dependencies of sentence-transformers.

setup.py

Add sentence-transformers to the extras_require dictionary. Remove torch and transformers.
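
The resulting setup.py change might look roughly like the fragment below. The extras key "bert" and the unpinned requirement are assumptions for illustration; the actual key and version constraints in setup.py may differ.

```python
# Hypothetical fragment of setup.py; the "bert" key name is an assumption.
extras_require = {
    # torch and transformers are no longer listed explicitly:
    # sentence-transformers pulls them in as its own dependencies.
    "bert": ["sentence-transformers"],
}
```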

test_building_blocks.py

Add a test case for long sequences (longer than 512 tokens).
Remove the NotImplementedError handling from the test case, since embedding generation is decoupled from the configured tokenizer.

Future

Implement a get_pretrained_embeddings method that wraps any transformers-based Turkish pre-trained model (readily available or custom).
Use that method inside a scikit-learn-compatible PreTrainedVectorizer class.
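
One possible shape for this future work is sketched below. Everything here is hypothetical: the signature of get_pretrained_embeddings, the constructor argument of PreTrainedVectorizer, and the choice to pass the model in as an object with an encode method. To stay dependency-free, the class only mimics the scikit-learn fit/transform protocol; a real implementation would presumably inherit from sklearn.base.BaseEstimator and TransformerMixin.

```python
from typing import Iterable


def get_pretrained_embeddings(texts: Iterable[str], encoder):
    """Hypothetical wrapper: embed texts with any model exposing ``encode``."""
    return encoder.encode(list(texts))


class PreTrainedVectorizer:
    """Sketch of a vectorizer following the scikit-learn fit/transform protocol."""

    def __init__(self, encoder):
        # Assumption: the pre-trained model is injected by the caller.
        self.encoder = encoder

    def fit(self, X, y=None):
        # Nothing to learn: the model is already trained.
        return self

    def transform(self, X):
        # Delegate to the wrapper so every model goes through one code path.
        return get_pretrained_embeddings(X, self.encoder)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
```

Injecting the encoder keeps the vectorizer independent of any particular pre-trained model, which matches the goal of supporting both readily available and custom models.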

GlobalMaksimum / sadedegel