Branches from the feature/bert_sentence_embedding branch.
With the help of the sentence-transformers dependency, the get_pretrained_embeddings method is implemented along the lines of bert_embeddings. Unlike bert_embeddings, it is not a property method: it takes the architecture type and whether to produce sentence or document embeddings through the architecture and do_sents arguments.
The method checks that the sentence-transformers dependency is installed and that the requested architecture is among the available ones.
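A minimal sketch of what such checks could look like, not the code in this PR. The function name check_pretrained_request, the AVAILABLE_ARCHITECTURES registry and the model identifiers in it are hypothetical placeholders; only importlib.util.find_spec is a real standard-library call.

```python
import importlib.util

# Hypothetical registry mapping short architecture names to model identifiers.
# These entries are placeholders, not the project's real list of models.
AVAILABLE_ARCHITECTURES = {
    "demo-small": "example-org/demo-small",
    "demo-base": "example-org/demo-base",
}


def check_pretrained_request(architecture, package="sentence_transformers"):
    """Validate the optional dependency and the requested architecture.

    Returns the model identifier registered for `architecture`.
    """
    # find_spec returns None when a top-level package cannot be imported,
    # so the check does not import the (potentially heavy) package itself.
    if importlib.util.find_spec(package) is None:
        raise ImportError(
            f"'{package}' is required for pretrained embeddings but is not installed."
        )

    if architecture not in AVAILABLE_ARCHITECTURES:
        options = ", ".join(sorted(AVAILABLE_ARCHITECTURES))
        raise ValueError(
            f"Unknown architecture '{architecture}'. Available: {options}"
        )

    return AVAILABLE_ARCHITECTURES[architecture]
```

Failing early with a specific exception keeps the optional dependency optional: users who never request pretrained embeddings do not need sentence-transformers installed.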
Implements an sklearn-compatible PreTrainedVectorizer that takes the same arguments and wraps the Document method for accessing pretrained model embeddings.
The vectorizer does not need Text2Doc to be used beforehand in the pipeline. Transformer-based models require minimal preprocessing, and the sentence-transformers dependency already handles architecture-specific tokenization. The Document to be serialized is therefore initialized as a class attribute with the default settings of the DocBuilder.
This may change in the future if preprocessing with the filter attributes of Document proves useful.
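The shape of such a vectorizer can be sketched roughly as below. This is an illustration under assumptions, not the PreTrainedVectorizer added in this PR: the class name PreTrainedVectorizerSketch, the embed_fn parameter and the toy_embed encoder are all hypothetical, and the toy encoder stands in for the Document-based access to a real pretrained model. Only BaseEstimator and TransformerMixin are real scikit-learn classes.

```python
try:
    from sklearn.base import BaseEstimator, TransformerMixin
except ImportError:
    # Minimal stand-ins so the sketch still runs without scikit-learn installed.
    class BaseEstimator:
        pass

    class TransformerMixin:
        def fit_transform(self, X, y=None):
            return self.fit(X, y).transform(X)


def toy_embed(text, architecture, do_sents):
    """Placeholder encoder: NOT a real model, just deterministic numbers."""
    return [float(len(text)), float(do_sents)]


class PreTrainedVectorizerSketch(BaseEstimator, TransformerMixin):
    """Turns raw strings into embedding vectors; no Text2Doc step is needed."""

    def __init__(self, architecture="demo-small", do_sents=False, embed_fn=toy_embed):
        # scikit-learn convention: store constructor arguments unchanged.
        self.architecture = architecture
        self.do_sents = do_sents
        self.embed_fn = embed_fn  # hypothetical hook replacing the Document call

    def fit(self, X, y=None):
        # Pretrained embeddings need no fitting; return self for chaining.
        return self

    def transform(self, X, y=None):
        # One vector per input text, returned as a list of lists (array-like).
        return [self.embed_fn(text, self.architecture, self.do_sents) for text in X]


vectors = PreTrainedVectorizerSketch().fit_transform(["hello world", "hi"])
```

Because fit is a no-op and transform accepts raw strings, a transformer of this shape can sit as the first step of a pipeline, which mirrors the point above about not needing Text2Doc beforehand.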
Future
Currently, architecture selection is restricted to the available TR models. Support for custom fine-tuned or pre-trained architectures, covering any user's model on the huggingface hub, will be added.