GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish

http://sadedegel.ai

MIT License

Make Pre-processing options work for PreTrainedVectorizer #307

Open dafajon opened 2 years ago

dafajon commented 2 years ago

Currently, get_pretrained_embeddings and get_bert_embeddings operate on the raw text of the document. As a result, the preprocessing settings are never applied to the text that goes into the transformer-based vectorizers.

Add an ignore_preprocess option to the vectorizer so that callers can still use the raw text.
Build the input string from the filtered Token objects before passing it to the SentenceTransformer.encode method.
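
A rough sketch of how the two items above could fit together. It is not sadedegel's actual code: the Token class, its is_stopword / is_punct fields, the drop_stopwords / drop_punct options, and the build_input and vectorize helpers are all hypothetical names made up for illustration. Only ignore_preprocess and SentenceTransformer.encode come from this issue. The encoder is passed in as a plain callable so the sketch runs without downloading a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Token:
    # Hypothetical stand-in for a sadedegel Token; field names are assumptions.
    text: str
    is_stopword: bool = False
    is_punct: bool = False


def build_input(tokens: Sequence[Token], raw: str, *,
                drop_stopwords: bool = True,
                drop_punct: bool = True,
                ignore_preprocess: bool = False) -> str:
    """Return the string that should be handed to the encoder."""
    if ignore_preprocess:
        # Proposed opt-out: keep the current behaviour and use the raw text.
        return raw
    # Keep only the tokens that survive the (assumed) preprocessing filters.
    kept = [t.text for t in tokens
            if not (drop_stopwords and t.is_stopword)
            and not (drop_punct and t.is_punct)]
    return " ".join(kept)


def vectorize(sentences: Sequence[Tuple[Sequence[Token], str]],
              encode: Callable[[List[str]], object],
              **options):
    """Build one input string per sentence, then encode them in a single call.

    `encode` is any callable that accepts a list of strings, for example the
    bound `encode` method of a SentenceTransformer instance.
    """
    inputs = [build_input(tokens, raw, **options) for tokens, raw in sentences]
    return encode(inputs)


# Toy data: one sentence as (tokens, raw text).
sentence = (
    [Token("Bu", is_stopword=True), Token("kitap"), Token("güzel"),
     Token(".", is_punct=True)],
    "Bu kitap güzel.",
)

# A fake encoder that just echoes its inputs, so the filtering is visible.
filtered = vectorize([sentence], encode=lambda xs: xs)
unfiltered = vectorize([sentence], encode=lambda xs: xs, ignore_preprocess=True)
```

In real use the echoing lambda would be replaced by something like model.encode, where model is a loaded SentenceTransformer. Joining filtered tokens with a single space is itself an assumption, since the transformer's own tokenizer will re-split the string anyway.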