ddangelov / Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.
BSD 3-Clause "New" or "Revised" License

Domain specific corpus, pretrained models, model evaluation #51

Closed woiza closed 3 years ago

woiza commented 3 years ago

Hi,

thanks for sharing your work! I am fairly new to NLP and was still able to use it. However, I am not sure if I am doing it right. I have a domain-specific German dataset with ~11k documents. The number of words per document ranges from 10 to 900, with a mean of 67. Documents are cased and contain some domain-specific abbreviations and terms as well as spelling mistakes, numbers and special characters (data gathered from actual usage, sloppy language). I used your code (doc2vec, learn) without any text preprocessing and the results look promising!

One document consists of n sentences with k words per sentence:

print(documents[0])
Word1-1 word1-2 Word1-3 Wrd1-4abbreviated. 123 (word1-5) word1-6. word2-1 word2-2abbreviated. word2-3 WORD2-4 WORD2-5 word2-6/word2-7 word2-8...

Is this correct or should I have only words/tokens without sentences and punctuation? How does your code detect the beginning and the end of a sentence, especially if the sentence contains a misspelled or domain-specific abbreviation? What preprocessing would you recommend?
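From what I can tell from the documentation of the tokenizer parameter, the default falls back to gensim's simple_preprocess, so a document would end up roughly like this (just my own sketch; the sample string is shortened from above):

from gensim.utils import simple_preprocess

sample = "Word1-1 word1-2 Word1-3 Wrd1-4abbreviated. 123 (word1-5) word1-6."
print(simple_preprocess(sample))
# only lowercased alphabetic tokens remain; punctuation, standalone numbers
# and sentence boundaries are discarded before training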

Let's say I want to compare different models with each other and only use the best model for your REST API:

model_doc2vec_learn = Top2Vec(documents, embedding_model='doc2vec', speed='learn')
model_doc2vec_deep_learn = Top2Vec(documents, embedding_model='doc2vec', speed='deep-learn')
model_use_multilingual = Top2Vec(documents, embedding_model='universal-sentence-encoder-multilingual')
model_distiluse_multilingual = Top2Vec(documents, embedding_model='distiluse-base-multilingual-cased')
etc.
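The plan would then be to keep only the best candidate and load it in the process behind the REST API, roughly like this (save/load are Top2Vec's own methods; the file name is just a placeholder):

best_model = model_doc2vec_learn  # whichever candidate wins the comparison
best_model.save("top2vec_best_model")

# later, in the service backing the REST API
from top2vec import Top2Vec
best_model = Top2Vec.load("top2vec_best_model")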

What metrics should I use and how can this be achieved with your implementation?

"For large data sets and data sets with very unique vocabulary doc2vec could produce better results... The universal sentence encoder options are suggested for smaller data sets. "

Is my dataset (11k documents) large? It has a rather unique vocabulary... What pretrained models would you recommend?

Do I have to use sentence encoders or can I use pretrained transformers such as the following as well? https://huggingface.co/bert-base-german-cased

ddangelov commented 3 years ago

Top2Vec has a built-in tokenizer, but you can specify your own with the tokenizer parameter. I think the default should be fine for your purposes; you should not have to do any additional preprocessing.
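If you do find that the default loses important domain terms (cased acronyms, abbreviations containing digits), you can pass any callable that maps a document string to a list of tokens. A rough sketch, not tested on your data:

import re
from top2vec import Top2Vec

def domain_tokenizer(document):
    # keep case and in-word digits/hyphens so abbreviations like "Wrd1-4abbreviated"
    # survive; split on whitespace and strip only surrounding punctuation
    tokens = [tok.strip(".,;:!?()\"'") for tok in re.split(r"\s+", document)]
    return [tok for tok in tokens if tok]

model = Top2Vec(documents, embedding_model='doc2vec', speed='learn',
                tokenizer=domain_tokenizer)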

Evaluation should be based on your downstream task. I would recommend trying doc2vec or the pre-trained universal-sentence-encoder-multilingual for embedding_model. If you want a sentence transformer, use distiluse-base-multilingual-cased; however, sentence transformers are much slower and have a token limit of 512, so this should be your last choice. For topic modeling the universal sentence encoder options are preferred.
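If you have no labelled downstream task, topic coherence over the top topic words is a reasonable proxy for comparing the candidates. Something along these lines (the c_v measure and the reference tokenization are my suggestion, not something Top2Vec ships with):

from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.utils import simple_preprocess

# tokenize the corpus once so every model is scored on the same reference texts;
# with a custom tokenizer, build the dictionary with that same tokenizer
texts = [simple_preprocess(doc) for doc in documents]
dictionary = Dictionary(texts)

def coherence(model, top_n=10):
    # get_topics() returns (topic_words, word_scores, topic_nums)
    topic_words, _, _ = model.get_topics()
    topics = [list(words[:top_n]) for words in topic_words]
    return CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                          coherence="c_v").get_coherence()

for name, model in [("doc2vec learn", model_doc2vec_learn),
                    ("use multilingual", model_use_multilingual)]:
    print(name, model.get_num_topics(), coherence(model))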