ddangelov / Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.
BSD 3-Clause "New" or "Revised" License

Domain specific corpus, pretrained models, model evaluation #51

Closed woiza closed 3 years ago

woiza commented 3 years ago

Hi,

thanks for sharing your work! I am fairly new to NLP and was still able to use it. However, I am not sure if I am doing it right. I have a domain-specific German dataset with ~11k documents. The number of words per document ranges from 10 to 900, with a mean of 67. Documents are cased and contain some domain-specific abbreviations and terms as well as spelling mistakes, numbers and special characters (data gathered from actual usage, sloppy language). I used your code (doc2vec, learn) without any text preprocessing and the results look promising!

One document consists of n sentences with k words per sentence:

print(documents[0])
Word1-1 word1-2 Word1-3 Wrd1-4abbreviated. 123 (word1-5) word1-6. word2-1 word2-2abbreviated. word2-3 WORD2-4 WORD2-5 word2-6/word2-7 word2-8...

Is this correct or should I have only words/tokens without sentences and punctuation? How does your code detect the beginning and the end of a sentence, especially if the sentence contains a misspelled or domain-specific abbreviation? What preprocessing would you recommend?
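From what I can tell from the documentation of the tokenizer parameter, the default falls back to gensim's simple_preprocess, so a document would end up roughly like this (just my own sketch; the sample string is shortened from above):

from gensim.utils import simple_preprocess

sample = "Word1-1 word1-2 Word1-3 Wrd1-4abbreviated. 123 (word1-5) word1-6."
print(simple_preprocess(sample))
# only lowercased alphabetic tokens remain; punctuation, standalone numbers
# and sentence boundaries are discarded before training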

Let's say I want to compare different models with each other and only use the best model for your REST API:

model_doc2vec_learn = Top2Vec(documents, embedding_model='doc2vec', speed='learn')
model_doc2vec_deep_learn = Top2Vec(documents, embedding_model='doc2vec', speed='deep-learn')
model_use_multilingual = Top2Vec(documents, embedding_model='universal-sentence-encoder-multilingual')
model_distiluse_multilingual = Top2Vec(documents, embedding_model='distiluse-base-multilingual-cased')
etc.
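The plan would then be to keep only the best candidate and load it in the process behind the REST API, roughly like this (save/load are Top2Vec's own methods; the file name is just a placeholder):

best_model = model_doc2vec_learn  # whichever candidate wins the comparison
best_model.save("top2vec_best_model")

# later, in the service backing the REST API
from top2vec import Top2Vec
best_model = Top2Vec.load("top2vec_best_model")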

What metrics should I use and how can this be achieved with your implementation?

"For large data sets and data sets with very unique vocabulary doc2vec could produce better results... The universal sentence encoder options are suggested for smaller data sets. "

Is my dataset (11k documents) large? It has a rather unique vocabulary... What pretrained models would you recommend?

Do I have to use sentence encoders or can I use pretrained transformers such as the following as well? https://huggingface.co/bert-base-german-cased

ddangelov commented 3 years ago

Top2Vec has a built-in tokenizer, but you can specify your own with the tokenizer parameter. I think the default should be fine for your purposes; you should not have to do any additional preprocessing.
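If you do find that the default loses important domain terms (cased acronyms, abbreviations containing digits), you can pass any callable that maps a document string to a list of tokens. A rough sketch, not tested on your data:

import re
from top2vec import Top2Vec

def domain_tokenizer(document):
    # keep case and in-word digits/hyphens so abbreviations like "Wrd1-4abbreviated"
    # survive; split on whitespace and strip only surrounding punctuation
    tokens = [tok.strip(".,;:!?()\"'") for tok in re.split(r"\s+", document)]
    return [tok for tok in tokens if tok]

model = Top2Vec(documents, embedding_model='doc2vec', speed='learn',
                tokenizer=domain_tokenizer)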

Evaluation should be based on your downstream task. I would recommend trying doc2vec or the pre-trained universal-sentence-encoder-multilingual for embedding_model. If you want a sentence transformer, use distiluse-base-multilingual-cased; however, sentence transformers are much slower and have a token limit of 512, so this should be your last choice. For topic modeling the universal sentence encoder options are preferred.
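If you have no labelled downstream task, topic coherence over the top topic words is a reasonable proxy for comparing the candidates. Something along these lines (the c_v measure and the reference tokenization are my suggestion, not something Top2Vec ships with):

from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.utils import simple_preprocess

# tokenize the corpus once so every model is scored on the same reference texts;
# with a custom tokenizer, build the dictionary with that same tokenizer
texts = [simple_preprocess(doc) for doc in documents]
dictionary = Dictionary(texts)

def coherence(model, top_n=10):
    # get_topics() returns (topic_words, word_scores, topic_nums)
    topic_words, _, _ = model.get_topics()
    topics = [list(words[:top_n]) for words in topic_words]
    return CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                          coherence="c_v").get_coherence()

for name, model in [("doc2vec learn", model_doc2vec_learn),
                    ("use multilingual", model_use_multilingual)]:
    print(name, model.get_num_topics(), coherence(model))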