ddangelov / Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.
BSD 3-Clause "New" or "Revised" License

ValueError(f"A min_count of {min_count} results in... - with German text #315

Open levrone1987 opened 1 year ago

levrone1987 commented 1 year ago

Given a text corpus (German language), I get the following error with the code shown below:

    raise ValueError(f"A min_count of {min_count} results in "
ValueError: A min_count of 50 results in all words being ignored, choose a lower value.

The code:

top2vec_model = Top2Vec(corpus, speed="learn", workers=8, embedding_model='distiluse-base-multilingual-cased')

count_id = 0
similar_top2vec = top2vec_model.search_documents_by_documents(doc_ids=[count_id])

I had to change vectorizer.get_feature_names() to vectorizer.get_feature_names_out() in Top2Vec.py in order to avoid the error associated with the missing get_feature_names method, but now I experience the above error.
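The rename mentioned above can also be handled without editing the installed package by hand. The following is a minimal sketch, assuming only that the vectorizer object exposes one of the two method names; `feature_names` is a hypothetical helper introduced here for illustration and is not part of Top2Vec or scikit-learn.

```python
def feature_names(vectorizer):
    """Return the vocabulary of a fitted vectorizer as a plain list.

    Hypothetical helper: prefers get_feature_names_out() and falls back to
    the older get_feature_names() when the newer method is missing.
    """
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        # Older releases only provide the method under its previous name.
        getter = vectorizer.get_feature_names
    return list(getter())
```

The helper works with any object that provides one of the two methods, so the same code path runs against both older and newer scikit-learn releases.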

Lotfi-AL commented 1 year ago

Could this issue be caused by top2vec finding 0 topics?

levrone1987 commented 1 year ago

@Lotfi-AL I cannot check the number of topics, because the error is raised on the line where the Top2Vec object is created. This is the full error message:

INFO:top2vec:Pre-processing documents for training
/home/oem/anaconda3/envs/news-env/lib/python3.8/site-packages/sklearn/feature_extraction/text.py:528: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn(
Traceback (most recent call last):
  File "/home/oem/news_recos/main.py", line 179, in <module>
    top2vec_model = Top2Vec(corpus, speed="learn", workers=8, embedding_model='distiluse-base-multilingual-cased')
  File "/home/oem/anaconda3/envs/news-env/lib/python3.8/site-packages/top2vec/Top2Vec.py", line 587, in __init__
    raise ValueError(f"A min_count of {min_count} results in "
ValueError: A min_count of 50 results in all words being ignored, choose a lower value.

If the above cannot be resolved, I would appreciate sample code for processing a corpus of text written in German.

levrone1987 commented 1 year ago

I would really appreciate it if someone could answer the question I stated above. @Lotfi-AL @ddangelov

ddangelov commented 1 year ago

This is likely due to your dataset being too small: no word occurs often enough to pass the min_count of 50. Set min_count=0 and also try using a larger dataset.
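To see why a small corpus triggers this error, here is a minimal sketch of a frequency-based vocabulary filter. It is an illustration only, not Top2Vec's actual implementation: the helper `surviving_words`, the whitespace tokenization, the `>=` comparison, and the three-sentence German corpus are all assumptions made for the example.

```python
from collections import Counter


def surviving_words(documents, min_count):
    """Return the words whose total count across all documents is at least min_count.

    Hypothetical helper: tokenizes on whitespace and lowercases, which is
    simpler than what a real tokenizer does.
    """
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return sorted(word for word, n in counts.items() if n >= min_count)


# Tiny made-up corpus; no word occurs anywhere near 50 times.
corpus = [
    "Die Regierung plant neue Gesetze",
    "Neue Gesetze sorgen für Diskussionen",
    "Die Diskussionen dauern an",
]

kept_default = surviving_words(corpus, min_count=50)  # nothing survives
kept_lowered = surviving_words(corpus, min_count=2)   # words seen at least twice

print(kept_default)
print(kept_lowered)
```

In the call from this issue, the threshold would be lowered by passing `min_count` as a keyword argument to `Top2Vec(...)` alongside `speed`, `workers` and `embedding_model`.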