dylanjcastillo / blog_comments

dylancastillo.co comments

posts/nlp-snippets-cluster-documents-using-word2vec/ #5

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

How to Cluster Documents Using Word2Vec and K-means

Learn how to cluster documents using Word2Vec. In this tutorial, you'll train a Word2Vec model, generate word embeddings, and use K-means to create groups of news articles.

https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/
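
In outline, the pipeline the post walks through: tokenize the articles, train Word2Vec on the token lists, average each document's word vectors, then cluster the document vectors with K-means. A condensed sketch (parameter values are illustrative, and tokenized_docs is assumed to be a list of token lists):

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

SEED = 42
model = Word2Vec(sentences=tokenized_docs, vector_size=100, workers=1, seed=SEED)

# One vector per document: the average of its word vectors
# (assumes every document has at least one in-vocabulary token)
vectorized_docs = [
    np.mean([model.wv[t] for t in doc if t in model.wv], axis=0)
    for doc in tokenized_docs
]

kmeans = KMeans(n_clusters=10, random_state=SEED).fit(vectorized_docs)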

paddyxxx commented 3 years ago

Hello Dylan! Thank you for the tutorial; it is extremely useful. However, I have one small problem. I am trying to run the top 100 predictive words through a pretrained Word2Vec model with this code, and unfortunately I am not getting the right results. I create a variable list_of_docs that consists of all 100 words, and tokenized_docs, which consists of the tokenized data from those 100 words, and then I run this code. However, it seems that the 100 words do not really pass through Word2Vec, as the cluster output I get consists of words that are probably contained in the pretrained Wikipedia Word2Vec model, but not of my 100 words. Do you have any idea where I am going wrong? Thank you for your help in advance! I appreciate it.

dylanjcastillo commented 3 years ago

Hi @paddyxxx,

Thank you. I'm glad you found it useful!

Can you share a code snippet with what you're trying to do? I'm not entirely sure that I understand the issue.
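
One guess in the meantime: if you're printing the terms closest to each cluster centroid with most_similar, those come from the pretrained model's entire vocabulary, not just your 100 words. To cluster only your own words, look up their vectors directly, roughly like this (the word list and cluster count are placeholders):

from sklearn.cluster import KMeans

wv = model.wv  # or the pretrained KeyedVectors you loaded
words = [...]  # your 100 predictive words (placeholder)
in_vocab = [w for w in words if w in wv]  # drop out-of-vocabulary words
vectors = [wv[w] for w in in_vocab]

kmeans = KMeans(n_clusters=5, random_state=42).fit(vectors)
for label, word in zip(kmeans.labels_, in_vocab):
    print(label, word)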

Best, Dylan

ChinmayRojindar commented 3 years ago

Hi Dylan, great explanation. I recently applied for a data science job and the company gave me the same kind of problem (I didn't get selected 😑), but the data was unstructured HTML tables in financial docs. I have attached the link to the dataset below. I tried to read the docs directly and wanted to create a feature list for each doc, but should I structure the dataset first? And how do I handle tabular data across so many docs? If you can take a look, I'd appreciate it. Thanks!

https://www.kaggle.com/drcrabkg/financial-statements-clustering

raydesentz commented 3 years ago

Hi Dylan, thanks for this breakdown; it is helping so much. I am having some trouble with training the Word2Vec model. You say sentences=tokenized_docs, and I generally understand what it is asking for, but I don't see how or where tokenized_docs was defined in the code leading up to training the Word2Vec model, so I am unsure of what exactly needs to go there. Thanks!

Edit to my above comment: Would this just be the tokens column from the df defined earlier with the text and tokens columns?

It was exactly that, so please disregard! I was not reading properly!
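
For anyone else who lands here, the line in question is roughly:

tokenized_docs = df["tokens"].values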

adambenari commented 2 years ago

I'm getting the error "NameError: name 'docs' is not defined" towards the end - where did you define docs? Thanks

rubenvisser22 commented 2 years ago

docs = df["text"].values
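
(That is, docs holds the raw article text from the same DataFrame used to build tokenized_docs, so the two stay index-aligned.)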

EricSHo commented 1 year ago

I've been enjoying this tutorial. But I encountered a problem creating the Word2Vec model in Colab:

model = Word2Vec(sentences=tokenized_docs, vector_size=100, workers=1, seed=SEED)

When I changed it to

model = Word2Vec(sentences=tokenized_docs, size=100, workers=1, seed=SEED)

It worked. Is it because of a different Word2Vec release?

Thanks, Eric.

russellclaude commented 1 year ago

df = df_raw.copy()

Is df a pandas DataFrame or not? I cannot get this code in the "Clean and Tokenize Data" script to work.

dylanjcastillo commented 1 year ago

@russellclaude, sorry I somehow removed that from the code. You need to read the data first.

df_raw = pd.read_csv("data/news_data.csv")
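
So the start of the "Clean and Tokenize Data" step reads, roughly:

import pandas as pd

df_raw = pd.read_csv("data/news_data.csv")
df = df_raw.copy()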
dylanjcastillo commented 1 year ago

@EricSHo, sorry for the late reply. That must be because you're using a different version of gensim.
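
In gensim 4.0 several Word2Vec parameters were renamed, including size to vector_size, so the right call depends on your installed version:

import gensim
print(gensim.__version__)

# gensim >= 4.0
model = Word2Vec(sentences=tokenized_docs, vector_size=100, workers=1, seed=SEED)

# gensim < 4.0
model = Word2Vec(sentences=tokenized_docs, size=100, workers=1, seed=SEED)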

russellclaude commented 1 year ago

> @russellclaude, sorry I somehow removed that from the code. You need to read the data first.
>
> df_raw = pd.read_csv("data/news_data.csv")

Thanks! I was able to get it working through trial and error. This line also needed to be added to the setup:

nltk.download('punkt')
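
With its import, the setup snippet is roughly:

import nltk

nltk.download("punkt")  # tokenizer data used by nltk.tokenize.word_tokenize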

dylanjcastillo commented 1 year ago

You’re right. Thank you!