4OH4 / doc-similarity

Ranking documents using semantic similarity in Python
MIT License

Glove method - lemmatization #8

Open MarekKeskyll opened 3 years ago

MarekKeskyll commented 3 years ago

Hi! Questions:

  1. Why didn't you use lemmatization when processing your documents? Is there a reason behind that?
  2. Why did you use this Glove pre-trained model (dimensions)?
  3. Can you validate the results somehow?
4OH4 commented 3 years ago

Hi there,

Good question on the use of lemmatization - I did use it in the TF-idf model, but not for Glove. I think it's more important for TF-idf, in order to get accurate word counts. I don't remember why I did not use it with the Glove model (the example is based on some project work that I did, but borrows heavily from the documentation) - I expect that I tried it and found that, for the particular use case I was looking at, it did not offer a performance benefit.
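To illustrate why lemmatization matters more for TF-idf than for word-vector models, here is a minimal sketch of its effect on raw term counts. The tiny lemma map below is a toy stand-in for a real lemmatizer (e.g. NLTK's or spaCy's), not the repo's actual pipeline:

```python
from collections import Counter

# Toy lemma map standing in for a real lemmatizer; a real pipeline would
# use e.g. NLTK's WordNetLemmatizer or spaCy instead of this dict.
LEMMAS = {"running": "run", "runs": "run", "ran": "run", "documents": "document"}

def term_counts(tokens, lemmatize=False):
    """Count terms, optionally collapsing inflected forms to their lemma."""
    if lemmatize:
        tokens = [LEMMAS.get(t, t) for t in tokens]
    return Counter(tokens)

tokens = ["running", "runs", "ran", "documents", "document"]
print(term_counts(tokens))                  # each surface form counted separately
print(term_counts(tokens, lemmatize=True))  # inflections merge into one count
```

Without lemmatization, TF-idf treats each inflection as a distinct term and splits the counts across them, which weakens the statistics; dense word vectors already place inflected forms close together, so the benefit there is smaller.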

The glove-wiki-gigaword-50 model is the smallest of the Gensim models trained on Wikipedia, in terms of model complexity. I was originally looking at near real-time processing of high volumes of data, so compute requirements and latency were an issue. This model is the fastest to run. You would expect accuracy benefits from moving to a more complex model, although the gains may be small and come at the cost of significantly higher memory requirements.
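For reference, the basic mechanics are independent of which model you pick: embed each document as the mean of its word vectors, then rank by cosine similarity. The sketch below uses made-up 3-d vectors in place of the real 50-d glove-wiki-gigaword-50 embeddings (which you would normally load via Gensim's downloader), so the numbers are illustrative only:

```python
import numpy as np

# Toy 3-d vectors standing in for the 50-d glove-wiki-gigaword-50 embeddings;
# in practice you would load them with gensim.downloader.
vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.0]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def doc_vector(tokens):
    # Mean of word vectors: a common baseline for embedding a short document.
    return np.mean([vectors[t] for t in tokens if t in vectors], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = doc_vector(["cat"])
print(cosine(query, doc_vector(["dog"])))  # high: related words
print(cosine(query, doc_vector(["car"])))  # low: unrelated words
```

Swapping in a larger model changes only the vector lookup (and the memory footprint), not this ranking logic, which is why the 50-d model is a reasonable starting point when latency matters.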

For my application, I was comparing against human operators who were conducting information retrieval tasks. Our metric was (something like) how often the most similar document appeared in the top-1, top-3, or top-10 positions. That is quite a hard route for validation, though, and can be quite expensive. Perhaps validation against an equivalent gold-standard automated technique might be better? TF-idf is an established and widely used technique, so it is a reasonable baseline against which to compare more advanced techniques such as Glove, to see if they offer an improvement.
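The top-k metric described above is straightforward to compute once you have ranked results and a gold label per query. A minimal sketch (the function name and data shapes are my own, since the thread only describes the metric loosely):

```python
def hit_rate_at_k(rankings, relevant, k):
    """Fraction of queries whose gold document appears in the top-k results.

    rankings: one ranked list of doc ids per query.
    relevant: the single gold doc id per query.
    """
    hits = sum(1 for ranked, gold in zip(rankings, relevant) if gold in ranked[:k])
    return hits / len(relevant)

# Three queries, each with a ranked result list and one gold document.
rankings = [["d3", "d1", "d2"], ["d2", "d9", "d1"], ["d7", "d4", "d1"]]
relevant = ["d1", "d2", "d1"]
print(hit_rate_at_k(rankings, relevant, 1))  # only query 2 hits at top-1
print(hit_rate_at_k(rankings, relevant, 3))  # all three hit within top-3
```

Running both a TF-idf baseline and a Glove model through the same function makes the comparison direct: if the embedding model does not beat the baseline's hit rate, the extra complexity is not paying off.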