What function to use when checking similarity between documents

rdatasculptor commented 5 years ago

Dear Jan,

First of all: thank you for this brilliant package! It has been very useful to me for text classification tasks.

Now I have another problem at hand and I was wondering if ruimtehol could be of any help. I have a couple of hundred text documents. Is there a ruimtehol function that could help me rank these documents by their similarity to a completely new text document? In other words, I have a new document and I want to check which existing documents are most similar to it. My best guess was embed_articlespace(), but I couldn't find an example that seems to do exactly what I want. Is there an example somewhere, or does ruimtehol not fit my research goal, so that I have to look elsewhere? Many thanks in advance!

rdatasculptor commented 5 years ago

Okay, I think I figured it out thanks to the presentation in your post https://www.bnosac.be/index.php/blog/86-neural-text-modelling-with-r-package-ruimtehol.

jwijffels commented 5 years ago

embed_articlespace is for the setting where you have a bunch of articles (e.g. Wikipedia articles) and a new text, and you want to see which article that new text resembles the most. It also fits similar settings, e.g. where you have a knowledge base of answers to questions and a new question, and you want to see which of these answers the question resembles the most.

And yes, an example is given in that presentation.

rdatasculptor commented 5 years ago

Thanks!

jwijffels commented 5 years ago

@rdatasculptor out of curiosity, for what text classification / article recommendation exercise have you applied the model?

rdatasculptor commented 5 years ago

@jwijffels I put some texts with personality labels in the model. I have to do more research, but I am starting to believe ruimtehol performs better than Watson :-)

jwijffels commented 5 years ago

Ok, thanks for the input. It's indeed a Swiss army knife if you tune the hyperparameters such that it learns something.

rdatasculptor commented 5 years ago

Yes it is! I am still trying to understand what word embeddings are and how exactly they are calculated in ruimtehol. This field is rather new to me, but very interesting.

jwijffels commented 5 years ago

Word embeddings (as in embed_wordspace) are just a bunch of numbers which are similar for words which are used in the neighbourhood of one another.
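
To make "a bunch of numbers which are similar" a bit more concrete, here is a minimal base R sketch that ranks a few words by cosine similarity to a new text. The vectors are invented for illustration and are not output of ruimtehol; cosine similarity is used here only as an assumed, common way of comparing embeddings.

```r
# Toy 3-dimensional embeddings. The numbers are made up for illustration;
# they do not come from embed_wordspace or any trained model.
emb <- rbind(
  doctor   = c(0.9, 0.1, 0.0),
  hospital = c(0.8, 0.2, 0.1),
  tank     = c(0.0, 0.1, 0.9)
)

# Cosine similarity between one vector and every row of a matrix.
cosine <- function(v, m) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# Embedding of a hypothetical new text, again invented.
new_text <- c(0.85, 0.15, 0.05)

sims <- cosine(new_text, emb)
names(sims) <- rownames(emb)

# Highest similarity first.
ranking <- sort(sims, decreasing = TRUE)
print(ranking)
```

Words whose vectors point in roughly the same direction as the new text end up at the top of the ranking, and unrelated ones at the bottom.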

rdatasculptor commented 5 years ago

It seems a very strong way of representing the content and meaning of texts.

rdatasculptor commented 5 years ago

Two additional questions:

1. In your presentation you use the variable allarticles$text. I guess that's the same as dekamer$x?
2. If I want to look for similar documents and the input is a document as well (meaning more than one sentence), I understand I should use embed_articlespace. The input of this function is one sentence at a time. How do I deal with a document with more than one sentence as input?

jwijffels commented 5 years ago

About question 1: I see. That presentation is a knitr document. On page 24 it also had the following, but it was not shown due to the printing of head(knowledgebase):

allarticles <- data.table::setDT(knowledgebase)
allarticles <- allarticles[, list(sentence = paste(token, collapse = " ")), by = list(doc_id, sentence_id)]
allarticles <- allarticles[, list(text = paste(sentence, collapse = " \t ")), by = list(doc_id)]

About question 2: the input to embed_articlespace is (see the documentation of that function)

a data.frame with sentences containing the columns doc_id, sentence_id and token. The doc_id is just an article or document identifier, the sentence_id is just a sentence identifier, and the token column is a character field which contains words which are separated by a space and should not contain any tab characters.

If you have several sentences per article, that would look as follows.

> library(udpipe)
> x <- udpipe(c("You have a question. Go to the doctor.", "Margareth Thatcher is a former PM of the UK. She is blablabla"), "english")[, c("doc_id", "sentence_id", "token")]
> x
 doc_id sentence_id     token
   doc1           1       You
   doc1           1      have
   doc1           1         a
   doc1           1  question
   doc1           1         .
   doc1           2        Go
   doc1           2        to
   doc1           2       the
   doc1           2    doctor
   doc1           2         .
   doc2           1 Margareth
   doc2           1  Thatcher
   doc2           1        is
   doc2           1         a
   doc2           1    former
   doc2           1        PM
   doc2           1        of
   doc2           1       the
   doc2           1        UK
   doc2           1         .
   doc2           2       She
   doc2           2        is
   doc2           2 blablabla
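
For readers without data.table, the two collapse steps from the snippet under question 1 can be sketched in base R roughly as follows. This is only an illustrative equivalent under my own assumptions: the data.frame is typed in by hand to mirror the output above (so no udpipe model is needed), and the column name text is chosen to match allarticles$text.

```r
# Same shape as the udpipe output above, entered by hand.
x <- data.frame(
  doc_id = c(rep("doc1", 10), rep("doc2", 13)),
  sentence_id = c(rep(1, 5), rep(2, 5), rep(1, 10), rep(2, 3)),
  token = c("You", "have", "a", "question", ".",
            "Go", "to", "the", "doctor", ".",
            "Margareth", "Thatcher", "is", "a", "former",
            "PM", "of", "the", "UK", ".",
            "She", "is", "blablabla"),
  stringsAsFactors = FALSE
)

# Step 1: one row per sentence, tokens joined by a single space.
sentences <- aggregate(token ~ doc_id + sentence_id, data = x,
                       FUN = paste, collapse = " ")

# aggregate() sorts by sentence_id first, so restore document order.
sentences <- sentences[order(sentences$doc_id, sentences$sentence_id), ]

# Step 2: one row per document, sentences joined by " \t ".
docs <- aggregate(token ~ doc_id, data = sentences,
                  FUN = paste, collapse = " \t ")
names(docs)[2] <- "text"

print(docs)
```

Each row of docs then holds one document as a single string, with sentences separated by the tab separator.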

rdatasculptor commented 5 years ago

Thank you for your answers! Regarding question 2, I must admit I made an error. I meant the predict function for checking which documents are most similar to a given sentence: scores <- predict(model, "wat was de precieze oorzaak van de technische problemen", basedoc = allarticles$text). What if I have a complete document, instead of one sentence, for which I want to find similar documents?

jwijffels commented 5 years ago

Use the same code as shown in the answer I gave to question 1. So provide a character string where words are separated by spaces and sentences are joined with the tab separator, as in predict(model, "wat was de precieze oorzaak van de technische problemen \t wat viel er in panne \t welke dienst heeft u gebeld")

rdatasculptor commented 5 years ago

Thanks again! It is completely clear now.

jwijffels commented 5 years ago

Note: it should be " \t ", not "\t", to separate the sentences.

allarticles <- data.table::setDT(knowledgebase)
allarticles <- allarticles[, list(sentence = paste(token, collapse = " ")), by = list(doc_id, sentence_id)]
allarticles <- allarticles[, list(text = paste(sentence, collapse = " \t ")), by = list(doc_id)]

rdatasculptor commented 5 years ago

Okay thanks! I altered my code. You are really helpful.

rdatasculptor commented 4 years ago

Hi Jan, following issue #22, should I now use "\t" as the sentence separator instead of " \t " (after updating to the latest GitHub version, of course)?

jwijffels commented 4 years ago

Yes, correct!
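
Since the separator depends on the installed version, one way to keep it in a single place is a small helper like the one below. This is only a sketch: as_query is a hypothetical name, not a ruimtehol function, and the predict call is commented out because it needs a trained model and the allarticles object from earlier in this thread.

```r
# Hypothetical helper (not part of ruimtehol): joins the sentences of one
# document into a single query string using the given separator.
as_query <- function(sentences, sep = "\t") {
  sentences <- trimws(sentences)
  paste(sentences, collapse = sep)
}

new_doc <- c("wat was de precieze oorzaak van de technische problemen",
             "wat viel er in panne",
             "welke dienst heeft u gebeld")

# Latest GitHub version, per the answer above: plain "\t".
query_new <- as_query(new_doc)

# Older versions, as discussed earlier in this thread: " \t ".
query_old <- as_query(new_doc, sep = " \t ")

# Requires a trained model, so it is not run here:
# scores <- predict(model, query_new, basedoc = allarticles$text)
```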

rdatasculptor commented 4 years ago

I guess there's no easy way to install the GitHub version without having to use RTools? I still work in a restricted network, unfortunately.

jwijffels commented 4 years ago

Yes, you need RTools on Windows. Which version of R are you on?

rdatasculptor commented 4 years ago

3.5.1

jwijffels commented 4 years ago

This is a binary of ruimtehol 0.2.2 for R 3.5.1 on Windows: http://www.datatailor.be/ruimtehol_0.2.2.zip

rdatasculptor commented 4 years ago

thanks!

bnosac / ruimtehol

What function to use when checking similarity between documents #16