Closed rdatasculptor closed 5 years ago
Okay, I think I figured it out thanks to the presentation in your post https://www.bnosac.be/index.php/blog/86-neural-text-modelling-with-r-package-ruimtehol.
embed_articlespace is for the setting where you have a bunch of articles (e.g. Wikipedia articles) and a new text, and you want to see which article that new text resembles the most. It also fits similar settings, e.g. where you have a knowledge base of answers to questions and want to see which of these answers a new question resembles the most.
And yes, an example is given in that presentation.
Thanks!
@rdatasculptor out of curiosity, for what text classification / article recommendation exercise have you applied the model?
@jwijffels I put some texts with personality labels in the model. I have to do more research, but I am starting to believe ruimtehol performs better than Watson :-)
Ok, thanks for the input. It's indeed a Swiss army knife if you tune the hyperparameters such that it learns something.
yes it is! I am still trying to understand what word embeddings are or how they are calculated exactly in ruimtehol. This field is rather new to me, but very interesting.
Word embeddings (as in embed_wordspace) are just a bunch of numbers which are similar for words which are used in the neighbourhood of one another.
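To make "numbers which are similar" concrete, here is a minimal base R sketch. The three vectors are invented for illustration and do not come from a trained ruimtehol model; cosine similarity is used here as one common way to compare such vectors, not necessarily the measure your model is configured with.

```r
# Toy 3-dimensional embeddings; the numbers are made up for illustration only
emb <- rbind(
  doctor  = c(0.9, 0.1, 0.0),
  nurse   = c(0.8, 0.2, 0.1),
  printer = c(0.0, 0.1, 0.9)
)

# Cosine similarity between two numeric vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Words used in similar contexts should end up with similar vectors
sim_doctor_nurse   <- cosine(emb["doctor", ], emb["nurse", ])
sim_doctor_printer <- cosine(emb["doctor", ], emb["printer", ])
```

In this toy setup sim_doctor_nurse comes out much larger than sim_doctor_printer, which is the behaviour you would hope for from real embeddings.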
it seems a very strong way of making a representation of the content and meaning of texts
Two additional questions:
embed_articlespace. The input of this function is one sentence at a time. How to deal with a document with more than one sentence as an input?
About question 1
I see. That presentation is a knitr document. On page 24 it also had the following but it was not shown due to the printing of head(knowledgebase)
allarticles <- data.table::setDT(knowledgebase)
allarticles <- allarticles[, list(sentence = paste(token, collapse = " ")), by = list(doc_id, sentence_id)]
allarticles <- allarticles[, list(text = paste(sentence, collapse = " \t ")), by = list(doc_id)]
About question 2. The input to embed_articlespace is (see the documentation of that function):
a data.frame with sentences containing the columns doc_id, sentence_id and token. The doc_id is just an article or document identifier, the sentence_id column is a character field which contains words which are separated by a space and should not contain any tab characters.
If you have several sentences per article, that would look as follows.
> library(udpipe)
> x <- udpipe(c("You have a question. Go to the doctor.", "Margareth Thatcher is a former PM of the UK. She is blablabla"), "english")[, c("doc_id", "sentence_id", "token")]
> x
doc_id sentence_id token
doc1 1 You
doc1 1 have
doc1 1 a
doc1 1 question
doc1 1 .
doc1 2 Go
doc1 2 to
doc1 2 the
doc1 2 doctor
doc1 2 .
doc2 1 Margareth
doc2 1 Thatcher
doc2 1 is
doc2 1 a
doc2 1 former
doc2 1 PM
doc2 1 of
doc2 1 the
doc2 1 UK
doc2 1 .
doc2 2 She
doc2 2 is
doc2 2 blablabla
Thank you for your answers!
Regarding question 2, I must admit I made an error. I meant the predict function for checking which documents are most similar to a given sentence:
scores <- predict(model, "wat was de precieze oorzaak van de technische problemen", basedoc = allarticles$text)
What if I have a complete document instead of one sentence and want to predict which documents are similar to it?
Use the same code as shown in the answer I gave on question 1. So provide a character string where words are separated by spaces and sentences are joined with the tab separator, as in predict(model, "wat was de precieze oorzaak van de technische problemen \t wat viel er in panne \t welke dienst heeft u gebeld")
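If the new document is still one raw string, something like the following could build that tab-separated input. It is only a sketch: document_to_query is a hypothetical helper, the sentence splitting on ., ! and ? is deliberately naive (a tokeniser such as udpipe would do better), and the example text is made up.

```r
# Hypothetical helper: split a raw document into sentences and join them
# with the separator used in this thread (" \t ")
document_to_query <- function(text, sep = " \t ") {
  # Naive split: whitespace that follows ., ! or ?
  sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
  sentences <- trimws(sentences)
  sentences <- sentences[nzchar(sentences)]
  paste(sentences, collapse = sep)
}

query <- document_to_query(
  "What caused the technical problems? What broke down? Which service did you call?"
)

# With a trained model and the allarticles table from earlier in the thread:
# scores <- predict(model, query, basedoc = allarticles$text)
```

The sep argument is there so the separator can be changed in one place if your version of the package expects a different one.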
Thanks again! It is completely clear now.
Note: it should be " \t ", not "\t", to separate the sentences.
allarticles <- data.table::setDT(knowledgebase)
allarticles <- allarticles[, list(sentence = paste(token, collapse = " ")), by = list(doc_id, sentence_id)]
allarticles <- allarticles[, list(text = paste(sentence, collapse = " \t ")), by = list(doc_id)]
Okay thanks! I altered my code. You are really helpful.
Hi Jan, following issue #22, should I now use "\t" as the sentence separator instead of " \t " (after updating to the latest GitHub version, of course)?
Yes, correct!
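If texts were already prepared with the old " \t " separator, they can be converted rather than rebuilt. A minimal sketch in base R, with a made-up example string:

```r
# Text prepared with the old separator (space, tab, space)
old_style <- "what caused the problem \t what broke down"

# Replace it with a bare tab; fixed = TRUE treats the pattern literally
new_style <- gsub(" \t ", "\t", old_style, fixed = TRUE)
```

Applied to a whole column, e.g. the text column built earlier, gsub works element-wise on the character vector.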
I guess there's no easy way to install the GitHub version without having to use RTools? I still work in a restricted network, unfortunately.
Yes, you need RTools on Windows. Which version of R on Windows are you on?
3.5.1
This is a Windows binary of ruimtehol 0.2.2 for R 3.5.1: http://www.datatailor.be/ruimtehol_0.2.2.zip
thanks!
Dear Jan,
First of all: thank you for this brilliant package! For me it has been very useful for text classification tasks.
Now I have another problem at hand and I was wondering if ruimtehol could be of any help. I have a couple of hundred text documents. Is there a ruimtehol function that could help me rank these documents by similarity to a completely new text document? So I have a new document and I want to check which existing documents are most similar to it. My best guess was embed_articlespace(), but I couldn't find an example that seems to do exactly what I want. Is there an example somewhere, or doesn't ruimtehol fit my research goal and do I have to look elsewhere? Many thanks in advance!