bnosac / ruimtehol

R package to Embed All the Things! using StarSpace

Prediction of next word in a sentence #23

Closed rdatasculptor closed 4 years ago

rdatasculptor commented 4 years ago

I was wondering: would it be possible to build a word-prediction model with ruimtehol? That is, a prediction of the most likely next word when a sequence of words (the beginning of a sentence) is given?

I was thinking of the label prediction algorithm (TagSpace, if I am correct). But then we would have to feed the model all possible beginnings of a sentence, with the respective next words as labels. I am not sure that's the way to go. Is there an easier way?
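To make that concrete, I imagine training pairs looking something like this (an untested toy sketch; the embed_tagspace call mirrors the real one further down in this thread):

library(ruimtehol)
## toy training pairs: each sentence prefix is a text,
## the word that follows the prefix is its label
prefixes  <- c("de minister", "de minister heeft", "de minister heeft het")
nextwords <- c("heeft", "het", "woord")
model <- embed_tagspace(x = prefixes, y = nextwords,
                        dim = 20, minCount = 1, early_stopping = 0.8)
predict(model, "de minister", k = 1)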

Many thanks in advance!

jwijffels commented 4 years ago

In principle this could indeed be an approach. Haven't tried this myself.

I would be interested to see what you want to do with such a thing and how such a setup would compare to using a transformer model.

Is your objective to write poetry?

rdatasculptor commented 4 years ago

My objective isn't to write poetry. I am just curious whether it could potentially be used for e.g. helping people choose the next word while writing text, like some smartphones do.

I made an example (and yes the code could be prettier, it's just a quick proof of concept...):

library(udpipe)
library(dplyr)
library(ruimtehol)

data(dekamer, package = "ruimtehol")
textdata <- dekamer
textdata <- filter(textdata, !is.na(answer))

ud_model <- udpipe_load_model("dutch-ud-2.0-170801.udpipe")
x <- udpipe_annotate(ud_model, x = textdata$answer, doc_id = textdata$doc_id)
x <- as.data.frame(x)
x <- filter(x, ! upos %in% c("SYM", "PUNCT", "NUM"))

docids <- unique(x$doc_id)
datalist <- list()
counter <- 0
## build training pairs: for each document, every sentence prefix becomes a text
## and the token following that prefix becomes its label
for (i in seq_along(docids)){
  df0 <- filter(x, doc_id == docids[i] & nchar(token) > 1)
  for (j in seq_len(max(nrow(df0) - 1, 0))){
    df <- as.data.frame(summarise(group_by(df0[1:j, ], doc_id),
                                  text = paste(token, collapse = " ")))
    df$label <- df0$token[j + 1]
    df$number_of_tokens <- j
    counter <- counter + 1
    datalist[[counter]] <- df
  }
  print(paste0(i, " of ", length(docids)))
}
textdata <- bind_rows(datalist)
textdata <- filter(textdata, number_of_tokens != 0)
## keep only word characters in the text and drop rows without a label
textdata$text2 <- strsplit(textdata$text, "\\W")
textdata$text2 <- lapply(textdata$text2, FUN = function(x) setdiff(x, ""))
textdata$text2 <- sapply(textdata$text2,
                         FUN = function(x) paste(x, collapse = " "))
textdata <- filter(textdata, !is.na(label))

## train one tagspace model per prefix length (1 to 20 preceding tokens)
for (j in 1:20){
  set.seed(123456789)
  selection <- filter(textdata, number_of_tokens == j)
  model <- embed_tagspace(x = tolower(selection$text2),
                          y = selection$label,
                          early_stopping = 0.8,
                          dim = 20, minCount = 1)
  starspace_save_model(model, file = paste0("model",j,".tsv"), method = "tsv-data.table")
  plot(model)
}

## start from a random 2-token prefix and repeatedly predict the next word,
## each time using the model trained on prefixes of the current length
sentence <- filter(textdata, number_of_tokens == 2)$text2
set.seed(12345)
sentence <- sentence[sample(seq_along(sentence), 1)]
for (i in 2:10){
  model <- starspace_load_model(paste0("model", i, ".tsv"), method = "tsv-data.table")
  nextword <- predict(model, tolower(sentence), k = 1)[[1]]$prediction$label[1]
  sentence <- paste(sentence, nextword)
}

sentence
[1] " Er is soort natuurlijke Horizon zonder wegens akkoord communicatie bevestigd"

So in the end you could say this is poetry :-). At least it's quite cryptic....

Do you have an example of a transformer script? Then I can try that as well.

jwijffels commented 4 years ago

You can now also officially participate in the National Novel Generation Month https://github.com/NaNoGenMo

Nice setup, but a lot of struggling of course because the text segments are variable-length sequences. I bet that if you took training data from a real author, it would generate quite some nice poetry :+1: What would also give astonishing results is to do this on subword segments instead of on words (e.g. with the package at https://github.com/bnosac/tokenizers.bpe).
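Something like this could be a starting point for the subword segmentation (an untested sketch; the vocab_size and the temporary file are just illustrative choices):

library(tokenizers.bpe)
## train a byte pair encoding model on the raw texts
txt <- file.path(tempdir(), "answers.txt")
writeLines(textdata$text, txt)
bpemodel <- bpe(txt, vocab_size = 5000)
## segment a sentence into subword pieces, which would then replace
## the word tokens when building the prefix/label training pairs
bpe_encode(bpemodel, x = "de minister heeft het woord", type = "subwords")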

But to get something more meaningful, longer sequences will need to be taken into account. Starspace is not a sequence model.

I'm not aware of any Transformer models built using R though.

rdatasculptor commented 4 years ago

That tokenizers package is very interesting. I will try that idea!

jwijffels commented 4 years ago

Note, however, that this will probably also generate quite some poetry with unexpected conjugations.

Regarding transformers, the only C++ Transformer implementation I know of that I could wrap with Rcpp is at https://github.com/marian-nmt/marian. It requires GPU power to train, however. If someone has links to other C++ implementations, I would be happy to make an Rcpp wrapper.

rdatasculptor commented 4 years ago

Thanks for all the insights!