rdatasculptor commented 4 years ago

I was wondering: would it be possible to build a word prediction model with Ruimtehol, i.e. a model that predicts the most likely next word given a sequence of words (part of a sentence)?

I was thinking of the label prediction algorithm (tagspace, if I am correct). But then we would have to feed the model every possible sentence prefix as text, with the word that follows it as the label. I am not sure that's the way to go. Is there an easier way?
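A minimal base R sketch of the setup described above, assuming simple whitespace tokenisation. `make_prefix_pairs` is a hypothetical helper made up for illustration, not part of ruimtehol. Each prefix of a sentence becomes a training text and the word that follows it becomes the label.

```r
# Hypothetical helper (not part of ruimtehol): turn one sentence into
# (prefix, next word) pairs for a label prediction model.
make_prefix_pairs <- function(sentence) {
  tokens <- strsplit(tolower(sentence), "\\s+")[[1]]
  tokens <- tokens[nchar(tokens) > 0]
  # A sentence with fewer than two tokens has no next word to predict
  if (length(tokens) < 2) {
    return(data.frame(text = character(0), label = character(0),
                      stringsAsFactors = FALSE))
  }
  n <- length(tokens) - 1
  data.frame(
    # Prefix of length j: tokens 1..j pasted back together
    text  = vapply(seq_len(n),
                   function(j) paste(tokens[1:j], collapse = " "),
                   character(1)),
    # Label for prefix j is token j + 1
    label = tokens[-1],
    stringsAsFactors = FALSE
  )
}

pairs <- make_prefix_pairs("de kamer stemt over het voorstel")
print(pairs)
```

The resulting `text` and `label` columns should have the shape that `embed_tagspace(x = ..., y = ...)` takes, though the number of pairs grows quickly with sentence length.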

Many thanks in advance!

jwijffels commented 4 years ago

In principle this could indeed be an approach. Haven't tried this myself.

I would be interested to see what you want to do with such a thing and how such a setup would compare to using a transformer model.

Is your objective to write poetry?

rdatasculptor commented 4 years ago

My objective isn't to write poetry. I am just curious whether it could be used for, e.g., helping people choose the next word while writing text, like some smartphones do.

I made an example (and yes the code could be prettier, it's just a quick proof of concept...):


library(ruimtehol)
library(udpipe)
library(dplyr)

data(dekamer, package = "ruimtehol")
textdata <- dekamer
# textdata <- filter(textdata,    # filter condition truncated in the original post

# Tokenise the answers and drop symbols, punctuation and numbers
ud_model <- udpipe_load_model("dutch-ud-2.0-170801.udpipe")
x <- udpipe_annotate(ud_model, x = textdata$answer, doc_id = textdata$doc_id)
x <- as.data.frame(x)
x <- filter(x, upos != "SYM")
x <- filter(x, upos != "PUNCT")
x <- filter(x, upos != "NUM")

# For every document, pair each prefix of j tokens with token j + 1 as its label
docids <- unique(x$doc_id)
datalist <- list()
counter <- 0
for (i in seq_along(docids)) {
  df0 <- filter(x, doc_id == docids[i] & nchar(token) > 1)
  # Loop over row positions; 1:n-1 would parse as (1:n) - 1 and start at 0
  for (j in seq_len(max(nrow(df0) - 1, 0))) {
    df <- as.data.frame(summarise(group_by(df0[1:j, ], doc_id),
                                  text = paste(token, collapse = " ")))
    df$label <- df0$token[j + 1]
    df$number_of_tokens <- j
    counter <- counter + 1
    datalist[[counter]] <- df
  }
  print(paste0(i, " of ", length(docids)))
}
textdata <- bind_rows(datalist)
textdata <- filter(textdata, number_of_tokens != 0)

# Strip non-word characters from the prefixes
textdata$text2 <- strsplit(textdata$text, "\\W")
textdata$text2 <- lapply(textdata$text2, FUN = function(x) setdiff(x, ""))
textdata$text2 <- sapply(textdata$text2,
                         FUN = function(x) paste(x, collapse = " "))
# textdata <- filter(textdata,    # filter condition truncated in the original post

# One model per prefix length: model j predicts the word after a j-token prefix
for (j in 1:20) {
  selection <- filter(textdata, number_of_tokens == j)
  model <- embed_tagspace(x = tolower(selection$text2),
                          y = selection$label,
                          early_stopping = 0.8,
                          dim = 20, minCount = 1)
  starspace_save_model(model, file = paste0("model", j, ".tsv"), method = "tsv-data.table")
}

# Start from a random two-token prefix and chain the per-length models
word <- filter(textdata, number_of_tokens == 2)$text
word <- word[sample(seq_along(word), 1)]
sentence <- c()
for (i in 2:10) {
  model <- starspace_load_model(paste0("model", i, ".tsv"), method = "tsv-data.table")
  df <- predict(model, word, k = 1)[[1]]$prediction$label[1]
  sentence <- paste(sentence, word)
  word <- df
}
sentence

[1] " Er is soort natuurlijke Horizon zonder wegens akkoord communicatie bevestigd"

So in the end you could say this is poetry :-). At least it's quite cryptic....

Do you have an example of a transformer script? Then I can try that as well.

jwijffels commented 4 years ago

You can now also officially participate in the National Novel Generation Month

Nice setup, but of course a lot of the struggle comes from text segments being variable-length sequences. I bet that training data from a real author would generate some quite nice poetry :+1: What would also give astonishing results is doing this on word segments instead of on words (e.g. with a tokenizers package).

But to get something more meaningful, longer sequences will need to be taken into account. Starspace is not a sequence model.
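A toy base R illustration of why word order gets lost, assuming a plain bag-of-words encoder that averages word vectors. This is a simplification for illustration, not Starspace's actual implementation; the embeddings and the `encode_bag` helper are made up.

```r
# Toy embeddings, invented for illustration only (not from a trained model)
set.seed(42)
vocab <- c("de", "kamer", "stemt", "over")
emb <- matrix(rnorm(length(vocab) * 3), nrow = length(vocab),
              dimnames = list(vocab, NULL))

# Hypothetical bag-of-words encoder: average the word vectors,
# ignoring the position of each word in the text
encode_bag <- function(text) {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  colMeans(emb[tokens, , drop = FALSE])
}

a <- encode_bag("de kamer stemt over")
b <- encode_bag("over stemt kamer de")

# Both orderings map to the same vector, up to floating point tolerance
print(all.equal(a, b))
```

Under that assumption, a prefix and any reshuffling of it look identical to the model, which is one reason the generated text drifts so quickly.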

I'm not aware of any Transformer models built using R though.

rdatasculptor commented 4 years ago

That tokenizers package is very interesting. I will try that idea!

jwijffels commented 4 years ago

Note, however, that this will probably also generate quite some poetry with unexpected conjugations.

Regarding transformers, I know of only one C++ Transformer model that I could at some point build an Rcpp wrapper around. It requires GPU power to train, however. If someone else has other links, I would be happy to make an Rcpp wrapper if something C++-like exists.

rdatasculptor commented 4 years ago

Thanks for all the insights!