Closed rdatasculptor closed 4 years ago
In principle this could indeed be an approach. Haven't tried this myself.
I would be interested to see what you want to do with such a thing and how such a setup would compare to using a transformer model.
Is your objective to write poetry?
My objective isn't to write poetry. I am just curious whether it could be used, for example, to help people choose the next word while writing text, like some smartphones do.
I made an example (and yes the code could be prettier, it's just a quick proof of concept...):
library(udpipe)
library(dplyr)
library(ruimtehol)
data(dekamer, package = "ruimtehol")
textdata <- dekamer
textdata <- filter(textdata, !is.na(answer))
ud_model <- udpipe_load_model("dutch-ud-2.0-170801.udpipe")
x <- udpipe_annotate(ud_model, x = textdata$answer, doc_id = textdata$doc_id)
x <- as.data.frame(x)
# Drop symbols, punctuation and numbers
x <- filter(x, !upos %in% c("SYM", "PUNCT", "NUM"))
docids <- unique(x$doc_id)
datalist <- list()
counter <- 0
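for (i in seq_along(docids)) {
  # Tokens of one document, without single-character tokens
  df0 <- filter(x, doc_id == docids[i] & nchar(token) > 1)
  df0$token_id <- as.numeric(df0$token_id)
  # Number of prefixes to build. Note that 1:n-1 parses as (1:n)-1 and starts at 0,
  # so the upper bound is computed first and never exceeds the available tokens.
  n_prefix <- min(max(df0$token_id), nrow(df0)) - 1
  for (j in seq_len(max(n_prefix, 0))) {
    # The first j tokens form the text, token j + 1 is the label to predict
    df <- as.data.frame(summarise(group_by(df0[1:j, ], doc_id), text = paste(token, collapse = " ")))
    df$label <- df0$token[j + 1]
    df$number_of_tokens <- j
    counter <- counter + 1
    datalist[[counter]] <- df
  }
  print(paste0(i, " of ", length(docids)))
}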
textdata <- bind_rows(datalist)
textdata <- filter(textdata, number_of_tokens != 0)
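# Split on non-word characters, drop empty pieces and glue the words back together.
# Note that setdiff() also drops repeated words within a text.
textdata$text2 <- strsplit(textdata$text, "\\W")
textdata$text2 <- lapply(textdata$text2, FUN = function(x) setdiff(x, ""))
textdata$text2 <- sapply(textdata$text2,
                         FUN = function(x) paste(x, collapse = " "))
textdata <- filter(textdata, !is.na(label))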
# Train one tagspace model per prefix length (1 to 20 tokens)
for (j in 1:20) {
  set.seed(123456789)
  selection <- filter(textdata, number_of_tokens == j)
  model <- embed_tagspace(x = tolower(selection$text2),
                          y = selection$label,
                          early_stopping = 0.8,
                          dim = 20, minCount = 1)
  starspace_save_model(model, file = paste0("model", j, ".tsv"), method = "tsv-data.table")
  plot(model)
}
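# Start from a random two-token prefix and extend it one predicted word at a time
word <- filter(textdata, number_of_tokens == 2)$text
set.seed(12345)
word <- word[sample(seq_along(word), 1)]
sentence <- c()
for (i in 2:10) {
  # Assign the loaded model; otherwise predict() keeps using the last model trained above
  model <- starspace_load_model(paste0("model", i, ".tsv"), method = "tsv-data.table")
  df <- predict(model, word, k = 1)[[1]]$prediction$label[1]
  sentence <- paste(sentence, word)
  word <- df
}
sentence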
[1] " Er is soort natuurlijke Horizon zonder wegens akkoord communicatie bevestigd"
So in the end you could say this is poetry :-). At least it's quite cryptic....
Do you have an example of a transformer script? Then I can try that as well.
You can now also officially participate in the National Novel Generation Month https://github.com/NaNoGenMo
Nice setup, but of course a lot of the struggle comes from text segments being variable-length sequences. I bet that training data from a real author would generate some quite nice poetry :+1: What would also give astonishing results is doing this on word segments instead of on words (e.g. with the package at https://github.com/bnosac/tokenizers.bpe).
But to get something more meaningful, longer sequences will need to be taken into account. Starspace is not a sequence model.
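To illustrate why word order matters here, below is a minimal base R sketch. It assumes, as a simplification, that the model input is a plain bag of words; `bag_of_words` is a hypothetical helper written for this illustration, not a ruimtehol or StarSpace function, and the real model input may include further features.

```r
# Simplified stand-in for a bag-of-words input: count tokens, ignore their order.
# bag_of_words() is a hypothetical helper, not part of ruimtehol.
bag_of_words <- function(text) {
  tokens <- strsplit(tolower(text), " ", fixed = TRUE)[[1]]
  counts <- table(tokens)
  counts[order(names(counts))]
}

a <- bag_of_words("de minister antwoordt de vraag")
b <- bag_of_words("de vraag antwoordt de minister")

# Both prefixes contain the same words with the same counts,
# so an order-insensitive model cannot tell them apart.
same_bag <- identical(names(a), names(b)) && all(as.integer(a) == as.integer(b))
print(same_bag)
```

Two prefixes that differ only in word order end up with the same representation in this sketch, which is one reason a next-word predictor built this way tends to lose track of the sentence structure.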
I'm not aware of any Transformer models built using R though.
That tokenizers package is very interesting. I will try that idea!
Note, however, that this will probably also generate quite some poetry with unexpected conjugations.
Regarding transformers, the only C++ Transformer model I know of that I could make an Rcpp wrapper around is at https://github.com/marian-nmt/marian. It requires GPU power to train, however. If someone has other links, I would be happy to make an Rcpp wrapper if something C++-like exists.
Thanks for all the insights!
I was wondering, would it be possible to build a word predictor model with Ruimtehol? That is, a prediction of the most likely next word when a sequence of words (part of a sentence) is given?
I was thinking of the label prediction algorithm (tagspace, if I am correct). But then we would have to feed the model all possible parts of a sentence, with all next words as labels. I am not sure if that's the way to go. Is there an easier way?
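To make the idea concrete, here is a minimal base R sketch of what "all possible parts of a sentence with the next word as label" could look like for a single sentence. The helper `make_prefix_pairs` and the example tokens are made up for illustration; they are not part of ruimtehol.

```r
# Hypothetical helper (not part of ruimtehol): turn one tokenised sentence
# into (prefix, next word) training pairs.
make_prefix_pairs <- function(tokens) {
  # A sentence with fewer than two tokens has no next word to predict
  if (length(tokens) < 2) {
    return(data.frame(text = character(0), label = character(0),
                      stringsAsFactors = FALSE))
  }
  n <- length(tokens) - 1
  data.frame(
    # Prefix j consists of the first j tokens
    text  = vapply(seq_len(n), function(j) paste(tokens[1:j], collapse = " "), character(1)),
    # The label of prefix j is token j + 1
    label = tokens[-1],
    stringsAsFactors = FALSE
  )
}

pairs <- make_prefix_pairs(c("de", "minister", "antwoordt", "morgen"))
print(pairs)
```

The resulting `text` and `label` columns are the kind of input that could then be passed as `x` and `y` to a tagspace model, with the caveat that the number of rows grows quickly with sentence length.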
Many thanks in advance!