bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Space optimization #26

Closed (pgrandinetti closed this 6 years ago)

pgrandinetti commented 6 years ago

Do you have a recommendation on how to manage the memory taken by executing as.data.frame(udpipe_annotate(udmodel, x = documents))? Is there anything similar to database-backed storage, like (I think) tm provides?

When I run that line on a collection of 10 documents, which are quite normal blog articles, the resulting data frame takes about 2.5MB in memory. Of course, I am planning to do this on many more documents and am unsure whether memory will become an issue.

P.S. For this small sample, where the resulting data frame is 2.5MB with 20748 obs. of 14 variables, the computation takes approx. 20 seconds. That seems like a lot to me; is it in line with your benchmarks?

Thanks!

jwijffels commented 6 years ago

Choose any DB that you like to store the annotated data; this R package does not and will not provide that. If you have 1 million documents of 200 words each, that will generate 200 million records. I tend to parallelise the annotation and then store the result in a database, but whether that is needed depends on the number of records I expect; you can easily do the multiplication.

About speed: please see page 10 of the paper at http://ufal.mff.cuni.cz/%7Estraka/papers/2017-conll_udpipe.pdf, which reports the speed in words/second; you can easily compare that to your dataset. If you want to parallelise things to speed them up, see the example at https://github.com/bnosac/udpipe/issues/24 (a sketch follows below), or request only a part of the annotation as documented at https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html
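
A minimal sketch of that parallel approach, assuming documents is a character vector of texts and model_file is the path to a downloaded model (neither is defined in this thread). The model is reloaded inside each worker, since the loaded model object holds an external pointer that cannot be serialized to worker processes; mclapply forks, so this runs on Linux/macOS. Passing parser = "none" requests only part of the annotation, skipping dependency parsing, typically the slowest step.

library(udpipe)
library(parallel)
## assumptions: documents is a character vector of texts,
## model_file the path returned by e.g. udpipe_download_model(...)$file_model
chunks <- split(documents, cut(seq_along(documents), breaks = 4))
anno <- mclapply(chunks, mc.cores = 4, FUN = function(txt) {
  model <- udpipe_load_model(model_file)   # reload the model per worker
  x <- udpipe_annotate(model, x = txt, parser = "none", trace = 100)
  as.data.frame(x, term_id = TRUE)
})
anno <- do.call(rbind, anno)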

I also tend to use these arguments: trace = 100 prints progress every 100 annotated documents, and term_id = TRUE adds an integer column with a sequential number running from 1 up to the number of terms in the document. This is probably something you need if you work with databases.

library(udpipe)
## udmodel: a model loaded with udpipe_load_model(); txt: a character vector
x <- udpipe_annotate(udmodel, x = txt, trace = 100)
x <- as.data.frame(x, term_id = TRUE)
str(x)
'data.frame':   111 obs. of  15 variables:
 $ doc_id       : chr  "doc1" "doc1" "doc1" "doc1" ...
 $ paragraph_id : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sentence_id  : int  1 1 2 2 2 2 2 2 2 2 ...
 $ sentence     : chr  "Dus." "Dus." "Godvermehoeren met pus in alle puisten, zei die schele van Van Bukburg en hij had nog gelijk ook." "Godvermehoeren met pus in alle puisten, zei die schele van Van Bukburg en hij had nog gelijk ook." ...
 $ term_id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ token_id     : chr  "1" "2" "1" "2" ...
 $ token        : chr  "Dus" "." "Godvermehoeren" "met" ...
 $ lemma        : chr  "dus" "." "Godvermehoeren" "met" ...
 $ upos         : chr  "PROPN" "PUNCT" "VERB" "ADP" ...
 $ xpos         : chr  NA NA NA NA ...
 $ feats        : chr  "Gender=Com|Number=Sing" NA "VerbForm=Inf" NA ...
 $ head_token_id: chr  "0" "1" "0" "3" ...
 $ dep_rel      : chr  "root" "punct" "root" "case" ...
 $ deps         : chr  NA NA NA NA ...
 $ misc         : chr  "SpaceAfter=No" NA NA NA ...

pgrandinetti commented 6 years ago

I see, thanks. Do you mind showing a minimal example of how you store annotated data in your favorite DB? Or do you simply store the entire data.table, column by column? The whole point would then be to have R functions that perform optimized queries, in R. If you have to either load the entire object into R memory every time, or write the SQL queries yourself, then it's not terribly useful (although it makes sense, and I think it would still be very useful to me!). Cheers

jwijffels commented 6 years ago

library(RPostgreSQL)
## mycon: an open connection; the details below are placeholders, adjust them
mycon <- dbConnect(PostgreSQL(), dbname = "mydb")
dbWriteTable(mycon, name = "yourtablename", value = x)

Please look online for how to work with databases if you plan to store data in them. There are many, many tutorials on that.
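
To avoid pulling the whole table back into R, filtering and aggregation can be pushed to the database; a minimal sketch, assuming the connection and table from the snippet above (the query itself is purely illustrative):

## the database does the aggregation; only the result enters R memory
nouns <- dbGetQuery(mycon,
                    "SELECT doc_id, lemma, COUNT(*) AS freq
                     FROM yourtablename
                     WHERE upos = 'NOUN'
                     GROUP BY doc_id, lemma")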