bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Using udpipe without creating file on disk #79

Closed. EmilHvitfeldt closed this issue 3 years ago.

EmilHvitfeldt commented 3 years ago

I'm trying to use {udpipe} but my use-case doesn't allow me to depend on having a file on disk to read from.

Is it possible to use {udpipe} and have the .udpipe model saved in the R environment instead of in a file?

I have tried reading the .udpipe file into a character vector and writing it to a temporary file when needed, but udpipe() doesn't like that.

library(udpipe)

udmodel <- udpipe_download_model(language = "dutch")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.4/master/inst/udpipe-ud-2.4-190531/dutch-alpino-ud-2.4-190531.udpipe to /private/var/folders/m0/zmxymdmd7ps0q_tbhx0d_26w0000gn/T/RtmpEd9tDj/reprexe2d274ccfb69/dutch-alpino-ud-2.4-190531.udpipe
#> Visit https://github.com/jwijffels/udpipe.models.ud.2.4 for model license details

xxx <- readr::read_lines(udmodel$file_model)

temp_file <- tempfile()

writeLines(xxx, temp_file)

udmodel$file_model <- temp_file

x <- udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.",
            object = udmodel)
#> Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger, : external pointer is not valid

Created on 2020-09-14 by the reprex package (v0.3.0)

jwijffels commented 3 years ago

udpipe models are binary objects. For example, udpipe_train writes the model through a stream opened with std::ofstream::binary, as in https://github.com/bnosac/udpipe/blob/master/src/rcpp_udpipe.cpp#L154. udpipe_load_model takes that binary object and loads it. There is a method which loads it from a file.

EmilHvitfeldt commented 3 years ago

Oh! It is a binary file! Thank you. I got it working now:

library(udpipe)

udmodel <- udpipe_download_model(language = "dutch")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.4/master/inst/udpipe-ud-2.4-190531/dutch-alpino-ud-2.4-190531.udpipe to /private/var/folders/m0/zmxymdmd7ps0q_tbhx0d_26w0000gn/T/RtmpB19qoj/reprexe6a740205f6/dutch-alpino-ud-2.4-190531.udpipe
#> Visit https://github.com/jwijffels/udpipe.models.ud.2.4 for model license details

# readBin() accepts a file path directly, so no connection is left open
xxx <- readBin(udmodel$file_model, what = "raw", n = file.size(udmodel$file_model))

temp_file <- tempfile()

writeBin(xxx, temp_file)

udmodel$file_model <- temp_file

udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.",
       object = udmodel)
#>    doc_id paragraph_id sentence_id
#> 1    doc1            1           1
#> 2    doc1            1           1
#> 3    doc1            1           1
#> 4    doc1            1           1
#> 5    doc1            1           1
#> 6    doc1            1           1
#> 7    doc1            1           1
#> 8    doc1            1           1
#> 9    doc1            1           1
#> 10   doc1            1           1
#> 11   doc1            1           1
#> 12   doc1            1           1
#> 13   doc1            1           1
#> 14   doc1            1           1
#> 15   doc1            1           1
#> 16   doc1            1           1
#> 17   doc1            1           1
#> 18   doc1            1           1
#>                                                                      sentence
#> 1  Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 2  Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 3  Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 4  Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 5  Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 6  Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 7  Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 8  Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 9  Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 10 Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 11 Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 12 Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 13 Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 14 Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 15 Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 16 Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 17 Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#> 18 Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.
#>    start end term_id token_id     token     lemma  upos
#> 1      1   2       1        1        Ik        ik  PRON
#> 2      4   7       2        2      ging      gaan  VERB
#> 3      9  10       3        3        op        op   ADP
#> 4     12  15       4        4      reis      reis  NOUN
#> 5     17  18       5        5        en        en CCONJ
#> 6     20  21       6        6        ik        ik  PRON
#> 7     23  25       7        7       nam     nemen  VERB
#> 8     27  29       8        8       mee       mee   ADP
#> 9     30  30       9        9         :         : PUNCT
#> 10    32  35      10       10      mijn      mijn  PRON
#> 11    37  42      11       11    laptop    laptop  NOUN
#> 12    43  43      12       12         ,         , PUNCT
#> 13    45  48      13       13      mijn      mijn  PRON
#> 14    50  58      14       14 zonnebril zonnebril  NOUN
#> 15    60  61      15       15        en        en CCONJ
#> 16    63  66      16       16      goed      goed   ADJ
#> 17    68  73      17       17    humeur    humeur  NOUN
#> 18    74  74      18       18         .         . PUNCT
#>                                           xpos
#> 1                 VNW|pers|pron|nomin|vol|1|ev
#> 2                                WW|pv|verl|ev
#> 3                                      VZ|init
#> 4                   N|soort|ev|basis|zijd|stan
#> 5                                     VG|neven
#> 6                 VNW|pers|pron|nomin|vol|1|ev
#> 7                                WW|pv|verl|ev
#> 8                                       VZ|fin
#> 9                                          LET
#> 10 VNW|bez|det|stan|vol|1|ev|prenom|zonder|agr
#> 11                  N|soort|ev|basis|zijd|stan
#> 12                                         LET
#> 13 VNW|bez|det|stan|vol|1|ev|prenom|zonder|agr
#> 14                  N|soort|ev|basis|zijd|stan
#> 15                                    VG|neven
#> 16                     ADJ|prenom|basis|zonder
#> 17                   N|soort|ev|basis|onz|stan
#> 18                                         LET
#>                                  feats head_token_id      dep_rel deps
#> 1       Case=Nom|Person=1|PronType=Prs             2        nsubj <NA>
#> 2  Number=Sing|Tense=Past|VerbForm=Fin             0         root <NA>
#> 3                                 <NA>             4         case <NA>
#> 4               Gender=Com|Number=Sing             2          obl <NA>
#> 5                                 <NA>             7           cc <NA>
#> 6       Case=Nom|Person=1|PronType=Prs             7        nsubj <NA>
#> 7  Number=Sing|Tense=Past|VerbForm=Fin             2         conj <NA>
#> 8                                 <NA>             7 compound:prt <NA>
#> 9                                 <NA>             7        punct <NA>
#> 10               Person=1|PronType=Prs            11    nmod:poss <NA>
#> 11              Gender=Com|Number=Sing             7          obj <NA>
#> 12                                <NA>            14        punct <NA>
#> 13               Person=1|PronType=Prs            14    nmod:poss <NA>
#> 14              Gender=Com|Number=Sing            11         conj <NA>
#> 15                                <NA>            17           cc <NA>
#> 16                          Degree=Pos            17         amod <NA>
#> 17             Gender=Neut|Number=Sing            11         conj <NA>
#> 18                                <NA>             2        punct <NA>
#>               misc
#> 1             <NA>
#> 2             <NA>
#> 3             <NA>
#> 4             <NA>
#> 5             <NA>
#> 6             <NA>
#> 7             <NA>
#> 8    SpaceAfter=No
#> 9             <NA>
#> 10            <NA>
#> 11   SpaceAfter=No
#> 12            <NA>
#> 13            <NA>
#> 14            <NA>
#> 15            <NA>
#> 16            <NA>
#> 17   SpaceAfter=No
#> 18 SpacesAfter=\\n

Created on 2020-09-14 by the reprex package (v0.3.0)
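
Why the text-based round trip failed while the raw one worked can be sketched with base R alone. The snippet below is only an illustration: the six-byte `payload` is a made-up stand-in for a model file, `read_raw()` is a hypothetical helper, and base `readLines()` is used in place of `readr::read_lines()` from the first attempt. Line-oriented functions interpret and rewrite line endings, so the bytes that come back out are not guaranteed to match what went in.

```r
# Made-up payload standing in for a binary model file: "A", CR, LF, "B", CR, "C".
payload <- as.raw(c(0x41, 0x0d, 0x0a, 0x42, 0x0d, 0x43))
src <- tempfile()
writeBin(payload, src)

# Hypothetical helper: read a whole file as a raw vector.
read_raw <- function(path) readBin(path, what = "raw", n = file.size(path))

# Raw round trip: the bytes are copied unchanged.
bin_copy <- tempfile()
writeBin(read_raw(src), bin_copy)
bin_ok <- identical(read_raw(bin_copy), payload)

# Text round trip: line endings are consumed on read and rewritten on write,
# so the copy no longer matches the original byte for byte.
txt_copy <- tempfile()
writeLines(readLines(src, warn = FALSE), txt_copy)
txt_ok <- identical(read_raw(txt_copy), payload)

print(c(binary = bin_ok, text = txt_ok))
```

A real model file also contains bytes that are not valid text at all, so the damage there is likely to be worse than in this small example.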

jwijffels commented 3 years ago

If you just want the model to be stored in a file with a name of your choosing in the tempdir, why don't you just do this:

f <- udpipe_download_model("english", model_dir = tempdir(), overwrite = FALSE)
temp_file <- tempfile()
file.copy(from = f$file_model, to = temp_file)

model <- udpipe_load_model(temp_file)
x <- setNames(c("Just showing an example", "And another one"), c("a", "b"))
udpipe(x = x, object = model)

I see you are mainly interested in the tokenizer part of textrecipes, as you recently added tokenizers.bpe. Just a note: the R package sentencepiece is a good candidate for tokenisation in textrecipes as well. Next, the R package word2vec is great for word vectors instead of using GloVe. And there is even crfsuite for named entity recognition. But all of these of course require building a model, which is not what textrecipes is about. It seems to me more like tooling to prepare your data for modelling.

jwijffels commented 3 years ago

And another note after looking at your code at https://github.com/tidymodels/textrecipes/blob/9fd38e2f68c4cd5fe448613841cdd409cc111e56/R/tokenizer-udpipe.R. I don't know when the function generator in textrecipes is used, but know that if you use udpipe::udpipe, by default it looks up an already loaded model in .loaded_models (https://github.com/bnosac/udpipe/blob/master/R/pkg.R#L11) so that the model does not need to be loaded again. Your code seems to reload the model each time the function generator is called.
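
The general idea of looking up an already loaded model instead of loading it again can be sketched as a small cache keyed by file path. This is not udpipe's actual implementation: `model_cache`, `load_model_once()` and `fake_loader()` are hypothetical names, and the fake loader only counts calls so the sketch runs without udpipe installed.

```r
# Hypothetical cache of loaded models, keyed by file path.
model_cache <- new.env(parent = emptyenv())
load_calls <- 0L

# Stand-in for an expensive loader such as udpipe::udpipe_load_model().
fake_loader <- function(path) {
  load_calls <<- load_calls + 1L
  list(file = path)
}

# Load a model only if it is not in the cache yet, then return the cached copy.
load_model_once <- function(path, loader = fake_loader) {
  if (!exists(path, envir = model_cache, inherits = FALSE)) {
    assign(path, loader(path), envir = model_cache)
  }
  get(path, envir = model_cache, inherits = FALSE)
}

m1 <- load_model_once("english.udpipe")
m2 <- load_model_once("english.udpipe")
print(load_calls)  # the loader ran only for the first call
```

A function generator that closes over the result of one such call would reuse the same loaded model for every batch of text it tokenizes.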

EmilHvitfeldt commented 3 years ago

I need the file to be stored in the R object itself. Then I simply write it to a temp file right before udpipe tries to load it. Works great now.
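
A minimal sketch of that approach, assuming only base R: the model's bytes travel inside an ordinary R object and are written back to a temporary file right before loading. `embed_model()` and `materialize_model()` are hypothetical helpers, not part of udpipe or textrecipes, and the five-byte file is a stand-in for a real .udpipe model.

```r
# Hypothetical helper: read a model file into a raw vector and wrap it in a list.
embed_model <- function(path) {
  list(bytes = readBin(path, what = "raw", n = file.size(path)))
}

# Hypothetical helper: write the embedded bytes to a fresh temp file, return its path.
materialize_model <- function(obj) {
  path <- tempfile(fileext = ".udpipe")
  writeBin(obj$bytes, path)
  path
}

# Stand-in for a downloaded .udpipe file, so the sketch runs without udpipe.
model_file <- tempfile(fileext = ".udpipe")
original_bytes <- as.raw(c(0x00, 0x01, 0x0a, 0xfe, 0xff))
writeBin(original_bytes, model_file)

obj <- embed_model(model_file)

# The object survives serialization, after which the original file can be removed.
rds <- tempfile(fileext = ".rds")
saveRDS(obj, rds)
unlink(model_file)
restored <- readRDS(rds)

# Recreate the file on demand and check that the bytes are unchanged.
restored_path <- materialize_model(restored)
restored_bytes <- readBin(restored_path, what = "raw", n = file.size(restored_path))

# With udpipe installed and a real model, the next step would be:
# model <- udpipe::udpipe_load_model(restored_path)
```

The trade-off is size: the saved object grows by the full size of the model file.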

The way I made it work was by having the user explicitly pass the model to step_tokenize():

library(textrecipes)
library(recipes)
library(modeldata)
data(okc_text)

library(udpipe)

udmodel <- udpipe_download_model(language = "english")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.4/master/inst/udpipe-ud-2.4-190531/english-ewt-ud-2.4-190531.udpipe to /private/var/folders/m0/zmxymdmd7ps0q_tbhx0d_26w0000gn/T/RtmpiJQ34o/reprexf49131fb9589/english-ewt-ud-2.4-190531.udpipe
#> Visit https://github.com/jwijffels/udpipe.models.ud.2.4 for model license details

aa <- recipe(~ essay0, data = okc_text) %>%
  step_tokenize(all_predictors(), 
                engine = "udpipe", 
                training_options = list(model = udmodel)) %>%
  prep() %>%
  bake(new_data = NULL)

aa$essay0
#> <textrecipes_tokenlist[750]>
#>   [1] [227 tokens] [41 tokens]  [458 tokens] [87 tokens]  [90 tokens] 
#> ...
#> [746] [146 tokens] [72 tokens]  [346 tokens] [119 tokens] [56 tokens] 
#> # Unique Tokens: 9965

Created on 2020-09-15 by the reprex package (v0.3.0)

Thank you, both {sentencepiece} and {word2vec} are on the shortlist for the next release.

jwijffels commented 3 years ago

👍