dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

POS tagger #77

Closed dselivanov closed 7 years ago

dselivanov commented 8 years ago

I'm thinking about a POS tagger. At the moment we are limited to the openNLP package, which is a wrapper for Apache OpenNLP and relies on Java/rJava.

It would be great to have a POS tagger without such dependencies. It's worth checking the spaCy implementation - A Good Part-of-Speech Tagger in about 200 Lines of Python. It should not be too hard to implement in C++.
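For reference, the core of the tagger in that blog post is an averaged perceptron over simple context features. Below is a minimal R sketch of the scoring/update step, just to make the idea concrete - the feature set and data layout are my own simplifications (not spaCy's code), and the weight-averaging part of the algorithm is omitted for brevity:

## Score each candidate tag: sum the weights the active features assign to it
score_tags <- function(feats, weights, tags) {
  s <- setNames(numeric(length(tags)), tags)
  for (f in feats) {
    if (!is.null(weights[[f]])) s <- s + weights[[f]]
  }
  s
}

## One perceptron update: if the best-scoring tag is wrong, reward the true
## tag and penalize the guess on every active feature
perceptron_update <- function(weights, feats, truth, tags) {
  guess <- names(which.max(score_tags(feats, weights, tags)))
  if (guess != truth) {
    for (f in feats) {
      if (is.null(weights[[f]]))
        weights[[f]] <- setNames(numeric(length(tags)), tags)
      weights[[f]][truth] <- weights[[f]][truth] + 1
      weights[[f]][guess] <- weights[[f]][guess] - 1
    }
  }
  weights
}

## Illustrative context features for token i: the word itself, a short
## suffix and the previously predicted tag
features <- function(words, i, prev_tag) {
  w <- words[i]
  c(paste0("word=", w),
    paste0("suffix=", substring(w, max(1, nchar(w) - 2))),
    paste0("prev_tag=", prev_tag))
}

Training then iterates over a hand-tagged corpus a few times, shuffling between passes and averaging the weights over all updates.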

dselivanov commented 8 years ago

From #99

thoughts by pommedeterresautee:

I agree that building a POS tagger would imply lots of work, and it would probably change the spirit of text2vec in some way. Why not leverage the one from TreeTagger? I have never used it, but POS tagging seems to be a technology that is mostly mastered (I mean, outside of corner cases). They have models for several languages and an interface for R. Supported languages: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

sharing by dominiqueemmanuel:

For my part, I'm using FreeLing (wrapped in R) for POS tagging and lemmatization. http://nlp.lsi.upc.edu/freeling/node/1 My wrapper is very, very simple:

  • I suppose FreeLing is correctly installed:
apt-get install libboost-regex-dev
apt-get install libicu-dev
apt-get install zlib1g-dev
apt-get install libboost-system-dev
apt-get install libboost-program-options-dev
apt-get install libboost-thread-dev
apt-get install libboost-filesystem-dev
wget https://github.com/TALP-UPC/FreeLing/releases/download/4.0/freeling-4.0-xenial-amd64.deb
dpkg -i freeling-4.0-xenial-amd64.deb
  • I have written a function which takes a string vector (to be POS-tagged / lemmatized): the function writes the input to a file, makes a system call to the main FreeLing command (analyze), and reads the results back into R (see examples). Here is the code:
lemmatisation <- function(txt, lang = "fr") {
  ## Name of the input file
  testtxt_in_name <- tempfile()
  ## The input file connection (in order to specify the encoding)
  testtxt_in <- file(testtxt_in_name, encoding = "UTF-8")
  ## Write the input string vector to the input file
  writeLines(txt, testtxt_in)
  close(testtxt_in)
  ## Name of the output file
  testtxt_out_name <- tempfile()
  ## Definition of the command: analyze reads from stdin and writes to
  ## stdout, so both files are attached via shell redirection
  ## (TODO: check whether this instruction is cross-platform)
  command <- paste0("analyze -f ", lang, ".cfg < ", testtxt_in_name,
    " > ", testtxt_out_name)
  ## System call (TODO: check whether this instruction is cross-platform)
  res <- system(command)
  ## On any error, the function returns NULL
  if (res != 0)
    return(NULL)
  ## Read the output file ('word lemma tag probability' per line)
  out <- readLines(testtxt_out_name, encoding = "UTF-8")
  return(out)
}

NB: if you adapt the system call with some options, you can get more out of FreeLing, in particular more detailed POS tagging (see the manual). I've chosen simple options because I'm focused on lemmatization.

  • Some examples:
lemmatisation(c("une première phrase",
                "une seconde phrase étudiée",
                "avec un mot inexistant xdsd")
              )
# [1] "une un DI0FS0 0.97167"             "première premier AQ0FS00 0.874669"
# [3] "phrase phrase NCFS000 0.960317"    "une un DI0FS0 0.97167"
# [5] "seconde 2 AO0FS00 0.585271"        "phrase phrase NCFS000 0.960317"
# [7] "étudiée étudier VMP00SF 1"         "avec avec SP 0.999892"
# [9] "un un DI0MS0 0.956036"             "mot mot NCMS000 1"
# [11] "inexistant inexistant AQ0MS00 1"   "xdsd xdsd NCMS000 0.489622"
# [13] ""
lemmatisation(c("a studied sentence","more words"),lang="en")
# [1] "a a DT 0.998827"               "studied study VBN 0.640845"
# [3] "sentence sentence NN 0.989071" "more more DT 0.339848"
# [5] "words word NNS 0.998188"       ""

Of course this wrapper can be improved. In particular, I realize that the returned vector doesn't preserve the length of the input (the main reason is to make just one call to FreeLing) => this should be corrected.
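One way to fix that (an untested sketch on top of the lemmatisation() function above; the marker is an arbitrary string assumed not to occur in the corpus) is to interleave a unique marker token between documents and split FreeLing's output on it:

lemmatisation_by_doc <- function(txt, lang = "fr", marker = "endofdocqzx") {
  ## Interleave the marker between documents: doc1, marker, doc2, marker, ...
  interleaved <- head(as.vector(rbind(txt, marker)), -1)
  out <- lemmatisation(interleaved, lang)
  if (is.null(out)) return(NULL)
  ## FreeLing echoes the marker as its own analysis line; split on those lines
  is_marker <- grepl(marker, out, fixed = TRUE)
  unname(split(out[!is_marker], cumsum(is_marker)[!is_marker]))
}

This returns a list with one character vector of analysis lines per input document, so lengths stay aligned while still making a single FreeLing call.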

remark by pommedeterresautee:

Before anything, keep in mind that I am not an IT-law expert, and right now I have no access to a computer, so I can't go deeper in my analysis. FreeLing is under the GNU Affero license (AGPL). This is a very specific license which tends to contaminate the whole project it is included in (I think that means text2vec and any project where text2vec is used). As text2vec is under the MIT license, it's something to keep in mind. Please don't hesitate to check the exact consequences - I just write it here now so that, if I am right, any decision is taken with full information.

Not related at all: some Python code which implements the phrase detector from word2vec - it may be easier to convert to R than the pure C version. https://github.com/travisbrady/word2phrase

randomgambit commented 7 years ago

hello! any updates on this? thanks!

dselivanov commented 7 years ago

Unfortunately not. I have some drafts, but they are far from finished. We need a hero who can spend some time on this task.

randomgambit commented 7 years ago

I nominate @dselivanov ! :D

jwijffels commented 7 years ago

Maybe this is a solution. It's a wrapper around UDPipe: https://github.com/bnosac/udpipe. Currently it only has Rcpp and data.table as dependencies. Models are available at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2364 (see https://ufal.mff.cuni.cz/udpipe).
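Basic usage looks roughly like this (a sketch based on the package's README at the time; function names may have changed in later releases):

library(udpipe)
## download a pre-trained model once, then load it
dl <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(file = dl$file_model)
## annotate: returns CoNLL-U output convertible to a data.frame
x <- udpipe_annotate(ud_model, x = c("a studied sentence", "more words"))
head(as.data.frame(x))  # one row per token: token, lemma, upos, xpos, ...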

dselivanov commented 7 years ago

@jwijffels super, will try !

randomgambit commented 7 years ago

@dselivanov @jwijffels this package is GREAT!!!! cannot wait to have it included in text2vec.

dselivanov commented 7 years ago

I've tried UDPipe - it looks great and is quite comprehensive. One issue I've found so far: it is pretty slow, ~1.7k tokens per second per core, which is not that good...

pommedeterresautee commented 7 years ago

Which task?

dselivanov commented 7 years ago

That number is for the full pipeline - tokenization, POS tagging, lemmatization and dependency parsing. There is no option to exclude any of the steps.

@pommedeterresautee check the paper for details - they report roughly the same numbers. The authors also mention 6.5k tokens per second per core for tokenization and lemmatization only.
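A rough way to reproduce such a throughput number yourself (a sketch; it assumes ud_model is loaded as in the snippet above, and results depend heavily on hardware and model):

txt <- rep("The quick brown fox jumps over the lazy dog.", 1000)
elapsed <- system.time(
  ann <- as.data.frame(udpipe_annotate(ud_model, x = txt))
)["elapsed"]
nrow(ann) / elapsed  # approximate tokens per second for the full pipeline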

randomgambit commented 7 years ago

huuum... I wonder if this is really a problem. At the end of the day, POS tagging is just some pre-processing of the data. Once it's done, it's done.

dselivanov commented 7 years ago

It is! But still, try to apply it to the toy movie-review dataset from text2vec - it will take quite a while: ~15 minutes.

Anyway, this package seems to be the best option at the moment!

randomgambit commented 7 years ago

@dselivanov have you checked https://cran.r-project.org/web/packages/openNLP/index.html ? It seems super complete and well known, but udpipe is more user-friendly.

dselivanov commented 7 years ago

On the other side, I see spaCy is at ~10k words per second, which is not much better: https://twitter.com/spacy_io/status/901776805857808384 I've seen openNLP - it is Java-based, with all the consequences.

randomgambit commented 7 years ago

@dselivanov what are these consequences? :) I like how easily I got a tidy data frame out of udpipe.

dselivanov commented 7 years ago

Painful installation, memory consumption, etc. Also it doesn't look like it's widely used in industry.

randomgambit commented 7 years ago

@dselivanov OK, got it. So udpipe is good, but I'd never heard of it before - how serious/good is the tagging? Also, on a side note, POS is quite an unfortunate acronym hahaha

randomgambit commented 7 years ago

I'll do some testing on some text data I have where the subjects are tagged manually, and see how it compares. But my feeling is that the package is good.

randomgambit commented 7 years ago

Easy for quick testing: http://lindat.mff.cuni.cz/services/udpipe/

pommedeterresautee commented 7 years ago

For what it's worth, this is the first time I've seen a French lemmatizer work without dirty code... Just for that it's interesting (and it works for many other languages).

randomgambit commented 7 years ago

Yes, my dear pomme frite. I agree with that.

jwijffels commented 7 years ago

Benchmarking the speed of the R interface to the following text annotators is also on my todo list.

To my knowledge, however, there aren't any other taggers with this scope (no dependencies on Java/Python, support for multiple languages).

I'll add the option to leave out some parts of the annotation process (e.g. dependency parsing or the POS tagger/lemmatiser) and some more options related to the tagging before trying to release to CRAN (the pkg libs folder is 27 MB in size, so I'm hoping that CRAN will be liberal on this).

In the next release, training models based on CoNLL-U data will be included, so that you can work directly on top of the data from universaldependencies.org instead of relying on the CC-BY-NC-SA models.
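For context, the training interface that eventually shipped in the package looks roughly like this (a sketch from a later release; the file paths are placeholders):

m <- udpipe_train(file = "my_model.udpipe",
                  files_conllu_training = "train.conllu",
                  files_conllu_holdout  = "dev.conllu")
ud_model <- udpipe_load_model("my_model.udpipe")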

How do you plan to include a POS tagger/lemmatizer/dependency parser inside text2vec - as part of a processing flow, or rather as functionality on top of already-annotated tidy data?

dselivanov commented 7 years ago

@jwijffels I don't see a reason to make text2vec depend on POS tagging. text2vec's syntax is flexible enough to allow incorporating the whole pipeline into the preprocessing/tokenization step.

What does make sense is to create a tutorial/documentation on how to use udpipe together with text2vec. What do you think?
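Such an integration could look like this (an untested sketch: lemma_tokenizer is a hypothetical helper, and ud_model is assumed loaded as in the udpipe snippet above):

library(text2vec)
library(udpipe)

## Hypothetical tokenizer returning one character vector of lemmas per document
lemma_tokenizer <- function(x) {
  ann <- as.data.frame(udpipe_annotate(ud_model, x = x))
  ## keep documents in input order - split() would sort doc_id alphabetically
  unname(split(ann$lemma, factor(ann$doc_id, levels = unique(ann$doc_id))))
}

data("movie_review", package = "text2vec")
it  <- itoken(movie_review$review, tokenizer = lemma_tokenizer)
v   <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))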

jwijffels commented 7 years ago

Regarding integration, I don't think text2vec should depend on any annotation/tagging package, but it should indeed offer a flexible way to integrate POS-tagged results. As POS tagging & dependency parsing always take quite some time, I normally run them once, or launch them as triggers in a database that stores the data. That is my preferred way of doing the analysis, rather than putting tagging into a pipeline - otherwise you wait on the tagger every run, as you already indicated with your toy dataset. Most of the time you re-run several different topic models (or whatever model) on the same POS-tagged dataset, and tagging is something you normally do only once - at least if your text data is not so messy that you need different tagging runs for differently cleaned text.
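In code, that annotate-once pattern is just (a sketch; the file name is a placeholder):

ann <- as.data.frame(udpipe_annotate(ud_model, x = movie_review$review))
saveRDS(ann, "movie_review_annotated.rds")
## later sessions reuse the stored table and skip tagging entirely
ann <- readRDS("movie_review_annotated.rds")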

A tutorial is certainly a good thing. I think the following tutorials should be good

jwijffels commented 7 years ago

FYI: the udpipe R package was released on CRAN last Friday (1 September 2017): https://CRAN.R-project.org/package=udpipe

dselivanov commented 7 years ago

Congrats! Closing the ticket. For the future, we'll need to write a couple of tutorials, as discussed above.

randomgambit commented 6 years ago

hi @dselivanov, any update on this? Do you have some tutorials in mind? I was reading the comments above and you mention NLP tools "used in the industry" - which ones are you talking about? CoreNLP?