amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0

Lemmatization #31

Open etrain opened 9 years ago

etrain commented 9 years ago

May be unnecessary for Release 0.1

ngarneau commented 8 years ago

Hello, are you still planning to create a node for lemmatization, given that it's already provided by CoreNLPFeatureExtractor? (Same question for NER and POS tagging.)

etrain commented 8 years ago

Hi there,

There are a couple of weaknesses with our current use of CoreNLP:

1) Performance: While CoreNLP is pretty quick, it does take some time to initialize, and given the library's structure it makes sense to batch as many analyses as you can into a single pass over a document. This is the strategy we take in CoreNLPFeatureExtractor.

2) Licensing: CoreNLP is licensed GPLv3 - we are linking against it and not selling KeystoneML as proprietary software, so this is fine, but it may not be acceptable for all of our users.

As a result, I'd like to limit reliance on CoreNLP going forward.

I'm unfamiliar with the current state of the art in lemmatization, but if there's a JVM-based implementation of standard techniques that is both 1) business-friendly in licensing (Apache or BSD preferred), and 2) reasonably high performance, I'd be interested in seeing it integrated with KeystoneML.

Alternatively, if you want to take a shot at implementing something like this as the first step in more extensive NLP support, we'd welcome such a PR.

ngarneau commented 8 years ago

Hey Evans,

I took a quick look for JVM libraries that might be of interest and found Epic (https://github.com/dlwh/epic), which is written in Scala and licensed under the Apache License, Version 2.0.

As far as I can see from the repo, you are already using Breeze from ScalaNLP, and Epic is a subproject of ScalaNLP.

They already have NER and a POS tagger implemented, and a lemmatizer would need the POS tagger anyway, so we could build on that.

I'll take a deeper look over the next few days to see whether the implementation is near the state of the art, if you think that would be interesting.

Cheers

etrain commented 8 years ago

The ScalaNLP stuff is great - and comes out of David Hall/Dan Klein's work, so I expect it to be quite modern. Kind of a bummer that it doesn't include lemmatization out of the box. It would be good to get a sense of how it performs vs. CoreNLP (from both a statistical and a throughput perspective).

That said - for basic lemmatization we might consider taking a page from the Python nltk playbook and using WordNet. JWI (http://projects.csail.mit.edu/jwi/), the MIT Java WordNet Interface, provides access to WordNet from the JVM. It is licensed CC-BY 4.0.
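
For illustration only, here is a rough sketch of what WordNet-style ("morphy"-like) lemmatization looks like: check a per-POS list of irregular forms, then try suffix rewrite rules and keep the first candidate found in the lexicon. This is not KeystoneML, JWI, or nltk code. The class name `MorphyStyleLemmatizer`, the tiny in-memory lexicon, the exception lists, and the reduced rule set are all hypothetical stand-ins; a real version would look words up in WordNet instead.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a WordNet-style lemmatizer; not part of any library.
final class MorphyStyleLemmatizer {

    // Toy stand-in for the WordNet lexicon, keyed by POS ("n" = noun, "v" = verb).
    private static final Map<String, Set<String>> LEXICON = Map.of(
        "n", Set.of("dog", "church", "city", "goose"),
        "v", Set.of("run", "make", "study", "be"));

    // Irregular forms that suffix rules cannot recover (assumed sample entries).
    private static final Map<String, Map<String, String>> EXCEPTIONS = Map.of(
        "n", Map.of("geese", "goose"),
        "v", Map.of("ran", "run", "was", "be"));

    // Ordered {suffix, replacement} rules per POS; a small subset for illustration.
    private static final Map<String, String[][]> RULES = Map.of(
        "n", new String[][] {{"ches", "ch"}, {"ies", "y"}, {"s", ""}},
        "v", new String[][] {{"ies", "y"}, {"ing", "e"}, {"ing", ""},
                             {"ed", "e"}, {"ed", ""}, {"s", ""}});

    // Returns the lemma for `word` under the given POS, or the lowercased word
    // unchanged when neither the exceptions nor the rules yield a known lemma.
    static String lemmatize(String word, String pos) {
        String w = word.toLowerCase(Locale.ROOT);
        Set<String> lexicon = LEXICON.getOrDefault(pos, Set.of());
        if (lexicon.contains(w)) {
            return w; // already a base form
        }
        String irregular = EXCEPTIONS.getOrDefault(pos, Map.of()).get(w);
        if (irregular != null) {
            return irregular;
        }
        for (String[] rule : RULES.getOrDefault(pos, new String[0][])) {
            if (w.endsWith(rule[0])) {
                String candidate =
                    w.substring(0, w.length() - rule[0].length()) + rule[1];
                if (lexicon.contains(candidate)) {
                    return candidate; // first rule that produces a known word wins
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String[] pair : new String[][] {{"churches", "n"}, {"making", "v"}, {"ran", "v"}}) {
            System.out.println(pair[0] + "/" + pair[1] + " -> " + lemmatize(pair[0], pair[1]));
        }
    }
}
```

The POS argument matters because the same surface form can lemmatize differently as a noun or a verb, which is why a POS tagger would sit upstream of a node like this.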