Design wrapper for StanfordNLP code base

kylepjohnson commented 5 years ago

Newly released Python library from Stanford. Seems like a replacement for their neural net Java tools from a few years back.

Homepage: https://stanfordnlp.github.io/stanfordnlp/
Language modules: https://stanfordnlp.github.io/stanfordnlp/installation_download.html#human-languages-supported-by-stanfordnlp

In-scope languages for which they have models:

Ancient Greek, Perseus
Ancient Greek, PROIEL
Latin, ITTB
Latin, Perseus
Latin, PROIEL
Old Church Slavonic, PROIEL
Old French, SRCMF

Others (Arabic, Hebrew) might be worth a shot, but preferably we'd do some performance testing on how they do on pre-modern texts.

A few highlights:

Most important is the dependency parsing this can do. This has been long awaited and will have a huge impact on scholarship, I am sure.
Dependency: https://stanfordnlp.github.io/stanfordnlp/depparse.html
POS: https://stanfordnlp.github.io/stanfordnlp/pos.html
Lemmas: https://stanfordnlp.github.io/stanfordnlp/lemma.html
General usage (which is not very thorough with examples): https://stanfordnlp.github.io/stanfordnlp/pipeline.html#usage
"Data objects": https://stanfordnlp.github.io/stanfordnlp/data_objects.html#word. I like this idea of defining all the possible "things" we can output, as we have offerings that are a little broader, including scansion units, phonetic units, and so on.

The big, soul-searching questions I see rising from this:

We should face the fact that this tool could be so powerful that we might want to make significant alterations to our own pipeline, in order to accommodate it.
How do we merge our main NLP object and pipeline with theirs? ML parsers only work for languages with treebanks, and there are important NLP tasks which are not annotated in treebanks -- meaning there is still an important need for the CLTK. This issue might be solved by having a firm definition of our own "data objects" and a functional pipeline of our own. Then their pipeline is just one step in ours, and the results that stanfordnlp.Pipeline() returns are carefully parsed and cast into the CLTK pipeline's object paradigm. Since their data objects are very minimalistic, the load is lightened.
This is PyTorch, and if we want to use this library we simply have to accept that. Within the CLTK, we would want to consider whether we make PyTorch the preferred DL library or also allow Keras (whose elegance will make for easier code maintenance). There's an enormous thread on this topic at reddit ML: https://www.reddit.com/r/MachineLearning/comments/6bicfo/d_keras_vs_PyTorch/.
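
To make the "their pipeline is just one step in our own" idea concrete, here is a minimal sketch. The `Word` dataclass, `cast_stanford_doc`, and `run_pipeline` are hypothetical names, not existing CLTK code. The attribute names read off the Stanford side (`sentences`, `words`, `index`, `text`, `lemma`, `upos`, `governor`, `dependency_relation`) are taken from my reading of the stanfordnlp "Data objects" page and should be checked against the installed version. A stand-in object is used below, so the snippet runs without stanfordnlp or its models.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Callable, List, Optional


@dataclass
class Word:
    """Hypothetical CLTK-side data object; field names are illustrative only."""
    index: int
    string: str
    lemma: Optional[str] = None
    pos: Optional[str] = None
    governor: Optional[int] = None           # index of the head word; 0 marks the root
    dependency_relation: Optional[str] = None
    scansion: Optional[str] = None           # filled by CLTK-only steps, not by treebank models


def cast_stanford_doc(stanford_doc: Any) -> List[List[Word]]:
    """Cast a stanfordnlp-style document into lists of CLTK Word objects.

    Assumes the document exposes .sentences, each sentence exposes .words,
    and each word carries the attributes read below.
    """
    sentences = []
    for sentence in stanford_doc.sentences:
        words = [
            Word(
                index=int(w.index),
                string=w.text,
                lemma=getattr(w, "lemma", None),
                pos=getattr(w, "upos", None),
                governor=getattr(w, "governor", None),
                dependency_relation=getattr(w, "dependency_relation", None),
            )
            for w in sentence.words
        ]
        sentences.append(words)
    return sentences


# A CLTK step takes our own objects and returns our own objects.
Step = Callable[[List[List[Word]]], List[List[Word]]]


def run_pipeline(stanford_doc: Any, steps: List[Step]) -> List[List[Word]]:
    """Cast the Stanford output first, then apply CLTK-only steps in order."""
    data = cast_stanford_doc(stanford_doc)
    for step in steps:
        data = step(data)
    return data


# Stand-in for what stanfordnlp.Pipeline() would return (toy data, not model output).
fake_doc = SimpleNamespace(sentences=[SimpleNamespace(words=[
    SimpleNamespace(index="1", text="arma", lemma="arma", upos="NOUN",
                    governor=2, dependency_relation="obj"),
    SimpleNamespace(index="2", text="cano", lemma="cano", upos="VERB",
                    governor=0, dependency_relation="root"),
])])

cltk_sentences = run_pipeline(fake_doc, steps=[])
```

The point of the design is that everything downstream sees only `Word`, so steps that treebank models cannot provide (scansion, phonetics) slot into `steps` without knowing where the earlier annotations came from.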

free-variation commented 5 years ago

@kylepjohnson

Shall I volunteer for this one? We're currently evaluating spaCy against the pythonized coreNLP at work, because we have a pile of clojure/coreNLP code in production that we might be extending if not replacing with the new libs, so I'm getting some exposure.

Early review is that performance is fairly lousy compared to the JVM classes (not all of which are neural, btw).

kylepjohnson commented 5 years ago

Yes please do take a stab at it.

free-variation commented 5 years ago

@kylepjohnson Hey, so what's on your mind about this one? Let's think, say, in terms of the depparsers. There's a whole data structure that comes back, to represent the tree, made up of various classes. It seems like a lot of work to wrap all of those. So instead do we simply expose these stanford objects and let users handle them? Or we could write something much simpler to re-represent trees in a way that lets non-computer scientists manipulate them?

kylepjohnson commented 5 years ago

So instead do we simply expose these stanford objects and let users handle them

In a word, yes, this was my initial intention: just to return the object with minimal processing. This is a minimum requirement, it seems.

Or we could write something much simpler to re-represent trees in a way that lets non-computer scientists manipulate them?

I certainly see value in this, too! And honestly, for everything but dep tags the task doesn't seem too hard (or am I overlooking something?). Even just writing the dep tags into our own namespace isn't so bad -- the tough part would be making them meaningful for non-programmers (eg, nice linking of nodes, prints of tree). Am on phone, can't give a good code example. I'm shooting from the hip on some of this, so please share your full thoughts about how you would go about it.
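
For illustration, a minimal sketch of the "prints of tree" idea. It assumes each word has already been reduced to a plain tuple of (index, text, governor index, dependency relation), with governor 0 marking the root. `render_tree` and the row layout are hypothetical, not part of stanfordnlp or the CLTK, and the rows are toy data rather than parser output.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (index, text, governor index, dependency relation); governor 0 marks the root
Row = Tuple[int, str, int, str]


def render_tree(rows: List[Row]) -> str:
    """Render dependency rows as an indented outline, heads above dependents."""
    # Group every word under the index of its governor.
    children: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        children[row[2]].append(row)

    lines: List[str] = []

    def walk(head: int, depth: int) -> None:
        # Visit dependents in sentence order; assumes the rows form a tree (no cycles).
        for index, text, _, relation in sorted(children[head]):
            lines.append("  " * depth + f"{text} ({relation})")
            walk(index, depth + 1)

    walk(0, 0)
    return "\n".join(lines)


rows = [
    (1, "arma", 3, "obj"),
    (2, "virumque", 1, "conj"),
    (3, "cano", 0, "root"),
]
print(render_tree(rows))
# cano (root)
#   arma (obj)
#     virumque (conj)
```

An outline like this keeps the head-dependent links visible without asking the reader to traverse parser classes.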

free-variation commented 5 years ago

Well, the APIs for the pure python versions are (at present) fairly basic. They're here: https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/pipeline/doc.py

Pure python here means PyTorch neural models, with some extra bits. One can also use the full CoreNLP via Python interfaces to a Java server, but here we want to skip all that.

So I think it does make sense to offer some nicer APIs for trees. I'll propose something in a few.

kylepjohnson commented 5 years ago

@free-variation I am going to open a new issue for bringing your xml tree to the stanfordnlp dependency parser.

cltk / cltkv1

Design wrapper for StanfordNLP code base #5