@kylepjohnson
Shall I volunteer for this one? We're currently evaluating spaCy against the Pythonized CoreNLP at work, because we have a pile of Clojure/CoreNLP code in production that we might be extending, if not replacing, with the new libs, so I'm getting some exposure.
Early review is that performance is fairly lousy compared to the JVM classes (not all of which are neural, btw).
Yes, please do take a stab at it.
@kylepjohnson Hey, so what's on your mind about this one? Let's think, say, in terms of the dependency parsers. A whole data structure comes back to represent the tree, made up of various classes. It seems like a lot of work to wrap all of those. So instead, do we simply expose these Stanford objects and let users handle them? Or do we write something much simpler that re-represents trees in a way that lets non-computer scientists manipulate them?
So instead, do we simply expose these Stanford objects and let users handle them?
In a word, yes, this was my initial intention: just return the object with minimal processing. That seems to be the minimum requirement.
Or do we write something much simpler that re-represents trees in a way that lets non-computer scientists manipulate them?
I certainly see value in this, too! And honestly, for everything but the dep tags the task doesn't seem too hard (or am I overlooking something?). Even just writing the dep tags into our own namespace isn't so bad -- the tough part would be making them meaningful for non-programmers (e.g., nice linking of nodes, pretty-printing of trees). I'm on my phone, so I can only gesture at a code example -- roughly the sketch below. I'm shooting from the hip on some of this, so please share your full thoughts about how you would go about it.
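A minimal sketch of what that re-representation could look like, assuming a from-scratch Node class and CoNLL-U-style (index, text, relation, governor) input tuples -- all names here are hypothetical, just to show the shape of the thing:

```python
# Hypothetical sketch: re-represent a dependency parse as a plain tree
# that non-programmers can walk and print.

class Node:
    def __init__(self, index, text, relation):
        self.index = index        # 1-based token position in the sentence
        self.text = text          # surface form of the token
        self.relation = relation  # dependency relation to the governor
        self.children = []        # dependents of this token

    def pretty(self, depth=0):
        """Return an indented, human-readable rendering of the subtree."""
        lines = ["  " * depth + f"{self.text} ({self.relation})"]
        for child in self.children:
            lines.extend(child.pretty(depth + 1))
        return lines


def build_tree(tokens):
    """Build a tree from (index, text, relation, governor_index) tuples.

    governor_index == 0 marks the root, following CoNLL-U conventions.
    """
    nodes = {i: Node(i, text, rel) for i, text, rel, _ in tokens}
    root = None
    for i, _, _, gov in tokens:
        if gov == 0:
            root = nodes[i]
        else:
            nodes[gov].children.append(nodes[i])
    return root


# Toy example: "Dogs chase cats."
tokens = [(1, "Dogs", "nsubj", 2), (2, "chase", "root", 0), (3, "cats", "obj", 2)]
print("\n".join(build_tree(tokens).pretty()))
```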
Well, the APIs for the pure-Python version are (at present) fairly basic. They're here: https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/pipeline/doc.py
Pure Python here means PyTorch neural models, plus some extra bits. One can also drive the full CoreNLP from Python via a Java server, but here we want to skip all that.
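For reference, exercising that API looks roughly like this (a sketch against the current 0.x release; the attribute names are taken from the doc.py linked above and may shift in later releases):

```python
# Minimal pure-Python pipeline usage; no Java server involved.
import stanfordnlp

stanfordnlp.download('en')             # fetch the default English models (one-time)
nlp = stanfordnlp.Pipeline(lang='en')
doc = nlp("The CLTK wraps the Stanford parser.")

# Each Word is a thin object: text, lemma, upos, governor, etc.
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.index, word.text, word.lemma, word.upos,
              word.governor, word.dependency_relation)
```

Note how flat this is: the dependency structure is only implicit in the governor indices, which is exactly why a tree API on top would help.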
So I think it does make sense to offer some nicer APIs for trees. I'll propose something in a few.
@free-variation I am going to open a new issue for bringing your XML tree to the stanfordnlp dependency parser.
Newly released Python library from Stanford. Seems like a replacement for their neural net Java tools from a few years back.
In-scope languages for which they have models:
Others (Arabic, Hebrew) might be worth a shot, but preferably we'd do some performance testing on how they do on pre-modern texts.
A few highlights:
Most important is the dependency parsing this can do. This has been long awaited and will have a huge impact on scholarship, I am sure. (A usage sketch follows the links below.)
Dependency: https://stanfordnlp.github.io/stanfordnlp/depparse.html
POS: https://stanfordnlp.github.io/stanfordnlp/pos.html
Lemmas: https://stanfordnlp.github.io/stanfordnlp/lemma.html
General usage (the docs are not very thorough with examples): https://stanfordnlp.github.io/stanfordnlp/pipeline.html#usage
"Data objects": https://stanfordnlp.github.io/stanfordnlp/data_objects.html#word. I like this idea of defining all the possible "things" we can output, as we have offerings that are a little broader, including scansion units, phonetic units, and so on.
The big, soul-searching question I see rising from this: should the stanfordnlp.Pipeline() returns be carefully parsed and cast into the CLTK pipeline's object paradigm? Seeing that their data objects are very minimalistic, the load is lightened.
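If we go the casting route, the mapping could be as small as this (CltkWord and from_stanford are hypothetical names, not existing CLTK API; the right-hand side uses stanfordnlp's 0.x Word fields):

```python
# Hypothetical sketch: cast stanfordnlp's minimal Word objects into a
# CLTK-side data object, per the "object paradigm" idea above.
from dataclasses import dataclass


@dataclass
class CltkWord:
    index: int      # 1-based position in the sentence
    string: str     # surface form
    lemma: str
    pos: str        # universal POS tag
    governor: int   # index of the head token (0 = root)
    relation: str   # dependency relation label


def from_stanford(word):
    """Cast one stanfordnlp Word into the CLTK paradigm."""
    return CltkWord(index=int(word.index),
                    string=word.text,
                    lemma=word.lemma,
                    pos=word.upos,
                    governor=int(word.governor),
                    relation=word.dependency_relation)
```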