cltk / cltkv1

Experimental repo for new API CLTK
MIT License

Design wrapper for StanfordNLP code base #5

Closed kylepjohnson closed 5 years ago

kylepjohnson commented 5 years ago

Newly released Python library from Stanford. Seems like a replacement for their neural net Java tools from a few years back.

In-scope languages for which they have models:

Others (Arabic, Hebrew) might be worth a shot, but preferably we'd do some performance testing on how they do on pre-modern texts.

A few highlights:

The big, soul-searching questions I see rising from this:

free-variation commented 5 years ago

@kylepjohnson

Shall I volunteer for this one? We're currently evaluating spaCy against the pythonized CoreNLP at work, because we have a pile of Clojure/CoreNLP code in production that we might be extending, if not replacing, with the new libs, so I'm getting some exposure.

Early review is that performance is fairly lousy compared to the JVM classes (not all of which are neural, btw).

kylepjohnson commented 5 years ago

Yes please do take a stab at it.

free-variation commented 5 years ago

@kylepjohnson Hey, so what's on your mind about this one? Let's think, say, in terms of the depparsers. There's a whole data structure that comes back, to represent the tree, made up of various classes. It seems like a lot of work to wrap all of those. So instead do we simply expose these stanford objects and let users handle them? Or we could write something much simpler to re-represent trees in a way that lets non-computer scientists manipulate them?

kylepjohnson commented 5 years ago

 So instead do we simply expose these stanford objects and let users handle them

In a word, yes, this was my initial intention: just return the object with minimal processing. This seems to be the minimum requirement.

 Or we could write something much simpler to re-represent trees in a way that lets non-computer scientists manipulate them?

I certainly see value in this, too! And honestly, for everything but dep tags the task doesn't seem too hard (or am I overlooking something?). Even just writing the dep tags into our own namespace isn't so bad; the tough part would be making them meaningful for non-programmers (e.g., nice linking of nodes, printing of the tree). I'm on my phone and can't give a good code example. I'm shooting from the hip on some of this, so please share your full thoughts about how you would go about it.


free-variation commented 5 years ago

Well, the APIs for the pure python versions are (at present) fairly basic. They're here: https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/pipeline/doc.py

Pure python here means PyTorch neural models, with some extra bits. One can also use the full CoreNLP via Python interfaces to a Java server, but here we want to skip all that.

So I think it does make sense to offer some nicer APIs for trees. I'll propose something in a few.
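To make the "simpler re-representation" idea concrete, here is a minimal sketch of what such a tree could look like. It assumes each parsed word can be reduced to an (index, text, governor index, relation) tuple, with governor index 0 marking the root as in CoNLL-U. `DepNode`, `build_tree`, and `render` are hypothetical names for illustration only; they are not part of stanfordnlp or CLTK, and the sample parse is hand-written rather than model output.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class DepNode:
    """One word in a dependency tree (hypothetical class, not a stanfordnlp type)."""

    index: int
    text: str
    relation: str
    children: List["DepNode"] = field(default_factory=list)


def build_tree(rows: Iterable[Tuple[int, str, int, str]]) -> Optional[DepNode]:
    """Link (index, text, governor_index, relation) rows into a tree.

    A governor index of 0 marks the root (assumed CoNLL-U convention).
    Returns None if no row is marked as the root.
    """
    rows = list(rows)
    # First pass: one node per word, keyed by its 1-based index.
    nodes = {idx: DepNode(idx, text, rel) for idx, text, _, rel in rows}
    root = None
    # Second pass: attach each word to its governor.
    for idx, _, gov, _ in rows:
        if gov == 0:
            root = nodes[idx]
        else:
            nodes[gov].children.append(nodes[idx])
    return root


def render(node: DepNode, depth: int = 0) -> str:
    """Return an indented, human-readable view of the tree."""
    lines = ["  " * depth + f"{node.text} ({node.relation})"]
    for child in node.children:
        lines.append(render(child, depth + 1))
    return "\n".join(lines)


# Hand-written example parse of the Latin sentence "puella rosam amat".
rows = [
    (1, "puella", 3, "nsubj"),
    (2, "rosam", 3, "obj"),
    (3, "amat", 0, "root"),
]
tree = build_tree(rows)
print(render(tree))
```

A flat node type like this keeps the linking and printing logic in one place, so a user who is not a programmer only ever has to deal with `text`, `relation`, and `children`. Whether the parser's own objects should also be reachable from each node is left open here.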

kylepjohnson commented 5 years ago

@free-variation I am going to open a new issue for bringing your XML tree to the stanfordnlp dependency parser.