brendano / stanford_corenlp_pywrapper


From json to python classes? #26

Closed AbeHandler closed 9 years ago

AbeHandler commented 9 years ago

In using this python wrapper, I am writing a ton of code that parses/sorts through the json output.

For instance, I'm doing this to get the part-of-speech tagged tokens from sentence['parse']:

import re

def get_tokens(parse_string):
    # pull each "(TAG word)" leaf out of the constituency parse string
    pattern = r"((?<=\()([A-Z]+\$?) [^()]+(?=\)))"
    return re.findall(pattern, parse_string)

returns

(u'PRP$ Its', u'PRP$'), (u'RB not', u'RB'), (u'DT a', u'DT'), (u'NN reach', u'NN'), (u'TO to', u'TO'), (u'VB     say', u'VB'), (u'NNP New', u'NNP'), (u'NNP Orleans', u'NNP'), (u'MD may', u'MD'), (u'VB be', u'VB'), (u'NNPS Americas', u'NNPS'), (u'RBS most', u'RBS'), (u'JJ counterintuitive', u'JJ'), (u'NN city', u'NN')

It might be worth it to put another layer on top of the json that turns it into real python classes. The issue is that if the json changes the added layer might break -- but I think that is probably something that could be handled with lots of unit testing. Seems worth it to build -- even as an external library? Right? Going from Stanford NLP output to python objects seems really useful.

brendano commented 9 years ago

hm, you should use the "pos" and "tokens" keys to get POS tags and their associated words. The parse is in there only if you need the nested constituency structure.
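
for example, something like this (just a sketch, assuming a sentence dict from the wrapper's json output where "tokens" and "pos" are parallel lists):

def get_tagged_tokens(sentence):
    # pair each word with its POS tag straight from the parallel lists
    return list(zip(sentence["tokens"], sentence["pos"]))

that gets you the word/tag pairs directly, no regex over the parse string needed.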

yeah, a layer on top is needed for many purposes. the problem i've found is it's never clear what the right structure of the classes should be. stanfordnlp chose one (those weird class-keyed maps) and i think it's god-awful hard to use. in my experience different projects require different types of NLP annotations and different structures for them. for example, i do dependency graph traversal a lot so i have a bit of code that does the by-governor and by-child indexing. but lots of people don't care about that so it would just be a code maintenance burden. that's why i'm a little hesitant to enshrine this sort of thing at the level of a library: it has to change all the time, which makes it a bad thing to depend on.
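
for illustration, that indexing is only a few lines on top of the json (a sketch, assuming dependencies arrive as (relation, governor_index, dependent_index) triples, whatever key they live under in your output):

from collections import defaultdict

def index_deps(deps):
    # deps: iterable of (relation, governor_index, dependent_index) triples
    by_gov = defaultdict(list)   # governor index -> [(relation, child index), ...]
    by_child = {}                # child index -> (relation, governor index)
    for rel, gov, child in deps:
        by_gov[gov].append((rel, child))
        by_child[child] = (rel, gov)
    return by_gov, by_child

which is exactly the kind of project-specific choice that's hard to enshrine in a library -- other projects would want a different indexing.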

that said i have a bunch of utilities that do stuff on top of this json data format such as

brendano commented 9 years ago

oh i didn't answer your question. yes it would be useful to build a decently tested library that made real python classes out of things and had useful support code for development. in particular, console output is a big one. i think the best way is to focus on a single project at a time and not spend too much time making a battle-tested and highly general codebase, just because you might need to dramatically shift it in the future. so i guess that's why i like these json data structures as a lowest common denominator, though they're a bit lower level than what one would ideally like to use.

AbeHandler commented 9 years ago

Whoops. Using 'pos' is certainly way easier than running a regex on the parse. Thanks for pointing that out.

I see your point about supporting versatility w/ just the raw json. I might end up coding an external wrapper for my particular project (I keep needing documents that have sentences that have tokens that have a POS/lemmatized form) with hopes of reusing it -- but I see how it would get complex to support lots of python classes for anything anyone might want to do with stanford nlp.
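
For my narrower case the layer could stay pretty thin -- roughly this sketch (assuming the per-sentence "tokens", "pos", and "lemmas" lists are parallel, and a top-level "sentences" list):

from collections import namedtuple

Token = namedtuple("Token", ["word", "pos", "lemma"])

class Sentence(object):
    def __init__(self, sent_json):
        # build Token objects from the parallel per-sentence lists
        self.tokens = [Token(w, p, l) for w, p, l in
                       zip(sent_json["tokens"], sent_json["pos"], sent_json["lemmas"])]

class Document(object):
    def __init__(self, doc_json):
        self.sentences = [Sentence(s) for s in doc_json["sentences"]]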

brendano commented 9 years ago

for your use case: all tokens have lemmas in the "lemmas" key ... are you saying particular POSes are needed?

AbeHandler commented 9 years ago

For my use case: I want to do n-gram counting using the lemmatized form. And I want a POS mask to filter out certain ngrams. So I need to know: what are all of the possible word ngrams for a sentence? And then, for each possible ngram, what is the POS (for filtering) and lemmatized form (for counting) of each token.
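
Roughly this is what I have in mind, as a sketch (pos_keep here is just a hypothetical filter; it assumes the parallel "lemmas" and "pos" lists):

from collections import Counter

def sentence_ngrams(sentence, n, pos_keep=None):
    # count lemmatized n-grams, optionally filtered by the n-gram's POS sequence
    lemmas, tags = sentence["lemmas"], sentence["pos"]
    counts = Counter()
    for i in range(len(lemmas) - n + 1):
        if pos_keep is not None and not pos_keep(tuple(tags[i:i + n])):
            continue
        counts[tuple(lemmas[i:i + n])] += 1
    return counts

# e.g. only count n-grams that contain at least one noun-ish tag:
# counts = sentence_ngrams(sent, 2, pos_keep=lambda ts: any(t.startswith("NN") for t in ts))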

Btw, this is all coming up because I am trying to apply the ngram counting code I built for the web app to counting ngrams around named entities (NER). It is taking longer than it should -- so I started thinking of how to reuse more of the code.