cegme / gatordsr

University of Florida Trec KBA code and more

Chunking #27

Open mshahriarinia opened 11 years ago

mshahriarinia commented 11 years ago

It looks like Stanford NLP does not provide chunking: here. That is why we got non-chunked noun phrases as triples in the pipeline output. The only Java library I found was "Mark Greenwood's Noun Phrase Chunker", downloadable from here, but it doesn't seem to be maintained. I have tested NLTK, and it does chunking.

I checked several resources on this, such as here, but none of them mention a chunking feature in Stanford NLP.

Any thoughts? @cegme @SunPHM

cegme commented 11 years ago

This chunking does need to be done. @SunPHM Is this on your radar?

peng51 commented 11 years ago

I think Morteza is referring to noun phrase chunking. What kind of problem does noun phrase chunking address?

peng51 commented 11 years ago

If we only need noun phrases, it should be very simple to build them from Stanford POS tagging. For more complex chunking, we can use the OpenNLP chunker: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.chunker
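As a rough illustration of the POS-tag approach, here is a minimal sketch in plain Python. It assumes the tokens have already been tagged with Penn Treebank tags by some tagger (Stanford's or otherwise). The function name `chunk_noun_phrases` and the pattern "optional determiner, any adjectives, one or more nouns" are assumptions made for this example, not something taken from the pipeline.

```python
# Minimal noun-phrase chunker over (token, POS) pairs with Penn Treebank tags.
# Assumption: the tagged input comes from an external POS tagger.
# The function name and the DT? JJ* NN+ pattern are illustrative only.

def chunk_noun_phrases(tagged):
    """Return phrases matching: optional determiner, adjectives, one or more nouns."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        j = i
        # Optional determiner (DT) or possessive pronoun (PRP$).
        if j < n and tagged[j][1] in ("DT", "PRP$"):
            j += 1
        # Any number of adjectives (JJ, JJR, JJS).
        while j < n and tagged[j][1].startswith("JJ"):
            j += 1
        # One or more nouns (NN, NNS, NNP, NNPS).
        k = j
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:
            # At least one noun was found, so emit the whole span as a chunk.
            phrases.append(" ".join(tok for tok, _ in tagged[start:k]))
            i = k
        else:
            # No noun here; move on by one token and try again.
            i = start + 1
    return phrases


if __name__ == "__main__":
    sentence = [
        ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
        ("jumped", "VBD"), ("over", "IN"),
        ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"), (".", "."),
    ]
    print(chunk_noun_phrases(sentence))
```

A pattern this simple misses coordinated or nested noun phrases, which is where a trained chunker such as the OpenNLP one would likely do better.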

cegme commented 11 years ago

Let's have a simple chunking format for now. We need the triple results to look more sensible. Also check out http://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/dongqing-chunking.pdf

cegme commented 11 years ago

@SunPHM I just checked out the link you posted. First use whichever method would be faster to implement.

peng51 commented 11 years ago

It seems that LingPipe has a phrase chunker too: http://alias-i.com/lingpipe/demos/tutorial/posTags/read-me.html. NLTK is a Python library, while OpenNLP and LingPipe are both Java libraries.

mshahriarinia commented 11 years ago

In triple generation we get tokens such as @, the, a, +, of, ... and many other results that are not noun phrases or noun chunks.
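Until proper chunking is in place, one stopgap could be to drop triples whose arguments consist only of stopwords or punctuation. The sketch below assumes a triple is a `(subject, relation, object)` tuple of strings, which this thread does not actually specify. The stopword list and the names `is_plausible_argument` and `filter_triples` are made up for illustration.

```python
import string

# Assumption: a small hand-picked stopword list; a real list would be longer.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "or"}


def is_plausible_argument(text):
    """True if at least one token is neither a stopword nor pure punctuation."""
    for tok in text.lower().split():
        is_punct = all(ch in string.punctuation for ch in tok)
        if tok not in STOPWORDS and not is_punct:
            return True
    return False


def filter_triples(triples):
    """Keep triples whose subject and object both look like content phrases."""
    return [
        t for t in triples
        if is_plausible_argument(t[0]) and is_plausible_argument(t[2])
    ]


if __name__ == "__main__":
    triples = [
        ("@", "is", "the"),
        ("Barack Obama", "visited", "Florida"),
        ("a", "of", "+"),
        ("the president", "met", "of"),
    ]
    print(filter_triples(triples))
```

This only hides the noisy output. It does not replace chunking, since arguments that pass the filter can still be fragments of a larger noun phrase.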