Constannnnnt / Distributed-CoreNLP

This infrastructure, built on Stanford CoreNLP, MapReduce, and Spark in Java, aims to process document annotations at large scale.
https://github.com/Constannnnnt/Distributed-CoreNLP
MIT License

To Do: #8

Closed by ji-xin 5 years ago

ji-xin commented 5 years ago

I have finished setting up the framework for Spark and CoreNLP, including the annotator dependency stuff.

What we need to do:

  1. Take a look at CoreNLP.java, lines 36-50. Those are the functionalities we need to implement (at least most of them).
  2. Same file, lines 155-183: this is the major part. Each line is read in as a pair (Index, Line), and the expected output is an iterator of Array: [((Index, Func0), Output0), ..., ((Index, FuncN), OutputN)]. I have implemented the two easiest functionalities, and the others should be done in the same way. Don't underestimate this: Stanford CoreNLP is poorly documented, and this can get nasty.
  3. Once these functionalities are done, at line 184 (after the flatMap), group by functionality and then sort by Index. The output of different functionalities should go into different folders, but the order of sentences must stay unchanged within each folder. See the sketch after this list.
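A minimal sketch, in Spark's Java API, of the shape described above: lines are indexed, a flatMap emits ((Index, Func), Output) records, and each functionality is sorted by Index and written to its own folder. The class name, the "lemma"/"ner" placeholder functionalities, the args-based paths, and the per-record pipeline construction are illustrative assumptions only; the real implementation belongs in CoreNLP.java around the lines referenced above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PipelineSketch {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Distributed-CoreNLP-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Step 2: each input line becomes a pair (Index, Line) ...
        JavaPairRDD<Long, String> indexed = sc.textFile(args[0])
                .zipWithIndex()
                .mapToPair(t -> new Tuple2<>(t._2, t._1));

        // ... and the flatMap emits one record per functionality:
        // ((Index, Func), Output). Only two illustrative functionalities
        // are shown; the others follow the same pattern.
        JavaPairRDD<Tuple2<Long, String>, String> annotated =
                indexed.flatMapToPair(pair -> {
                    Properties props = new Properties();
                    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
                    // NOTE: building a pipeline per record is wasteful; the real
                    // code should share one pipeline per partition/executor.
                    StanfordCoreNLP nlp = new StanfordCoreNLP(props);
                    CoreDocument doc = new CoreDocument(pair._2);
                    nlp.annotate(doc);

                    String lemmas = doc.tokens().stream()
                            .map(tok -> tok.lemma()).collect(Collectors.joining(" "));
                    String ners = doc.tokens().stream()
                            .map(tok -> tok.ner()).collect(Collectors.joining(" "));

                    List<Tuple2<Tuple2<Long, String>, String>> out = new ArrayList<>();
                    out.add(new Tuple2<>(new Tuple2<>(pair._1, "lemma"), lemmas));
                    out.add(new Tuple2<>(new Tuple2<>(pair._1, "ner"), ners));
                    return out.iterator();
                });

        // Step 3: one output folder per functionality; sorting by Index keeps
        // the original sentence order inside each folder.
        for (String func : new String[]{"lemma", "ner"}) {
            annotated
                    .filter(t -> t._1._2.equals(func))
                    .mapToPair(t -> new Tuple2<>(t._1._1, t._2))
                    .sortByKey()
                    .values()
                    .saveAsTextFile(args[1] + "/" + func);
        }

        sc.stop();
    }
}
```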
ji-xin commented 5 years ago

Also, always use spaces instead of tabs to indent.

Constannnnnt commented 5 years ago

I will try sentiment, cleanxml, ssplit, coref, dcoref this weekend.
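Illustrative only: one possible annotator configuration for these functionalities, including the upstream annotators that sentiment and coreference require. This repo's actual dependency handling lives in CoreNLP.java and may differ; the example uses dcoref, and the statistical coref annotator would be requested analogously.

```java
import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class AnnotatorSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // cleanxml and ssplit need tokenize; sentiment needs parse;
        // dcoref needs pos, lemma, ner, and parse.
        props.setProperty("annotators",
                "tokenize,cleanxml,ssplit,pos,lemma,ner,parse,sentiment,dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("<doc>Stanford CoreNLP is nice. It works.</doc>");
        pipeline.annotate(doc);
    }
}
```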

KaisongHuang commented 5 years ago

I have claimed my tasks. BTW, please pull the master branch before making new changes.