peng51 closed this issue 11 years ago
Need to add one more package for Stanford NLP: the stanford-corenlp-1.3.4-models.jar.
Resolved the models library issue. Morteza, each time build.sbt changes, we need to run `sbt eclipse` to regenerate the Eclipse project profile and then refresh the project in Eclipse.
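For reference, a sketch of what the build.sbt entries could look like (the exact coordinates are my assumption; the models jar is typically published under the `models` classifier on Maven Central):

```scala
// build.sbt (sketch): CoreNLP plus its companion models jar.
libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "1.3.4",
  "edu.stanford.nlp" % "stanford-corenlp" % "1.3.4" classifier "models"
)
```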
I added a library dependency that pulls in the Google Guava libraries.
Now we can easily use a Bloom filter; here is an example:

```scala
import com.google.common.hash.{BloomFilter, Funnels}

val bf = BloomFilter.create(Funnels.stringFunnel, 10, .001)
bf.put("hello")          // adds an item to the Bloom filter
bf.mightContain("hello") // true; Bloom filters have no false negatives
```
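A note on the parameters (this is the standard Guava `create` signature, not something stated above): the second argument is the expected number of insertions and the third is the desired false-positive probability, so the 10 and .001 should be tuned to however many relations we actually insert.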
We can talk about how to use it if you wish.
We just need to iterate over the example relations found in the ClueWeb corpus and use the Bloom filter to check for their presence. We could probably serialize that object to the resources folder and load it at run time.
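A minimal sketch of the serialize/load idea, assuming plain Java serialization (Guava's `BloomFilter` is `Serializable`); the file path, sizing numbers, and function names are placeholders:

```scala
import java.io._
import com.google.common.hash.{BloomFilter, Funnels}

// Build the filter from the ClueWeb relations and write it to disk.
def saveFilter(relations: Iterator[String], file: File): Unit = {
  // Sizing is a guess; tune expected insertions and false-positive rate.
  // Funnels.stringFunnel matches the usage above (newer Guava renames it).
  val bf = BloomFilter.create(Funnels.stringFunnel, 1000000, .001)
  relations.foreach(r => bf.put(r))
  val out = new ObjectOutputStream(new FileOutputStream(file))
  try out.writeObject(bf) finally out.close()
}

// Load the serialized filter back at run time.
def loadFilter(file: File): BloomFilter[CharSequence] = {
  val in = new ObjectInputStream(new FileInputStream(file))
  try in.readObject().asInstanceOf[BloomFilter[CharSequence]] finally in.close()
}
```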
@SunPHM Alternatively, we may be able to simply add the ReVerb libraries and use them directly.
@SunPHM I added a function to let you extract all the relations found in a string. https://github.com/cegme/gatordsr/blob/master/code/src/main/scala/edu/ufl/cise/util/RelationChecker.scala
Call RelationChecker.WikiRelations(testSentence); it returns an Iterator[String] of proper relations. There are ways to return a "word" object to help find the entity, but I figured this would help you.
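For example (the sentence is made up; WikiRelations comes from the file linked above):

```scala
import edu.ufl.cise.util.RelationChecker

val testSentence = "Abraham Lincoln was the 16th President of the United States."
RelationChecker.WikiRelations(testSentence).foreach(println) // prints each proper relation
```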
@cegme I tried the Faucet. I have some questions: (1) How do I get the text from the Item.body? (2) Where do I find the entity/slot pairs that you mentioned to me yesterday?
You can get the text from the item body by doing something like:

```scala
val body = new String(si.body, "UTF-8")
```

Beware of the `UnsupportedEncodingException`. This is just the [String constructor](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[], java.nio.charset.Charset)).
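As an aside (my suggestion, not from the thread): the overload that takes a `Charset` object avoids the checked exception entirely:

```scala
import java.nio.charset.Charset

// The Charset overload never throws UnsupportedEncodingException,
// unlike the String-name overload used above.
val body = new String(si.body, Charset.forName("UTF-8"))
```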
Also, yesterday I mentioned the SSFQuery class. This represents the query and the two slots we are given.
I have used some random pages from Wikipedia to test the pipeline. However, the results are not promising: first, it finds many non-interesting relations; second, some relations don't have the right entities. So I have thought about it, and maybe we can change the algorithm in the following ways:

1. Use the University of Washington's ReVerb to extract relations. From what I have tried, ReVerb doesn't output all the interesting relations, so we can also try to find other existing libraries.
2. Try to improve our own algorithms (it is not very clear how to do this; I have some ideas, but I don't know whether they will work or not).
3. Instead of finding relations, try the opposite direction: test whether the query's relation exists in a given sentence or not.

I think there are some problems associated with these directions:

1. How do we use the processing results of Stanford NLP to do relation extraction or relation checking?
2. How do we associate the results with confidence, and what kind of model do we want to train?
3. Are the filters in the pipeline necessary, and how can we design a better pipeline?
@SunPHM We want to get a first version up as soon as possible; we should already have the first baseline up. Please commit your current code, and we will do a code review and tag that as baseline one.
Optimizations will come later.
I have added Pipeline.scala. For the time being, you need to initialize the annotators first:

```scala
Pipeline.init()
```

Then, for example:

```scala
val text = "Abraham Lincoln was the 16th President of the United States, serving from March 1861 until his assassination in April 1865."
Pipeline.run(text)
```

The run method will output relations.
We are already able to extract relations from streams! Closing this issue.
We want to use the Stanford CoreNLP packages to do the linguistic processing. We need to find out how to implement the pipeline with multiple stages, how to use filters to accelerate the processing, what the input/output of each stage is, and how to train/test the pipeline, especially the filters.
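As a starting point, here is a minimal sketch of a multi-stage CoreNLP pipeline in Scala; the annotator list is an assumption, and the custom stages and filters are exactly what we still need to work out:

```scala
import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val props = new Properties()
// Each annotator is one stage; later stages consume earlier ones' output.
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
val pipeline = new StanfordCoreNLP(props)

val doc = new Annotation("Abraham Lincoln was the 16th President of the United States.")
pipeline.annotate(doc) // runs all stages; results are attached to doc's annotations
```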