HazyResearch / deepdive

DeepDive
deepdive.stanford.edu

Inconsistent results with current deepdive version. #629

Open ghost opened 7 years ago

ghost commented 7 years ago

Hi, I am currently running the latest version of DeepDive on an EC2 instance (r3.xlarge) with Ubuntu 16.04. I am using the spouse example with a 5,000-article subset of SignalMedia that I created: signalmedia-5k.jsonl.gz. I noticed that I don't get the same number of rows in the database each time I run the process. Here are my results:

| Run | Articles | Sentences | person_mention | spouse_feature |
| --- | --- | --- | --- | --- |
| 1 | 5000 | 92074 | 39455 | 53618035 |
| 2 | 5000 | 92070 | 39452 | 53635745 |
| 3 | 5000 | 92069 | 39470 | 53636191 |
| 4 | 5000 | 92071 | 39457 | 53637198 |

I am about to write a Python script that compares the databases from different runs to find out what the differences are.
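
A rough sketch of the kind of comparison I have in mind, assuming each run is loaded into its own PostgreSQL database and that the sentences table has the doc_id, sentence_index, and tokens columns from the spouse example schema (the database names below are placeholders):

```python
import psycopg2

# Placeholder database names for two separate runs of the pipeline.
RUN_A = "deepdive_spouse_run1"
RUN_B = "deepdive_spouse_run2"

# Assumed columns of the sentences table (as in the spouse example schema).
QUERY = """
    SELECT doc_id, sentence_index, tokens
    FROM sentences
    ORDER BY doc_id, sentence_index
"""

def load_sentences(dbname):
    """Return {(doc_id, sentence_index): tokens} for one run."""
    conn = psycopg2.connect(dbname=dbname)
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return {(doc_id, idx): tokens for doc_id, idx, tokens in cur}
    finally:
        conn.close()

a = load_sentences(RUN_A)
b = load_sentences(RUN_B)

# Rows present in one run but missing from the other.
print("rows only in run A:", len(a.keys() - b.keys()))
print("rows only in run B:", len(b.keys() - a.keys()))

# Rows present in both runs but with different token sequences.
for key in sorted(a.keys() & b.keys()):
    if a[key] != b[key]:
        print("tokens differ for doc_id=%s sentence_index=%s" % key)
        print("  run A:", a[key])
        print("  run B:", b[key])
```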

ghost commented 7 years ago

I looked at the differences in the tables. I noticed that even when two sentences tables have the same number of rows, they don't contain the same content. I looked quickly at the differences and nothing obvious pops up. In one case \u2013 is treated as a group of four digits instead of a single character. In another case it isn't clear what the difference is.

ghost commented 7 years ago

OK. I did multiple runs on my 5k spouse test to get differing sentences results, and I created a small Python program to compare them. From what I see, when there are differences they are always in the last lines of a document. It seems a string variable is being reused somewhere: when a document is smaller than the previous one, DeepDive or CoreNLP mixes the two documents by appending the end of the larger document to the smaller one.

ghost commented 7 years ago

I pushed my tests further. I think there is an issue with CoreNLP. I have runs where, in the same sentence of the same document, words are cut in half, e.g. tokenized as "available" in one run and as "availa", "ble" in another run.

manning commented 7 years ago

Did you ever solve this? Can you provide a document where this can be directly reproduced by just calling CoreNLP?

ghost commented 6 years ago

Hi @manning, we didn't solve this. We haven't tried to reproduce it by calling CoreNLP directly, but we will probably do so in the future since we are using it.
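
When we get to it, something along these lines should let us check CoreNLP's tokenization directly, assuming a local CoreNLP server is running (started with the StanfordCoreNLPServer class on port 9000); the input file name here is just a placeholder:

```python
import json
import requests

CORENLP_URL = "http://localhost:9000/"
PROPS = {"annotators": "tokenize,ssplit", "outputFormat": "json"}

def tokenize(text):
    """Return the token strings per sentence as produced by the CoreNLP server."""
    resp = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(PROPS)},
        data=text.encode("utf-8"),
    )
    resp.raise_for_status()
    doc = resp.json()
    return [[tok["word"] for tok in sent["tokens"]] for sent in doc["sentences"]]

# Placeholder: one article from the corpus where differences were observed.
with open("suspect-article.txt", encoding="utf-8") as f:
    text = f.read()

# Tokenize the same document several times and check for non-determinism.
runs = [tokenize(text) for _ in range(10)]
if all(run == runs[0] for run in runs):
    print("tokenization identical across 10 runs")
else:
    print("tokenization differs between runs")
```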