AbeHandler / rookie

The Rookie Text Analysis System

dupe articles #240

Open AbeHandler opened 8 years ago

AbeHandler commented 8 years ago

see aug 20 1994, "Refugee Weapon Has Been Wielded Before" in cuba corpus

[screenshot, 2016-08-23]

Not sure where this is coming from (Penn LDC?). The dupes are as high up as /corpora/country/processed/all.anno_plus. For now, just doing $ sort all.anno_plus | uniq to fix
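One side effect of sort | uniq is that it reorders the file. If the original line order matters, an order-preserving exact dedup is a small alternative. This is only a sketch: it assumes all.anno_plus holds one document per line (which the sort | uniq fix already implies), and `dedupe_lines` is a hypothetical helper name, not something in the repo.

```python
def dedupe_lines(lines):
    """Yield each distinct line once, in first-seen order.

    Exact-match only: like `sort | uniq`, this does not catch
    near-duplicates that differ by even one character.
    """
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python dedupe_lines.py < all.anno_plus > all.dedup
    sys.stdout.writelines(dedupe_lines(sys.stdin))
```

This keeps every distinct line in memory, which should be fine for a corpus of this size but is worth keeping in mind for much larger files.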

brendano commented 8 years ago

the LDC NYT definitely has dupes. here's my shingling code i've used before on the union of gigaword and NYT which seemed ok on spot checks:

https://github.com/brendano/OConnor_IREvents_ACL2013/blob/master/code/preproc/docdedup.py

Explanation:

We use a simple form of shingling (ch. 3, Rajaraman and Ullman, 2011): represent a document signature as its J = 5 lowercased bigrams with the lowest hash values, and reject a document with a signature that has been seen before within the same month. J was manually tuned, as it affects the precision/recall tradeoff.

that's http://infolab.stanford.edu/~ullman/mmds.html
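For readers who don't want to open the linked script, here is a minimal sketch of the scheme described in the explanation above: lowercase the text, take its word bigrams, keep the J = 5 lowest hash values as the signature, and reject any document whose signature was already seen in the same month. It is not the code from docdedup.py. The tokenizer, the MD5-based hash, the `(month, text)` input shape, and every function name here are assumptions made for illustration.

```python
import hashlib
import re

J = 5  # signature size from the explanation; tune for precision/recall


def bigrams(text):
    """Set of lowercased word bigrams (tokenizer is an assumption)."""
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}


def stable_hash(s):
    """Deterministic 64-bit hash; Python's built-in hash() is salted per run."""
    return int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")


def signature(text, j=J):
    """The j lowest bigram hash values, as a sorted tuple."""
    return tuple(sorted(stable_hash(b) for b in bigrams(text))[:j])


def dedupe(docs):
    """Yield (month, text) pairs whose signature is new for that month.

    `docs` is an iterable of (month, text), e.g. ("1994-08", "...").
    """
    seen = set()
    for month, text in docs:
        key = (month, signature(text))
        if key in seen:
            continue  # same signature already seen this month: treat as a dupe
        seen.add(key)
        yield month, text


if __name__ == "__main__":
    # Toy, made-up texts purely for illustration.
    docs = [
        ("1994-08", "The quick brown fox jumps over the lazy dog again"),
        ("1994-08", "THE QUICK brown fox jumps over the lazy dog, again."),
        ("1994-09", "The quick brown fox jumps over the lazy dog again"),
    ]
    for month, text in dedupe(docs):
        print(month, text)
```

Two caveats with this sketch: documents with fewer than two tokens all get an empty signature and so collide with each other within a month, and a smaller J makes the filter more aggressive (more recall, less precision), which matches the note that J was tuned by hand.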