It's described in the draft of my PhD thesis:
http://nlp.fi.muni.cz/~xpomikal/phdthesis-draft.pdf
In a nutshell: near-duplicates are identified by matching n-grams of tokens
(5-grams by default). The algorithm is designed to be scalable. The largest
data set we have applied it to so far was a 10-billion-word corpus (a 119 GB
file). Processing it on a mid-range server took about 12 hours, required less
than 10 GB of RAM, and used only a single CPU core. This time includes
precomputing the duplicate n-grams with hashgen and hashdup (see `man hashgen`
and `man hashdup`). The same computation without precomputing the duplicate
n-grams took only about 6 hours, but required about 60 GB of RAM.
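To make the n-gram matching idea concrete, here is a minimal Python sketch, not the project's actual implementation: it hashes every 5-gram of tokens and flags a document as a near-duplicate when more than a threshold share of its n-grams has already been seen in previously accepted documents. The 0.5 threshold, the SHA-1 hashing, and the fully in-memory set of seen hashes are assumptions for illustration; the real tool is far more memory-efficient, as the figures above show.

    import hashlib
    from typing import Iterable, Iterator, List, Tuple

    N = 5            # n-gram length; the tool defaults to 5-grams of tokens
    THRESHOLD = 0.5  # assumed duplicate-content ratio, chosen for illustration

    def ngram_hashes(tokens: List[str], n: int = N) -> List[int]:
        """Hash each n-gram of tokens to a 64-bit integer."""
        hashes = []
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n]).encode("utf-8")
            digest = hashlib.sha1(gram).digest()
            hashes.append(int.from_bytes(digest[:8], "big"))
        return hashes

    def deduplicate(documents: Iterable[List[str]]) -> Iterator[Tuple[bool, List[str]]]:
        """Yield (is_near_duplicate, tokens) for each tokenized document.

        A document is flagged when more than THRESHOLD of its n-grams
        were already seen in previously accepted documents.
        """
        seen = set()  # hashes of n-grams seen so far (kept fully in memory here)
        for tokens in documents:
            hashes = ngram_hashes(tokens)
            if not hashes:
                yield False, tokens
                continue
            dup_ratio = sum(h in seen for h in hashes) / len(hashes)
            is_dup = dup_ratio > THRESHOLD
            if not is_dup:
                seen.update(hashes)
            yield is_dup, tokens

    if __name__ == "__main__":
        docs = [
            "the quick brown fox jumps over the lazy dog again and again".split(),
            "the quick brown fox jumps over the lazy dog again and again today".split(),
        ]
        for is_dup, tokens in deduplicate(docs):
            print(is_dup, " ".join(tokens))

In this toy run the second document shares 8 of its 9 5-grams with the first, so it is flagged as a near-duplicate.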
When I have more time, I will extract the relevant information from the thesis
and put it on the wiki.
Original comment by jan.pomi...@gmail.com on 12 May 2011 at 2:38
The link to the thesis draft in the previous post is no longer valid. Here is
a link to the final text:
http://is.muni.cz/th/45523/fi_d/phdthesis.pdf
Original comment by jan.pomi...@gmail.com on 20 Dec 2014 at 8:29
Original issue reported on code.google.com by tur...@gmail.com on 7 May 2011 at 5:36