It's described in the draft of my PhD thesis:
http://nlp.fi.muni.cz/~xpomikal/phdthesis-draft.pdf
In a nutshell: near-duplicates are identified by matching n-grams of tokens
(5-grams by default). The algorithm is designed to be scalable. The largest
data set we have applied it to so far was a 10-billion-word corpus (a 119 GB
file). Processing it on a mid-range server took about 12 hours, required less
than 10 GB of RAM, and used only a single CPU core. This time includes
precomputing the duplicate n-grams with hashgen and hashdup (see `man hashgen`
and `man hashdup`). The same computation without precomputing the duplicate
n-grams took only about 6 hours, but required about 60 GB of RAM.
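To make the n-gram matching idea concrete, here is a minimal Python sketch, not the project's actual implementation: it hashes every 5-gram of tokens and flags a document as a near-duplicate when more than a threshold share of its n-grams has already been seen in previously accepted documents. The 0.5 threshold, the SHA-1 hashing, and the fully in-memory set of seen hashes are assumptions for illustration; the real tool is far more memory-efficient, as the figures above show.

    import hashlib
    from typing import Iterable, Iterator, List, Tuple

    N = 5            # n-gram length; the tool defaults to 5-grams of tokens
    THRESHOLD = 0.5  # assumed duplicate-content ratio, chosen for illustration

    def ngram_hashes(tokens: List[str], n: int = N) -> List[int]:
        """Hash each n-gram of tokens to a 64-bit integer."""
        hashes = []
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n]).encode("utf-8")
            digest = hashlib.sha1(gram).digest()
            hashes.append(int.from_bytes(digest[:8], "big"))
        return hashes

    def deduplicate(documents: Iterable[List[str]]) -> Iterator[Tuple[bool, List[str]]]:
        """Yield (is_near_duplicate, tokens) for each tokenized document.

        A document is flagged when more than THRESHOLD of its n-grams
        were already seen in previously accepted documents.
        """
        seen = set()  # hashes of n-grams seen so far (kept fully in memory here)
        for tokens in documents:
            hashes = ngram_hashes(tokens)
            if not hashes:
                yield False, tokens
                continue
            dup_ratio = sum(h in seen for h in hashes) / len(hashes)
            is_dup = dup_ratio > THRESHOLD
            if not is_dup:
                seen.update(hashes)
            yield is_dup, tokens

    if __name__ == "__main__":
        docs = [
            "the quick brown fox jumps over the lazy dog again and again".split(),
            "the quick brown fox jumps over the lazy dog again and again today".split(),
        ]
        for is_dup, tokens in deduplicate(docs):
            print(is_dup, " ".join(tokens))

In this toy run the second document shares 8 of its 9 5-grams with the first, so it is flagged as a near-duplicate.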
When I have more time, I will extract the relevant information from the thesis
and put it on the wiki.
Original comment by jan.pomi...@gmail.com on 12 May 2011 at 2:38
The link to the thesis draft in the previous post is no longer valid. Here is
a link to the final text:
http://is.muni.cz/th/45523/fi_d/phdthesis.pdf
Original comment by jan.pomi...@gmail.com on 20 Dec 2014 at 8:29
Original issue reported on code.google.com by tur...@gmail.com on 7 May 2011 at 5:36