LOD-Laundromat / lodlaundry.github.io

http://lodlaundromat.org

Extremely slow cleaning process #88

Closed LaurensRietveld closed 9 years ago

LaurensRietveld commented 9 years ago

For 1 particular dataset: http://lodlaundromat.org/resource/4d8d805d096f02e59ce2ef2afbc182c1

This is a synthetic (SP2B) dataset of 100 million triples. It must have some strange characteristics, as its cleaning process has been running for more than 2 days. The process has not stalled, though: the cleaned output file keeps growing in size, little by little.

Note that two smaller SP2B datasets were crawled successfully (see http://lodlaundromat.org/resource/fb203cdec704da4ca29fe8e1e8efee6c and http://lodlaundromat.org/resource/cb677cc0b0d3be4e5de9c9819f300cc0).

To see what kind of data this file contains, check out this dirty file with just 1000 triples: http://lrd900.d2s.labs.vu.nl/sp2b/1000.n3.gz
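For a quick look at such a file without unpacking it, something like the following could work. This is only a sketch: it assumes the sample has already been downloaded to a local path, and `peek_gzip_lines` is a hypothetical helper name, not part of the LOD Laundromat code.

```python
import gzip
from itertools import islice


def peek_gzip_lines(path, n=5):
    """Return the first n non-empty lines of a gzip-compressed text file.

    Hypothetical helper for eyeballing a dirty dump; it does no RDF parsing.
    """
    # errors="replace" keeps the preview going even if the dump has bad bytes.
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        stripped = (line.rstrip("\n") for line in handle)
        non_empty = (line for line in stripped if line.strip())
        # islice stops after n lines, so the rest is never decompressed.
        return list(islice(non_empty, n))


# Example (assumes 1000.n3.gz was saved in the current directory):
# for line in peek_gzip_lines("1000.n3.gz", n=10):
#     print(line)
```

Because the file is streamed, the same call should also be usable on the much larger dumps to check their first few statements.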

LaurensRietveld commented 9 years ago

The cleaning process was refactored drastically 2 months ago, so I am considering this done.