Open RinkeHoekstra opened 8 years ago
FWIW the VOID file we're talking about here is 29MB in size.
(This should actually be in LOD-Laundromat/LOD-Laundromat I guess)
For context, see this query. There are 5767 distinct source IRIs with a hash. When the hash is removed, we're left with 141 distinct IRIs. Indeed, quite some overlap.
For instance: http://rdfdata.eionet.europa.eu/eurostat/void.rdf#env_rwat_rbd
This URI appears in very many documents that are all the same. This is because, I guess, you crawl the web for every URI that occurs in the document http://rdfdata.eionet.europa.eu/eurostat/void.rdf. Because the hash URIs are all different, you apparently crawl the same file as many times as there are hash URIs in the file. This potentially adds a significant number of duplicates to the LOD Laundromat collection.
A potential solution is to check whether the hash URI without its fragment (the part after the #) is the same as the base URI of the document being crawled. If it is, stop crawling. If it isn't, just continue crawling...
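To make the proposed check concrete, here is a rough sketch in Python. It is not taken from the LOD Laundromat code base: the function names `strip_fragment` and `should_crawl` are made up for illustration, and I'm assuming the crawler knows the base URI of the document it is currently processing.

```python
from urllib.parse import urldefrag


def strip_fragment(iri: str) -> str:
    """Return the IRI without its fragment (the part after '#')."""
    return urldefrag(iri).url


def should_crawl(iri: str, base_iri: str) -> bool:
    """Hypothetical helper: decide whether an IRI found in a document
    should be dereferenced.

    If the IRI, once its fragment is removed, points back at the document
    currently being crawled, fetching it would only download the same
    file again, so we skip it.
    """
    return strip_fragment(iri) != strip_fragment(base_iri)


if __name__ == "__main__":
    base = "http://rdfdata.eionet.europa.eu/eurostat/void.rdf"
    # Hash URI that lives inside the document itself: skip.
    print(should_crawl(base + "#env_rwat_rbd", base))
    # IRI pointing at a different (made-up) document: crawl.
    print(should_crawl("http://example.org/other.rdf#thing", base))
```

This only covers the "same as the base URI" case described above. Hash URIs that point into some other document would still be fetched once per distinct fragment unless the crawler also remembers which fragment-less URLs it has already visited.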