LOD-Laundromat / Backend

This backend supports downloading new files, and adding new items to the seed list

Documents that contain hash URIs are crawled multiple times #1

Open RinkeHoekstra opened 8 years ago

RinkeHoekstra commented 8 years ago

For instance: http://rdfdata.eionet.europa.eu/eurostat/void.rdf#env_rwat_rbd

This URI appears in a large number of documents that are all identical. I guess this is because you crawl the web for every URI that occurs in the document http://rdfdata.eionet.europa.eu/eurostat/void.rdf. Because the hash URIs are all different, you apparently crawl the same file as many times as there are hash URIs in it.

This potentially adds a significant number of duplicates to the LOD Laundromat collection.

A potential solution is to check whether the hash URI, with its fragment stripped, is the same as the base URI of the document being crawled. If it is, stop crawling; if it isn't, just continue crawling. A sketch of that check follows.
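
A minimal sketch of that check, in Python rather than the backend's actual codebase; the function name `should_crawl` and the way it would be hooked into the crawler are assumptions, not existing API:

```python
# Hypothetical crawler hook: decide whether a URI found in a document
# should be scheduled for crawling, given the document's base URI.
from urllib.parse import urldefrag

def should_crawl(candidate_uri: str, document_base_uri: str) -> bool:
    """Return False when a hash URI points back into the document
    currently being crawled, True otherwise."""
    # urldefrag splits "http://ex.org/void.rdf#env_rwat_rbd" into
    # ("http://ex.org/void.rdf", "env_rwat_rbd").
    base_without_fragment, _fragment = urldefrag(candidate_uri)
    # Same document: skip it instead of crawling it again.
    return base_without_fragment != document_base_uri

# Example from this issue: the check would return False, so the VOID
# file would not be re-crawled for every hash URI it contains.
# should_crawl("http://rdfdata.eionet.europa.eu/eurostat/void.rdf#env_rwat_rbd",
#              "http://rdfdata.eionet.europa.eu/eurostat/void.rdf")
```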

RinkeHoekstra commented 8 years ago

FWIW the VOID file we're talking about here is 29MB in size.

(This should actually be in LOD-Laundromat/LOD-Laundromat I guess)

LaurensRietveld commented 8 years ago

For context, see this query: 5767 distinct source IRIs with a hash. When the hash is removed, we're left with 141 distinct IRIs. Indeed, quite some overlap.
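
A rough sketch of the de-duplication behind those numbers, assuming `source_iris` stands in for the IRIs returned by the query (not actual backend code):

```python
# Strip the fragment from each source IRI and count the distinct
# document IRIs that remain, to quantify the overlap.
from urllib.parse import urldefrag

def distinct_without_fragment(source_iris):
    """Return (hash IRI count, distinct document IRI count)."""
    documents = {urldefrag(iri)[0] for iri in source_iris}
    return len(source_iris), len(documents)

# With the figures from the query, 5767 hash IRIs collapse to 141
# documents, i.e. roughly 40 hash IRIs per document on average.
```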