bbcarchdev / spindle

RES Linked Open Data aggregation engine
https://bbcarchdev.github.io/spindle/
Apache License 2.0

Some resources result in a very large set of precomputed quads #87

Open nevali opened 8 years ago

nevali commented 8 years ago

Some resources result in an extremely large set of precomposed quads being stored in the bucket. This is problematic because parsing the quads takes so long that the API request fails.

The attached example (gzipped; 20M uncompressed) demonstrates the problem.

We should:—

Internal tracking: RESDATA-1096

cgueret commented 8 years ago

"Determine what aspect of the stored data is bloating the quads": this is probably the fact that all related sources are pointed to with a seeAlso. In the specific case of Geonames, the rules in https://github.com/bbcarchdev/spindle/blob/develop/twine/rulebase.ttl#L593-L624 cause entities such as http://www.geonames.org/2653822/cardiff.html to be located in http://www.geonames.org/2634895/wales.html (parentADM1) as well as in http://www.geonames.org/2635167/united-kingdom-of-great-britain-and-northern-ireland.html (parentCountry). Unless I am mistaken, the proxy for the latter will then point to every single proxy related to it with a seeAlso. After some manual checking, it appears that the attached example is the proxy for Somalia, and that a number of Somali cities and features are attached to it by a seeAlso.

One way to fix that would be to revise the rules for Geonames. It would not be a universal fix for similar situations, but it could still be worthwhile.

nevali commented 8 years ago

Per-authority rules aren't really a solution here. Anything which has a lot of inbound references will trigger the same thing (e.g., a programme genre).

cgueret commented 8 years ago

Sure. Another way to phrase the issue is to ask whether all those links are necessary, but that then becomes a question of CBD (Concise Bounded Description) versus SCBD (Symmetric Concise Bounded Description), and opinions may vary.
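
To make the size difference concrete, here is a rough sketch in plain Python. It is not Spindle code: the `ex:` identifiers, the predicate names and the count of 10,000 places are made up for illustration, and both functions are simplified (the blank-node closure that a real CBD includes is left out).

```python
# Toy triples as (subject, predicate, object) tuples.
# All identifiers below are hypothetical, not taken from the real data.

def cbd(triples, node):
    """Simplified CBD: every triple that has `node` as its subject."""
    return [t for t in triples if t[0] == node]

def scbd(triples, node):
    """Simplified SCBD: the CBD plus every triple that has `node` as its object."""
    inbound = [t for t in triples if t[2] == node and t[0] != node]
    return cbd(triples, node) + inbound

country = "ex:somalia"
triples = [(country, "rdfs:label", "Somalia")]
# Many places point at the country, as the parentCountry rule would produce.
triples += [(f"ex:place{i}", "ex:parentCountry", country) for i in range(10000)]

cbd_size = len(cbd(triples, country))    # outbound statements only
scbd_size = len(scbd(triples, country))  # outbound plus inbound statements
print(cbd_size, scbd_size)
```

The outbound-only description stays the same size however many places reference the country, while the symmetric one grows with every inbound reference.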

cgueret commented 8 years ago

I think we could decide not to materialise all those links and instead associate the resources at query time if we want to. This is similar to what DBpedia does on http://dbpedia.org/page/Somalia, where all the "is <> of" statements are served at query time.
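
A minimal sketch of that idea, assuming a hypothetical in-memory `TripleStore` class (nothing here is Spindle's actual storage or API): only outbound statements go into the stored description, and inbound references are looked up from an index when requested, one page at a time.

```python
from collections import defaultdict
from itertools import islice

class TripleStore:
    """Hypothetical store that indexes triples by subject and by object."""

    def __init__(self):
        self._by_subject = defaultdict(list)
        self._by_object = defaultdict(list)

    def add(self, s, p, o):
        triple = (s, p, o)
        self._by_subject[s].append(triple)
        self._by_object[o].append(triple)

    def describe(self, node):
        """Outbound statements only: this is what would be precomposed."""
        return list(self._by_subject.get(node, []))

    def inbound(self, node, offset=0, limit=25):
        """Inbound statements ("is <> of"), resolved at query time and paged."""
        hits = self._by_object.get(node, [])
        return list(islice(hits, offset, offset + limit))

store = TripleStore()
country = "ex:somalia"  # illustrative identifier
store.add(country, "rdfs:label", "Somalia")
for i in range(5000):
    store.add(f"ex:place{i}", "ex:parentCountry", country)

description = store.describe(country)       # small, independent of fan-in
first_page = store.inbound(country, 0, 25)  # bounded by the page size
print(len(description), len(first_page))
```

With this split, the cost of describing a heavily referenced resource is bounded by the page size rather than by the number of resources pointing at it.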