baturin / wikivoyage-listings

Data extracted from Wikivoyage, the free travel guide at http://wikivoyage.org. Leverage Wikivoyage listings on your smartphone, or in your own mashups.
http://wvpoi.batalex.ru/

Implement bulk Wikidata validation using SPARQL service, closes #30 #48

Closed (zstojanovic closed 7 years ago)

zstojanovic commented 7 years ago

This PR adds a new BulkValidator interface and its first implementation, WikidataBulkValidator, which replaces WikidataValidator. WikidataBulkValidator validates QIDs in batches of 200 using the SPARQL service, checking that each QID actually exists and is not a redirect.
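For readers unfamiliar with bulk lookups over SPARQL, here is a minimal sketch of how one query could cover a whole batch of QIDs. It is not the code from this PR: the class name `QidBatchQuery` is hypothetical, and the sketch assumes that an existing, non-redirect item carries a `schema:version` triple in the query service while a redirected or missing QID does not. Under that assumption, any QID from the batch that is absent from the result set would be reported as invalid.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the PR.
class QidBatchQuery {

    // Builds a single SPARQL query that asks about every QID in the batch at once.
    // Assumption: only existing, non-redirect items match "?item schema:version ?version".
    static String build(List<String> qids) {
        String values = qids.stream()
                .map(qid -> "wd:" + qid)
                .collect(Collectors.joining(" "));
        return "PREFIX wd: <http://www.wikidata.org/entity/>\n"
             + "PREFIX schema: <http://schema.org/>\n"
             + "SELECT ?item WHERE {\n"
             + "  VALUES ?item { " + values + " }\n"
             + "  ?item schema:version ?version .\n"
             + "}";
    }
}
```

The caller would then compare the returned `?item` values against the batch it sent and flag whatever is missing.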

The PR also replaces the uses of WikidataValidator with WikidataBulkValidator in Main, ValidationReport, and ValidationTests.

I experimented with the SPARQL service and found that it can process around 400 QIDs in one request; beyond that, the server returns 413 (Request Entity Too Large). Unfortunately this limit is not documented and could change, so I suggest we use a conservative limit of 200, which seems like plenty to me.
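As an illustration of the batching described above, the following sketch splits a list of QIDs into chunks of at most 200. The class and method names (`Batches`, `partition`) are hypothetical and only show the idea; the PR may organize this differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PR.
class Batches {

    // Conservative limit suggested in the discussion; the observed ceiling was around 400.
    static final int BATCH_SIZE = 200;

    // Splits items into consecutive batches of at most `size` elements, preserving order.
    static <T> List<List<T>> partition(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += size) {
            int end = Math.min(start + size, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

With 450 QIDs and `BATCH_SIZE`, this yields three batches of 200, 200, and 50, so each request stays well under the size at which the 413 response was observed.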

The most interesting part was deciding on the design of the BulkValidator interface. The code that uses validators is designed to validate Listings one at a time and process the results immediately, which does not fit easily with bulk validation.
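One possible way to reconcile one-at-a-time callers with bulk validation is to buffer items and report results through a callback once a batch is full. The sketch below is an assumption for discussion, not the interface from this PR: `BulkValidatorSketch`, `BufferingValidator`, `submit`, and `flush` are all hypothetical names, and the remote check is abstracted as a function that returns the invalid items of a batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical interface shape, not the one defined in the PR.
interface BulkValidatorSketch<T> {
    void submit(T item); // queue an item; may trigger validation of a full batch
    void flush();        // validate whatever is still queued
}

class BufferingValidator<T> implements BulkValidatorSketch<T> {
    private final int batchSize;
    private final Function<List<T>, Set<T>> findInvalid; // one (remote) call per batch
    private final Consumer<T> onInvalid;                 // receives each invalid item
    private final List<T> pending = new ArrayList<>();

    BufferingValidator(int batchSize,
                       Function<List<T>, Set<T>> findInvalid,
                       Consumer<T> onInvalid) {
        this.batchSize = batchSize;
        this.findInvalid = findInvalid;
        this.onInvalid = onInvalid;
    }

    @Override
    public void submit(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    @Override
    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Take a snapshot and clear the queue before calling out.
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        Set<T> invalid = findInvalid.apply(batch);
        for (T item : batch) {
            if (invalid.contains(item)) {
                onInvalid.accept(item); // reported in submission order
            }
        }
    }
}
```

The trade-off in this design is that results arrive late: the caller must call `flush()` after the last item, and any reporting has to tolerate a delay of up to one batch.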

Let me know if this solution is appropriate, and we can discuss and tweak it if necessary.

nicolas-raoul commented 7 years ago

Thanks! It will take some time to check and test, thanks for your patience :-)

nicolas-raoul commented 7 years ago

I forgot to mention removing invalid Wikidata identifiers from the output results. Any idea how to implement this? :-) #49