Wikidata / Wikidata-Toolkit

Java library to interact with Wikibase
https://www.mediawiki.org/wiki/Wikidata_Toolkit
Apache License 2.0

Provide useful implementation of fetchIsDone in JsonOnlineDumpFile #230

Open addshore opened 8 years ago

addshore commented 8 years ago

Even though the WMF does not make it easy to check for these dump files, it is still possible to check, or at least to try.

In my application at https://github.com/wikimedia/analytics-wmde-toolkit-analyzer, in https://github.com/wikimedia/analytics-wmde-toolkit-analyzer/blob/master/java/analyzer/src/main/java/org/wikidata/analyzer/Fetcher/DumpFetcher.java, I have a fallback through various dump locations. I would also like to add archive.org as the very last fallback here.

This is very hard because the final check on line 69 (https://github.com/wikimedia/analytics-wmde-toolkit-analyzer/blob/master/java/analyzer/src/main/java/org/wikidata/analyzer/Fetcher/DumpFetcher.java#L69) always returns true.

A rough draft of my archive.org DumpFile implementation can be seen at https://gerrit.wikimedia.org/r/#/c/282731/

mkroetzsch commented 8 years ago

It seems something has changed on the WMF servers without notice, breaking the JSON dump file lookup in WDTK. This must have happened in the past few days. Is your issue related to this? Pull requests are generally welcome. (Including @guenthermi, who ran into the JSON dump issue yesterday.)

addshore commented 8 years ago

This is not related to the issue I was reporting here; however, I may have just run into it. Redirects, perhaps?

https://github.com/wikimedia/analytics-wmde-toolkit-analyzer/blob/master/analyzer/src/main/java/org/wikidata/analyzer/Fetcher/RedirectFollowingWebResourceFetcherImpl.java

addshore commented 8 years ago

Also, regarding the archive.org lookup, I have actually implemented this at https://github.com/wikimedia/analytics-wmde-toolkit-analyzer/blob/master/analyzer/src/main/java/org/wikidata/analyzer/Fetcher/ArchiveOrgJsonOnlineDumpFile.java

And it can be seen in my fallback of dump sources at https://github.com/wikimedia/analytics-wmde-toolkit-analyzer/blob/master/analyzer/src/main/java/org/wikidata/analyzer/Fetcher/DumpFetcher.java#L89

In this code I simply call onlineDump.prepareDumpFile to check whether the dump file is actually there. This of course has the side effect of downloading the whole dump. It may be that my use case doesn't actually need a better implementation of fetchIsDone but instead an exists method!
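A minimal sketch of what such an exists check could look like, assuming the dump is served over HTTP(S) and the server answers HEAD requests sensibly. The class name RemoteFileProbe and its methods are hypothetical and not part of WDTK; only standard JDK calls are used.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical helper (not part of WDTK): probes a URL without downloading its body. */
public final class RemoteFileProbe {

    private RemoteFileProbe() {
    }

    /** Only 2xx status codes are treated as "the file is there". */
    public static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    /**
     * Sends a HEAD request, so the server returns headers but no body.
     * Returns false on any I/O error or for non-HTTP URLs.
     */
    public static boolean exists(URL url) {
        HttpURLConnection connection = null;
        try {
            URLConnection raw = url.openConnection();
            if (!(raw instanceof HttpURLConnection)) {
                return false; // this sketch only handles http and https URLs
            }
            connection = (HttpURLConnection) raw;
            connection.setRequestMethod("HEAD");
            // Follows redirects within one protocol; HttpURLConnection does not
            // follow redirects that switch between http and https.
            connection.setInstanceFollowRedirects(true);
            connection.setConnectTimeout(10_000);
            connection.setReadTimeout(10_000);
            return isSuccess(connection.getResponseCode());
        } catch (IOException e) {
            return false; // unreachable host, timeout, etc. are treated as "not there"
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }
}
```

Calling RemoteFileProbe.exists(...) with the URL of a dump file would then answer the "is it there?" question without transferring the dump itself. Whether a particular mirror handles HEAD requests this way is an assumption that would need checking per host.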

addshore commented 8 years ago

Possible cause https://lists.wikimedia.org/pipermail/wikitech-l/2016-April/085155.html ?

mkroetzsch commented 8 years ago

Merging #231 fixed the critical issue that no dumps could be downloaded. I guess the general aspect discussed here remains valid. Can you use master or do you also need a new release?

addshore commented 8 years ago

Well, I hadn't actually run into the issue we have just fixed when initially filing this ticket (they were totally separate). This ticket is for there to be some way to programmatically check to see if a dump is there without actually having to download it!

The way I would want to use this is:

class DumpFetcher {

    public Dump fetchDump(String dateStamp) {
        // Look for dumps stored locally for the given date.
        // If that fails, look on dumps.wm.org for the dump of the given date (but don't download it yet).
        // If no dump exists there, look on archive.org (but don't download it yet).
        return dump;
    }

}
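The fallback order described in the comments above can be sketched as follows. This is only an illustration under stated assumptions: DumpLocator, DumpSource and fixedSource are hypothetical names, not WDTK classes, and the exists check is assumed to be cheap (no download).

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: picks the first source that reports having a dump for a date. */
public final class DumpLocator {

    /** Hypothetical abstraction of one place a dump may live (local disk, a mirror, ...). */
    public interface DumpSource {
        String name();

        /** Cheap availability check; implementations must not download the dump. */
        boolean exists(String dateStamp);
    }

    private final List<DumpSource> sourcesInPriorityOrder;

    public DumpLocator(List<DumpSource> sourcesInPriorityOrder) {
        this.sourcesInPriorityOrder = sourcesInPriorityOrder;
    }

    /** Returns the first source holding the dump, or empty if none does. */
    public Optional<DumpSource> locate(String dateStamp) {
        for (DumpSource source : sourcesInPriorityOrder) {
            if (source.exists(dateStamp)) {
                return Optional.of(source);
            }
        }
        return Optional.empty();
    }

    /** Source with a fixed answer, useful for demonstrating the ordering. */
    public static DumpSource fixedSource(final String name, final boolean available) {
        return new DumpSource() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public boolean exists(String dateStamp) {
                return available;
            }
        };
    }
}
```

With sources listed as local, dumps.wikimedia.org, archive.org, the download would only start once locate has returned a source, so nothing is fetched while the fallbacks are being tried.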