Wikidata / Wikidata-Toolkit

Java library to interact with Wikibase
https://www.mediawiki.org/wiki/Wikidata_Toolkit
Apache License 2.0
372 stars 100 forks source link

Fix online dump downloads for the various modes and Closes #869 #872

Closed TheEaterr closed 2 months ago

TheEaterr commented 4 months ago

It seems that there were changes in the way dumps are structured for wikidata making the way dump files were automatically downloaded not functional anymore.

Here are the fixes implemented, and the reasoning behind it : DAILY : update the check for an available dump as status.txt now indicated "done:all". See : https://dumps.wikimedia.org/other/incr/wikidatawiki/20240419/status.txt I put startsWith but it might be preferable to check for done:all equality ? I don't know the various possible values but this version should work at least most of the time (compared to none right now). JSON : The JSON dumps are currently downloaded from this folder : https://dumps.wikimedia.org/other/wikidata/ It is a bit of a mess including both full and incremental (?) dumps. I switch it to use https://dumps.wikimedia.org/wikidatawiki/entities/ (which seems to be equivalent to https://dumps.wikimedia.org/other/wikibase/wikidatawiki/, I don't know if anyone has any insight into which link should be preferred). Moreover, not all shown dates feature the full json dump so I added a check that the file is actually there (I have not found a status.txt or equivalent for the entitity dump) SITES : Nothing done, haven't tested it but the file it is looking for is still there so it probably still works. CURRENT / FULL : Both of these look for a -pages-meta-current.xml.bz2 / -pages-meta-history.xml.bz2 inside a dump (like https://dumps.wikimedia.org/wikidatawiki/20240401/). However these files do not seem to be there. Nevertheless, parts of these file ARE present (of form https://dumps.wikimedia.org/wikidatawiki/20240401/wikidatawiki-20240401-pages-meta-current1.xml-p1p441397.bz2 for example) but not the united file. If anyone knows why the full file isn't there I could try to fix the issue

TheEaterr commented 4 months ago

I made a new commit that uses head requests instead. I had to tap into the specific semantics of the WebRessourceFetcherImpl to change the type of the request, maybe a more broad set method should be added to the WebRessourceFetcher interface ?

Also I updated the test that failed by instead using the real - and not the mock - WebRessourceFetcher and changing around what is expected. I imagine using MockWebRessourceFetcher was intentional but for this specific use case it feels weird, as the part that broke is the online part and I feel like the test should catch that. Else it is probably possible to change the code around a bit to make it work with mocks. If checking online is desirable, it would also be logical to update the other tests.

wetneb commented 3 months ago

Thanks a lot!

It looks like the new version of the test does not pass. In general, I would say it would be better to avoid making any requests to online services in the test suite, to keep it reliable. It is true that this prevents us from checking compatibility with the online service as it evolves, but in my opinion that's not the role of such a Java unit test suite. I don't know if this would have helped in this case, but it is also possible to change the mocking strategy and use a MockWebServer to test against a real but local web server serving pre-determined responses.

TheEaterr commented 2 months ago

Hey so actually the test that failed wasn't the test I fixed that uses online, but another one that I missed when developing locally. The online test does work (as far as I understood the CI output), and I feel makes sense as the getAllJsonDumps online functionality is a core part of it, and the test should be expected to fail if upstream changes. So if all test pass (hopefully) I think it's better this way but if you disagree I'll switch it back to an offline test.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 83.11%. Comparing base (637b457) to head (b660991). Report is 233 commits behind head on master.

Files Patch % Lines
...ikidata/wdtk/dumpfiles/wmf/JsonOnlineDumpFile.java 83.33% 2 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #872 +/- ## ============================================ + Coverage 80.92% 83.11% +2.18% - Complexity 2090 2763 +673 ============================================ Files 149 183 +34 Lines 7278 8778 +1500 Branches 897 1163 +266 ============================================ + Hits 5890 7296 +1406 - Misses 1119 1169 +50 - Partials 269 313 +44 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

wetneb commented 2 months ago

@TheEaterr thanks a lot! I see codecov is complaining about missing diff coverage. If you think this can be addressed, we can do that, otherwise happy to merge it as such, I don't want to stand in the way.

TheEaterr commented 2 months ago

Code coverage was right, the function should've been tested. I added a test it should be good now!