av1d / cdx-tools

a collection of tools for working with the Wayback Machine CDX server
GNU General Public License v3.0

Seems to miss some files #1

Open mariomadproductions opened 3 weeks ago

mariomadproductions commented 3 weeks ago

e.g. cdxpress --url nintendo.com --scan=.mp3 misses https://web.archive.org/web/20140905064931/http://supermario3dworld.nintendo.com/_ui/audio/RedcBGM_07.mp3

av1d commented 2 weeks ago

Thanks for bringing this to my attention. This was an interesting issue and a learning experience.

The issue is that cdxpress pulls only captures with an HTTP 200 status, which I wrongly assumed was sufficient. If we pull your MP3 link by using this query (warning: 700MB JSON file), we can see it is stored on the server with a 302 status:

["com,nintendo)/_ui/audio/redcbgm_07.mp3","20190821133045","https://www.nintendo.com/_ui/audio/RedcBGM_07.mp3","text/html","302","BFCSGPOGIG3F4IURX5LVNGPBSRAS2ZY3","761"],
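
For anyone who wants to inspect rows like this one, here is a minimal sketch in Python. It assumes the default CDX JSON field order (urlkey, timestamp, original, mimetype, statuscode, digest, length); the `CdxRow` type and `parse_cdx_row` helper are hypothetical names for illustration and are not part of cdxpress.

```python
import json
from typing import NamedTuple


class CdxRow(NamedTuple):
    """One capture row, assuming the default CDX JSON field order."""
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    length: str


def parse_cdx_row(line: str) -> CdxRow:
    """Parse a single JSON array row copied out of the raw dump.

    Rows inside the dump end with a trailing comma, so strip it first.
    """
    return CdxRow(*json.loads(line.strip().rstrip(",")))


# The row quoted above, pasted verbatim (including its trailing comma).
row = parse_cdx_row(
    '["com,nintendo)/_ui/audio/redcbgm_07.mp3","20190821133045",'
    '"https://www.nintendo.com/_ui/audio/RedcBGM_07.mp3","text/html",'
    '"302","BFCSGPOGIG3F4IURX5LVNGPBSRAS2ZY3","761"],'
)
print(row.statuscode, row.mimetype)
```

The fifth field is the status code, which is why a filter on 200 alone never sees this capture.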

Therefore, the file is omitted from the results at the server level because cdxpress was filtering by HTTP 200 status only. Unfortunately the fix isn't an easy one. CDX doesn't allow stacking of statuscode filters. This was tested with the following queries, none of which returned results:

av1d@superscape[~/test/cdxpress]$ curl 'https://web.archive.org/cdx/search/cdx?url=nintendo.com&output=json&limit=2&filter=statuscode:200&filter=statuscode:206&filter=statuscode:301&filter=statuscode:302&filter=statuscode:303&filter=statuscode:307&filter=statuscode:308' -o test4.json
av1d@superscape[~/test/cdxpress]$ curl 'https://web.archive.org/cdx/search/cdx?url=nintendo.com&output=json&filter=statuscode:200&filter=statuscode:206&filter=statuscode:301&filter=statuscode:302&filter=statuscode:303&filter=statuscode:307&filter=statuscode:308' -o test4.json
av1d@superscape[~/test/cdxpress]$ curl 'https://web.archive.org/cdx/search/cdx?url=nintendo.com&output=json&limit=2&filter=statuscode:200,statuscode:206,statuscode:301,statuscode:302,statuscode:303,statuscode:307,statuscode:308' -o test4.json
av1d@superscape[~/test/cdxpress]$ curl "https://web.archive.org/cdx/search/cdx?url=nintendo.com&output=json&filter=statuscode:200,statuscode:206,statuscode:301,statuscode:302,statuscode:303,statuscode:307,statuscode:308" -o test4.json
av1d@superscape[~/test/cdxpress]$ curl 'https://web.archive.org/cdx/search/cdx?url=nintendo.com&output=json&limit=2&filter=statuscode:200,201,301,302,303,307,308' -o test4.json
av1d@superscape[~/test/cdxpress]$ curl 'https://web.archive.org/cdx/search/cdx?url=nintendo.com&output=json&filter=statuscode:200,201,301,302,303,307,308' -o test4.json

In other words, the logic to handle this case will need to be done in code. The side effect is that the dataset being downloaded will be much larger when used on a large domain. Therefore, when I fix it, I will make this the default behavior and keep the original method as a "speed" option, with a disclaimer that it may miss files from queries.
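
A rough sketch of what that code-side handling could look like, in Python. This is not the actual cdxpress implementation: the `scan_rows` function and its `fast` flag are hypothetical, the status list is simply the one from the test queries above, and rows are assumed to use the default CDX field order (original URL at index 2, status code at index 4).

```python
from typing import Iterable, Iterator, Sequence

# Status codes taken from the test queries above (an assumption about
# which ones are worth keeping, not a confirmed cdxpress setting).
WANTED_STATUS = frozenset({"200", "206", "301", "302", "303", "307", "308"})


def scan_rows(
    rows: Iterable[Sequence[str]], suffix: str, fast: bool = False
) -> Iterator[Sequence[str]]:
    """Yield rows whose original URL ends with `suffix`.

    fast=True mimics the original behavior (200 only) and may miss files.
    """
    allowed = frozenset({"200"}) if fast else WANTED_STATUS
    wanted_suffix = suffix.lower()
    for row in rows:
        if len(row) < 5:
            continue  # skip malformed or empty rows
        original, status = row[2], row[4]
        # The JSON header row has "statuscode" in this slot, so it is
        # dropped here as well.
        if status in allowed and original.lower().endswith(wanted_suffix):
            yield row


sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    # Row from this thread (stored with a 302 status).
    ["com,nintendo)/_ui/audio/redcbgm_07.mp3", "20190821133045",
     "https://www.nintendo.com/_ui/audio/RedcBGM_07.mp3", "text/html",
     "302", "BFCSGPOGIG3F4IURX5LVNGPBSRAS2ZY3", "761"],
    # Made-up row, for illustration only.
    ["com,example)/index.html", "20200101000000",
     "https://example.com/index.html", "text/html", "200", "HYPOTHETICALDIGEST", "1234"],
]

default_hits = list(scan_rows(sample, ".mp3"))
fast_hits = list(scan_rows(sample, ".mp3", fast=True))
print(len(default_hits), len(fast_hits))
```

With the default setting the 302 capture is kept; with `fast=True` it is dropped, which is the trade-off the "speed" option would carry.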

I don't have an ETA on this fix, but I will try to get to it within the next week. Thanks for bringing this up; it was very informative and useful.

mariomadproductions commented 2 weeks ago

Thanks. Yeah, that's annoying, but this sounds like a good solution.