akamhy / waybackpy

Wayback Machine API interface & a command-line tool
https://pypi.org/project/waybackpy/
MIT License
465 stars, 34 forks

Parameter to filter out redirects from one result methods #173

Open Forage opened 2 years ago

Forage commented 2 years ago

Is your feature request related to a problem? Please describe.
Methods near, oldest and newest return whatever snapshot is available, regardless of its type. That includes redirects, which aren't that useful in a lot of cases.

Describe the solution you'd like
Methods near, oldest and newest would be a lot more efficient to use if they let you filter out redirects directly with an additional parameter. The same goes for errors, for that matter: one option to exclude all 30x, 40x and 50x response codes, leaving only 200 codes.

ArztKlein commented 2 years ago

I think that's already possible with the 'filters' parameter of the WaybackMachineCDXServerAPI class.

One example given in the tests file is:

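from waybackpy import WaybackMachineCDXServerAPI

user_agent = "foobar"  # placeholder user agent string
# Only keep snapshots whose CDX status code is 200.
cdx = WaybackMachineCDXServerAPI(
    url="google.com",
    user_agent=user_agent,
    filters=["statuscode:200"],
)

Forage commented 2 years ago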

Yes, but that means using Python and performing multiple steps: preparing, getting the results, looping through them. The one-line "near, oldest, newest" methods are handy for avoiding all that, especially through the CLI.

The filter CLI argument is ignored when using one of those methods, if I'm not mistaken. So either those methods could take the filter into account, for the most flexibility, or a quick additional "okonly" argument could be introduced.
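A minimal sketch of what such an "okonly" option could amount to, assuming the snapshot records have already been fetched from the CDX API as (timestamp, statuscode) pairs. The function names `pick_oldest`, `pick_newest` and `pick_near` and the sample records are hypothetical; none of this is part of waybackpy.

```python
from datetime import datetime

# Hypothetical sample records: (14-digit Wayback timestamp, CDX status code).
SNAPSHOTS = [
    ("20101010000314", "200"),
    ("20101010003320", "301"),
    ("20101010011249", "200"),
    ("20101011031541", "302"),
]


def _parse(timestamp):
    """Convert a 14-digit Wayback timestamp into a datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def only_ok(snapshots):
    """Keep only the records whose CDX status code is exactly 200."""
    return [s for s in snapshots if s[1] == "200"]


def pick_oldest(snapshots):
    """Oldest 200 snapshot, or None if there is none."""
    ok = only_ok(snapshots)
    return min(ok, key=lambda s: s[0]) if ok else None


def pick_newest(snapshots):
    """Newest 200 snapshot, or None if there is none."""
    ok = only_ok(snapshots)
    return max(ok, key=lambda s: s[0]) if ok else None


def pick_near(snapshots, target):
    """200 snapshot closest in time to `target`, or None if there is none."""
    ok = only_ok(snapshots)
    if not ok:
        return None
    wanted = _parse(target)
    return min(ok, key=lambda s: abs(_parse(s[0]) - wanted))
```

With the sample data, `pick_newest(SNAPSHOTS)` skips the later 302 record and returns the last record with status 200.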

akamhy commented 2 years ago

For URLs with status code 200 only:

akamhy@device:~$  waybackpy  --url google.com --user-agent "foobar" --cdx --cdx-filter "statuscode:200" --limit "10" --start-timestamp "20101010" --cdx-print "archiveurl" --cdx-print "statuscode"
200 https://web.archive.org/web/20101010000314/http://www.google.com/
200 https://web.archive.org/web/20101010011249/http://www.google.com/
200 https://web.archive.org/web/20101010042108/http://www.google.com/
200 https://web.archive.org/web/20101010043106/http://www.google.com/
200 https://web.archive.org/web/20101010044436/http://www.google.com/
200 https://web.archive.org/web/20101010053035/http://www.google.com/
200 https://web.archive.org/web/20101010054150/http://www.google.com/
200 https://web.archive.org/web/20101010061344/http://www.google.com/
200 https://web.archive.org/web/20101010063445/http://www.google.com/
200 https://web.archive.org/web/20101010082449/http://www.google.com/
200 https://web.archive.org/web/20101010091719/http://www.google.com/
200 https://web.archive.org/web/20101010091734/http://www.google.com/
200 https://web.archive.org/web/20101010091920/http://www.google.com/
200 https://web.archive.org/web/20101010092939/http://www.google.com/

Non-200 status code:

akamhy@device:~$  waybackpy  --url google.com --user-agent "foobar" --cdx --cdx-filter \!statuscode:200 --limit "10" --start-timestamp "20101010" --cdx-print "archiveurl" --cdx-print "statuscode"  
301 https://web.archive.org/web/20101010003320/http://google.com/
301 https://web.archive.org/web/20101010042732/http://google.com/
301 https://web.archive.org/web/20101010101435/http://google.com/
301 https://web.archive.org/web/20101010110520/http://google.com/
301 https://web.archive.org/web/20101010111101/http://google.com/
301 https://web.archive.org/web/20101010162008/http://google.com/
- https://web.archive.org/web/20101011010719/http://google.com/
302 https://web.archive.org/web/20101011031541/http://www.google.com/
301 https://web.archive.org/web/20101011094854/http://google.com/
302 https://web.archive.org/web/20101011103045/http://www.google.com/
302 https://web.archive.org/web/20101011103404/http://www.google.com/
302 https://web.archive.org/web/20101011125706/http://www.google.com/
302 https://web.archive.org/web/20101011130420/http://www.google.com/
302 https://web.archive.org/web/20101011130758/http://www.google.com/
302 https://web.archive.org/web/20101011145009/http://www.google.com/
302 https://web.archive.org/web/20101011150448/http://www.google.com/
301 https://web.archive.org/web/20101012023319/http://google.com/
301 https://web.archive.org/web/20101012043932/http://google.com/
301 https://web.archive.org/web/20101012045200/http://google.com/
301 https://web.archive.org/web/20101012072233/http://google.com/
302 https://web.archive.org/web/20101012080016/http://www.google.com/
302 https://web.archive.org/web/20101012082545/http://www.google.com/
302 https://web.archive.org/web/20101012113351/http://www.google.com/
302 https://web.archive.org/web/20101012114314/http://www.google.com/
302 https://web.archive.org/web/20101012114658/http://www.google.com/
302 https://web.archive.org/web/20101012114803/http://www.google.com/
302 https://web.archive.org/web/20101012115016/http://www.google.com/
302 https://web.archive.org/web/20101012115409/http://www.google.com/
301 https://web.archive.org/web/20101012142403/http://google.com/
302 https://web.archive.org/web/20101012153200/http://www.google.com/

akamhy commented 2 years ago

> Methods near, oldest and newest would be a lot more efficient to use if they let you filter out redirects directly with an additional parameter. The same goes for errors, for that matter: one option to exclude all 30x, 40x and 50x response codes, leaving only 200 codes.

And how do we detect redirects (status 302 Found) and other statuses? By actually visiting the archive to check, or by just reading the CDX data for the archive?

Forage commented 2 years ago

> > Methods near, oldest and newest would be a lot more efficient to use if they let you filter out redirects directly with an additional parameter. The same goes for errors, for that matter: one option to exclude all 30x, 40x and 50x response codes, leaving only 200 codes.

> And how do we detect redirects (status 302 Found) and other statuses? By actually visiting the archive to check, or by just reading the CDX data for the archive?

By relying on the CDX status code, yes: 200 or not 200.
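As a rough illustration of relying on the CDX status code alone, the sketch below sorts the status field into coarse categories without visiting the archived page. The function name `classify_status` and the category labels are hypothetical. The "-" case is handled because the CDX output earlier in this thread shows a "-" in the status column for one snapshot.

```python
def classify_status(statuscode):
    """Map a CDX statuscode field (a string) to a coarse category."""
    if not statuscode.isdigit():
        # The status column can hold "-" instead of a number.
        return "unknown"
    code = int(statuscode)
    if code == 200:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 600:
        return "error"
    return "other"
```

An "okonly" style filter would then keep just the records for which `classify_status` returns "ok".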

But yes, you are right, the example you gave could do the trick as well. I'd be happy to use that if limiting to one result worked, but it looks like the limit parameter is ignored completely.

akamhy commented 2 years ago

The limit is not ignored; it is a CDX API parameter that limits the number of archive records returned per API call when using the non-paginated CDX API. See https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#query-result-limits.
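To make the role of the parameter concrete, here is a small sketch that builds a CDX query URL by hand. It assumes the public endpoint http://web.archive.org/cdx/search/cdx and the `url`, `filter`, `fl` and `limit` query parameters described in the linked CDX server documentation. The helper name `build_cdx_url` is hypothetical, and this is not how waybackpy builds its requests.

```python
from urllib.parse import urlencode

# Assumed public CDX endpoint.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, limit=None, filters=(), fields=()):
    """Build a CDX query URL; only the options that are set are included."""
    params = [("url", url)]
    for cdx_filter in filters:
        # The "filter" parameter may be repeated, one entry per filter.
        params.append(("filter", cdx_filter))
    if fields:
        params.append(("fl", ",".join(fields)))
    if limit is not None:
        params.append(("limit", str(limit)))
    return CDX_ENDPOINT + "?" + urlencode(params)
```

For example, `build_cdx_url("google.com", limit=1, filters=["statuscode:200"])` produces a query that asks the server for a single record with status 200.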

Forage commented 2 years ago

Maybe I'm misunderstanding its purpose, but unlike with your example, where I get a lot more than the set limit of 10 results, when I call the API directly as in the API docs I only get as many results as the limit I set: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=2
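One way to check how many records a direct call returns is to parse the plain-text response and count the rows. The sketch below assumes the default space-separated CDX output with the seven fields urlkey, timestamp, original, mimetype, statuscode, digest and length. The helper name `parse_cdx_lines` and the sample response (including its digests) are hypothetical, not real API output.

```python
# Assumed default field order of the plain-text CDX output.
CDX_FIELDS = (
    "urlkey", "timestamp", "original", "mimetype",
    "statuscode", "digest", "length",
)


def parse_cdx_lines(text):
    """Turn a plain-text CDX response into a list of dicts, one per record."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(CDX_FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(CDX_FIELDS, parts)))
    return rows


# Hypothetical two-record response, shaped like a limit=2 query result.
SAMPLE = (
    "org,archive)/ 19970126045828 http://www.archive.org:80/ text/html 200 EXAMPLEDIGESTAAAA 1415\n"
    "org,archive)/ 19971011050034 http://www.archive.org:80/ text/html 301 EXAMPLEDIGESTBBBB 1402\n"
)

rows = parse_cdx_lines(SAMPLE)
```

Here `len(rows)` gives the number of records actually returned, which can be compared against the requested limit.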