akamhy / waybackpy

Wayback Machine API interface & a command-line tool
https://pypi.org/project/waybackpy/
MIT License
462 stars, 34 forks

Cdx() is not working for some URLs #69

Closed: dequeued0 closed this issue 3 years ago

dequeued0 commented 3 years ago

It works if I set the URL to your GitHub page, as in your example, but it does not work with the URL below.

I am using version 2.4.0.

>>> from waybackpy import Cdx
>>> user_agent = 'some user agent string'
>>> url = 'https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> cdx = Cdx(url=url, user_agent=user_agent)
>>> snapshots = cdx.snapshots()
>>> for i, snapshot in enumerate(snapshots, start=1):
...     snapshot_printer(i, snapshot)
... 
akamhy commented 3 years ago

I guess I need to stop using the pagination API when the total number of pages is <= 2.

Without the pagination API I can see 4 archives. https://web.archive.org/cdx/search/cdx?url=https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/&limit=1000
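
The fallback described in this comment could look roughly like the sketch below. It is not the code from waybackpy or from the fixing commit: `build_cdx_queries` and its arguments are hypothetical names, the page count is assumed to have been fetched already (for example with the CDX server's `showNumPages=true` query), and pages are assumed to be numbered from 0.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_queries(url, total_pages, limit=1000):
    """Return the CDX query URLs to request for `url` (hypothetical helper).

    When the pagination API reports very few pages (<= 2), skip pagination
    and issue a single plain query, as suggested in the comment above.
    """
    if total_pages <= 2:
        # Small result set: one non-paginated request with a row limit.
        params = {"url": url, "limit": limit}
        return [f"{CDX_ENDPOINT}?{urlencode(params)}"]

    # Larger result set: one request per page (assumed 0-based numbering).
    return [
        f"{CDX_ENDPOINT}?{urlencode({'url': url, 'page': page})}"
        for page in range(total_pages)
    ]
```

With `total_pages=1` this yields one query carrying `limit=1000` and no `page` parameter; with `total_pages=5` it yields five queries, `page=0` through `page=4`.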

akamhy commented 3 years ago

Fixed in https://github.com/akamhy/waybackpy/commit/a65990aee320ec401d9364801220dc044c22ebbc

For future reference: why do we use the pagination API at all if it can lag behind? Some users use this tool to retrieve a huge number (10M+) of archives from the Wayback Machine CDX API, and the pagination API can be queried from multiple threads without putting too much strain on the servers.
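
As an illustration of the multi-threaded use mentioned here, a minimal sketch using only the standard library follows. It assumes a caller-supplied `fetch_page(page_number)` callable that returns the list of CDX lines for one page; that callable, `fetch_all_pages`, and the worker count are hypothetical and not part of waybackpy.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_pages(fetch_page, total_pages, max_workers=4):
    """Fetch every page concurrently and return all lines in page order.

    `fetch_page` is a hypothetical callable taking a 0-based page number
    and returning a list of lines for that page.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map runs calls concurrently but yields results in
        # the order of the input page numbers.
        pages = pool.map(fetch_page, range(total_pages))
        return [line for page in pages for line in page]
```

Keeping `max_workers` small is one way to limit the load placed on the archive's servers.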