akamhy / waybackpy

Wayback Machine API interface & a command-line tool
https://pypi.org/project/waybackpy/
MIT License
462 stars 34 forks

get() doesn't return the correct version of archives #68

Closed by dequeued0 3 years ago

dequeued0 commented 3 years ago

get() is fetching the wrong version of archives.

I've seen this happen to multiple different archives and it is a reproducible issue. Here is an example:

>>> import waybackpy
>>> user_agent = "something goes here"
>>> url = 'https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> url_object = waybackpy.Url(url, user_agent)
>>> archive = url_object.newest()
>>> str(archive)
'https://web.archive.org/web/20210111103734/https://old.reddit.com/r/personalfinance/comments/kr3pbk/kingz_forex_academy_come_and_learn_how_to_trade/'
>>> archive.get().find("[removed]")
50479
>>> archive.get().find("become your own BOSS")
-1

However, that archive version does not contain the string [removed]; it should instead show the spam text containing the phrase become your own BOSS.

If you fetch the page separately, you can see that [removed] is not present and the expected text is there.

>>> import requests
>>> response = requests.get(str(archive))
>>> response.text.find("[removed]")
-1
>>> response.text.find("become your own BOSS")
61087

For some reason, get() seems to be fetching an older version of the archive.
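
To make the comparison above repeatable, a small helper along these lines could check both response bodies for the same markers. This is only a sketch: `marker_positions` is a hypothetical name, not part of waybackpy or requests, and the sample strings are stand-ins for the real page bodies.

```python
def marker_positions(text, markers):
    """Map each marker to its first index in text (-1 if absent), using str.find."""
    return {marker: text.find(marker) for marker in markers}


markers = ["[removed]", "become your own BOSS"]

# Stand-in strings (assumption); in practice these would be archive.get()
# and requests.get(str(archive)).text from the session above.
body_from_get = "<p>[removed]</p>"
body_from_requests = "<p>become your own BOSS</p>"

print(marker_positions(body_from_get, markers))
print(marker_positions(body_from_requests, markers))
```

If both calls fetched the same snapshot, the two dictionaries should agree on which markers are present.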

akamhy commented 3 years ago

fixed in https://github.com/akamhy/waybackpy/commit/6142e0b353dbbc233a04b03fbabbf5232d80d9d6

The get method was defaulting to the URL passed to waybackpy.Url(url, user_agent); I changed it to fetch the last retrieved archive instead.
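
The change described above can be illustrated with a simplified model. This is not waybackpy's actual code: `SketchUrl`, `fetch`, and `fixed` are hypothetical names, the URLs are placeholders, and the fetch function is injected so the example runs without network access.

```python
class SketchUrl:
    """Toy model of the described behaviour; not waybackpy's real implementation."""

    def __init__(self, url, fetch, fixed=True):
        self.url = url            # URL originally passed in
        self.archive_url = None   # set once an archive has been retrieved
        self._fetch = fetch       # hypothetical injected fetcher: url -> text
        self._fixed = fixed

    def newest(self, archive_url):
        # Stand-in for looking up the newest snapshot; just records its URL.
        self.archive_url = archive_url
        return self

    def get(self, url=None):
        if url is None:
            if self._fixed and self.archive_url is not None:
                url = self.archive_url  # after the fix: last retrieved archive
            else:
                url = self.url          # before the fix: the original URL
        return self._fetch(url)


# Placeholder URLs and page bodies (assumptions for illustration only).
live = "https://example.com/post"
archive = "https://web.archive.org/web/2021/https://example.com/post"
pages = {live: "[removed]", archive: "become your own BOSS"}

buggy = SketchUrl(live, pages.__getitem__, fixed=False).newest(archive)
fixed = SketchUrl(live, pages.__getitem__, fixed=True).newest(archive)

print(buggy.get())  # body of the original URL
print(fixed.get())  # body of the archived snapshot
```

In this model the pre-fix object returns the body of the original URL even after newest() has been called, which would match the [removed] text seen in the report.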