akamhy / waybackpy

Wayback Machine API interface & a command-line tool
https://pypi.org/project/waybackpy/
MIT License

Question: Archiving a page will retrieve the latest (and outdated) copy of the page instead of saving it #96

Closed alicescfernandes closed 3 years ago

alicescfernandes commented 3 years ago

I know that the CLI tool has little to do with this, but maybe someone can point me in the right direction. I'm trying to archive a page, but instead of archiving a fresh copy, I'm getting an outdated copy of the page.

Any reason as to why this is happening?

akamhy commented 3 years ago

@alicescfernandes, I believe it's either because they (the Wayback Machine) are redirecting queued archives to older archives, or because of the code that checks for recent archives when a save request fails. If a page was saved recently, you have to wait 30 minutes before the Wayback Machine will capture another archive.

I can't reproduce the bug with the URLs I tried, but it would be great if you can answer the following questions.

akamhy commented 3 years ago

I was able to reproduce the error. Unfortunately, the Wayback Machine is redirecting save requests to older archives. I can't do anything about it; we just have to wait for the service to function properly again.

akamhy at device in ~
$ python
Python 3.9.1 (default, Jan 16 2021, 22:31:00) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import requests
>>> url = "https://stackoverflow.com/questions/24547655/get-utc-time-in-seconds"
>>> req_url = "https://web.archive.org/save/" + url 
>>> res = requests.get(req_url)
>>> res.status_code
200
>>> res.url
'https://web.archive.org/web/20191003050842/https://stackoverflow.com/questions/24547655/get-utc-time-in-seconds'
>>> res.headers
{'Server': 'nginx/1.19.5', 'Date': 'Fri, 16 Apr 2021 06:24:53 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'x-archive-orig-cache-control': 'private', 'x-archive-orig-x-frame-options': 'SAMEORIGIN', 'x-archive-orig-x-request-guid': '1a40552e-5301-48ba-9d3e-2c1c4a6e3973', 'x-archive-orig-strict-transport-security': 'max-age=15552000', 'x-archive-orig-feature-policy': "microphone 'none'; speaker 'none'", 'x-archive-orig-content-security-policy': "upgrade-insecure-requests; frame-ancestors 'self' https://stackexchange.com", 'x-archive-orig-accept-ranges': 'bytes, bytes', 'x-archive-orig-age': '0, 0', 'x-archive-orig-content-length': '163062', 'x-archive-orig-date': 'Thu, 03 Oct 2019 05:08:42 GMT', 'x-archive-orig-via': '1.1 varnish', 'x-archive-orig-connection': 'close', 'x-archive-orig-x-served-by': 'cache-sjc3139-SJC', 'x-archive-orig-x-cache': 'MISS', 'x-archive-orig-x-cache-hits': '0', 'x-archive-orig-x-timer': 'S1570079322.168218,VS0,VE90', 'x-archive-orig-vary': 'Fastly-SSL', 'x-archive-orig-x-dns-prefetch-control': 'off', 'x-archive-guessed-content-type': 'text/html', 'x-archive-guessed-charset': 'utf-8', 'memento-datetime': 'Thu, 03 Oct 2019 05:08:42 GMT', 'link': '<https://stackoverflow.com/questions/24547655/get-utc-time-in-seconds>; rel="original", <https://web.archive.org/web/timemap/link/https://stackoverflow.com/questions/24547655/get-utc-time-in-seconds>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://stackoverflow.com/questions/24547655/get-utc-time-in-seconds>; rel="timegate", <https://web.archive.org/web/20191003050842/https://stackoverflow.com/questions/24547655/get-utc-time-in-seconds>; rel="first memento"; datetime="Thu, 03 Oct 2019 05:08:42 GMT", <https://web.archive.org/web/20191003050842/https://stackoverflow.com/questions/24547655/get-utc-time-in-seconds>; rel="memento"; datetime="Thu, 03 Oct 2019 05:08:42 GMT", 
<https://web.archive.org/web/20191003050842/https://stackoverflow.com/questions/24547655/get-utc-time-in-seconds>; rel="last memento"; datetime="Thu, 03 Oct 2019 05:08:42 GMT"', 'content-security-policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'x-archive-src': 'liveweb-20191003050632/live-20191003042816-wwwb-app6.us.archive.org.warc.gz', 'server-timing': 'exclusion.robots;dur=0.243619, PetaboxLoader3.datanode;dur=167.453087, PetaboxLoader3.resolve;dur=54.738681, CDXLines.iter;dur=21.925210, esindex;dur=0.014215, RedisCDXSource;dur=2.485175, LoadShardBlock;dur=170.232801, load_resource;dur=113.330522, captures_list;dur=198.724729, exclusion.robots.policy;dur=0.229176', 'x-app-server': 'wwwb-app58', 'x-ts': '200', 'x-tr': '747', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20191003050842/https://stackoverflow.com/questions/24547655/get-utc-time-in-secondsIN', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Content-Encoding': 'gzip'}
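
One way to notice this kind of redirect programmatically is to compare the 14-digit timestamp in the final URL with the time of the save request. The following is only a minimal sketch: the helper names `capture_time` and `is_stale` and the 5-minute tolerance are assumptions made for illustration, not part of waybackpy or of the Wayback Machine API.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the 14-digit capture timestamp in a Wayback Machine snapshot URL.
_SNAPSHOT_RE = re.compile(r"https?://web\.archive\.org/web/(\d{14})/")


def capture_time(snapshot_url: str) -> datetime:
    """Return the capture time (UTC) encoded in a snapshot URL."""
    match = _SNAPSHOT_RE.match(snapshot_url)
    if match is None:
        raise ValueError(f"not a snapshot URL: {snapshot_url!r}")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def is_stale(snapshot_url: str, requested_at: datetime,
             tolerance: timedelta = timedelta(minutes=5)) -> bool:
    """True if the capture predates the save request by more than `tolerance`."""
    # The tolerance value is an arbitrary choice for this sketch.
    return requested_at - capture_time(snapshot_url) > tolerance


# Values taken from the session above: res.url and the 'Date' response header.
final_url = ("https://web.archive.org/web/20191003050842/"
             "https://stackoverflow.com/questions/24547655/get-utc-time-in-seconds")
requested_at = datetime(2021, 4, 16, 6, 24, 53, tzinfo=timezone.utc)

print(capture_time(final_url).isoformat())  # 2019-10-03T05:08:42+00:00
print(is_stale(final_url, requested_at))    # True
```

In the session above the capture is about a year and a half older than the request, so the check flags it as stale.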
alicescfernandes commented 3 years ago

Thanks for taking a look at it! I suspected it was related to the web archive itself, but really had no idea why it was happening.

alicescfernandes commented 3 years ago

Hi again! @akamhy, something weird is happening with the service. As of today, the service is kind of working.

It still redirects me to a cached copy whenever I try the save now widget, but it's also caching a new one. The new copy doesn't appear immediately, but if you check the calendar over the following days, you can see that even though it was returning that old URL, it was still caching a new copy.

TheTechRobo commented 1 year ago

> It still redirects me to a cached copy whenever I try the save now widget, but it's also caching a new one. The new copy doesn't appear immediately, but if you check the calendar over the following days, you can see that even though it was returning that old URL, it was still caching a new copy.

I'm late but this is why...

Sometimes when the WBM's servers are overloaded (?) or if you specifically request it, the Wayback Machine will not index the page for up to 12 hours. This is normal and happens occasionally.

The reason you're then seeing an older copy is that the Wayback Machine uses date fuzzing. For example, you could use 2006 as the timestamp, as in https://web.archive.org/web/2006/https://hello.ca, to get the capture closest to 2006. When you go to the archived page, because the new capture isn't indexed yet, it fuzzes the date to the nearest indexed capture, which is usually the one directly before.
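
The date fuzzing described above can be modeled roughly as "pad the requested timestamp, then pick the closest capture that is already indexed". This is only a sketch of that idea, not the Wayback Machine's actual implementation: `expand_timestamp`, `nearest_capture`, the padding rule, and the second capture in the example index are all assumptions made for illustration.

```python
from datetime import datetime

# YYYYMMDDhhmmss; in this sketch a missing month or day falls back to 01.
_DEFAULTS = "00000101000000"


def expand_timestamp(partial: str) -> datetime:
    """Pad a 4- to 14-digit timestamp (e.g. '2006') and parse it."""
    if not partial.isdigit() or not 4 <= len(partial) <= 14:
        raise ValueError(f"expected 4 to 14 digits, got {partial!r}")
    return datetime.strptime(partial + _DEFAULTS[len(partial):], "%Y%m%d%H%M%S")


def nearest_capture(requested: datetime, indexed: list) -> datetime:
    """Return the indexed capture closest in time to `requested`."""
    return min(indexed, key=lambda capture: abs(capture - requested))


# Hypothetical index: a fresh capture was made but is not listed yet.
indexed = [datetime(2019, 10, 3, 5, 8, 42), datetime(2020, 6, 1, 12, 0, 0)]
fresh = expand_timestamp("20210416062453")

print(nearest_capture(fresh, indexed))                     # 2020-06-01 12:00:00
print(nearest_capture(expand_timestamp("2006"), indexed))  # 2019-10-03 05:08:42
```

Because the fresh capture is missing from the index, the lookup falls back to the newest capture that is listed, which matches the "one directly before" behavior.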

FanaticExplorer commented 11 months ago

Same problem for me. I wanted to use waybackpy in a Telegram bot built on aiogram 3.x, but the bot doesn't do anything. For example, this is my code:

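from aiogram.filters import Command
from aiogram.types import Message
from waybackpy import WaybackMachineSaveAPI

# dp is the bot's aiogram Dispatcher, created elsewhere.
@dp.message(Command("snapshot"))
async def cmd_start(message: Message):
    # ...
    link_save = WaybackMachineSaveAPI(url, user_agent)
    saved_link = link_save.save()  # blocking network call inside an async handler
    formatted_datetime = link_save.timestamp().strftime("%H:%M:%S, %d %B %Y")
    await message.answer(f"The snapshot of {url} was saved. \nTime: {formatted_datetime}.", disable_web_page_preview=True)
    await message.answer(f"{saved_link}")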

And even if the user sent the command at 19:14, it still says: Time: 15:24:52, 07 August 2023, and gives the outdated version of the page. The website itself also has only the old copy.
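
Given the behavior discussed earlier in this thread, a bot could at least tell the user when the returned snapshot is older than the command. The sketch below assumes the capture time is a naive UTC datetime (as a value parsed from a Wayback Machine URL would be); `snapshot_message`, the 5-minute tolerance, the example URL, and the request time are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta


def snapshot_message(url: str, captured_at: datetime, requested_at: datetime,
                     tolerance: timedelta = timedelta(minutes=5)) -> str:
    """Build a reply that says whether the snapshot is fresh or an older one."""
    stamp = captured_at.strftime("%H:%M:%S, %d %B %Y")
    if requested_at - captured_at > tolerance:
        # The served capture predates the request: report it as such.
        return (f"The Wayback Machine returned an older snapshot of {url}.\n"
                f"Captured: {stamp} UTC. A new capture may still be pending.")
    return f"The snapshot of {url} was saved.\nTime: {stamp} UTC."


# Hypothetical values: both datetimes are naive and expressed in UTC.
captured_at = datetime(2023, 8, 7, 15, 24, 52)
requested_at = datetime(2023, 8, 7, 19, 14, 0)
print(snapshot_message("https://example.com/", captured_at, requested_at))
```

In a real handler, `requested_at` would be taken when the command arrives and converted to UTC before the comparison, since a local-time clock can by itself make the two times look hours apart.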

@akamhy (sorry for the ping, I'm just afraid you won't see this otherwise, because this issue is quite old)