ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
https://archivebox.io
MIT License

Feature Request: Archive paywalled articles #1507

Open chris-fj opened 2 months ago

chris-fj commented 2 months ago

What is the problem that your feature request solves

When you archive a paywalled article or news piece, what gets archived is the paywall page rather than the content. I was wondering whether there's a way to set the user agent, or some other solution, that retrieves the full content the way other archiving services do, for example archive.is.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

When archiving a web page hidden behind a paywall, archive the un-paywalled version. Example: original vs archived

What hacks or alternative solutions have you tried to solve the problem?

I have set the user agent to exactly the one used by archive.is, but it doesn't seem to work.

How badly do you want this new feature?

I'm fairly knowledgeable in Python but wouldn't know where to start looking. I could try helping, but I would surely need guidance.

virtadpt commented 2 months ago

One of the easiest ways to do this would be to change the user agents to those of one of the more popular search engine crawlers. I've had a lot of success with this over the years.

chris-fj commented 2 months ago

Can you suggest some? I found a webpage listing common UAs for those crawlers and tried all of them, to no avail.

virtadpt commented 2 months ago

A few out of my website's access logs from the last 24 hours, which I find is the best place to find them:

"Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"

"Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"

"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36"

"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.6533.119 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

pirate commented 2 months ago

Every paywall is different. Some are easy to bypass by just disabling JS or changing the User Agent; others are much harder and effectively require paying for an account and reusing the paying account's credentials for archiving.
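
Whether a particular site falls into the "easy" category can be probed by fetching the same article with several user agents and comparing how much content comes back. The following is a minimal sketch using only the Python standard library, not part of ArchiveBox. The names `fetch_with_user_agent` and `compare_user_agents` are hypothetical, and the two user agent strings are the bingbot and Googlebot samples posted earlier in this thread. A larger response is only a rough hint that more of the article was served. Many sites also verify crawler IP addresses, so a spoofed user agent alone may change nothing.

```python
import urllib.request
from typing import Callable, Dict

# Sample crawler user agents quoted earlier in this thread.
CRAWLER_USER_AGENTS: Dict[str, str] = {
    "bingbot": (
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
        "bingbot/2.0; +http://www.bing.com/bingbot.htm) "
        "Chrome/116.0.1938.76 Safari/537.36"
    ),
    "googlebot": (
        "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.6533.119 "
        "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
        "+http://www.google.com/bot.html)"
    ),
}


def fetch_with_user_agent(url: str, user_agent: str, timeout: float = 30.0) -> bytes:
    """Fetch `url` once, sending the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def compare_user_agents(
    url: str,
    user_agents: Dict[str, str],
    fetch: Callable[[str, str], bytes] = fetch_with_user_agent,
) -> Dict[str, int]:
    """Return {label: response size in bytes}; -1 marks a failed request."""
    sizes: Dict[str, int] = {}
    for label, user_agent in user_agents.items():
        try:
            sizes[label] = len(fetch(url, user_agent))
        except OSError:
            # urllib's URLError/HTTPError and socket timeouts all derive from OSError.
            sizes[label] = -1
    return sizes
```

Calling `compare_user_agents("https://example.com/some-article", CRAWLER_USER_AGENTS)` (placeholder URL) returns one size per label. If one user agent yields a clearly larger page, that string is a candidate for ArchiveBox's per-tool user agent settings (`CURL_USER_AGENT`, `WGET_USER_AGENT`, `CHROME_USER_AGENT` in the 0.7.x configuration docs; names may differ in other releases).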

Bypassing social media / news media CAPTCHAs, login walls, and fingerprinting is one of the services I provide to paying clients (because it's a lot of work to maintain, and I can't open source too much or my methods will just get blocked): https://docs.monadical.com/s/archivebox-consulting-services

Personally I pay for a few key news sites and re-use those credentials for archiving, and for stuff I don't pay for, a few can still be archived in text-only form by readability, but some do actually fail and I only get the paywall page archived. I think it will forever be a cat-and-mouse game because it makes sense for the media sites to invest in preventing people getting their stuff for free.

huyz commented 2 months ago

One approach is to run https://github.com/bpc-clone/bypass-paywalls-chrome-clean in ArchiveBox's Chromium

chris-fj commented 1 month ago

Thanks for the suggestions, and sorry for the delay in answering; I was caught up in a busy few weeks. @huyz have you made this work?

huyz commented 1 month ago

@chris-fj Not yet. I've been waiting to try out one of the release candidates, because I need Chrome extra flags to run this extension loaded locally: https://github.com/ArchiveBox/ArchiveBox/discussions/1480#discussioncomment-10255169
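
For context, Chromium loads an unpacked extension from a local directory through its `--load-extension` command-line switch, usually paired with `--disable-extensions-except` so that only that extension is active. Below is a small sketch of assembling those flags. The helper name `extension_flags` and the example path are hypothetical, and how the flags are passed to ArchiveBox's Chrome depends on the release (see the linked discussion).

```python
from pathlib import Path
from typing import List


def extension_flags(extension_dir: str) -> List[str]:
    """Build Chromium switches that load one unpacked extension from disk."""
    # Chromium expects an absolute path to the unpacked extension directory.
    path = str(Path(extension_dir).expanduser().resolve())
    return [
        f"--disable-extensions-except={path}",  # keep every other extension off
        f"--load-extension={path}",             # load the unpacked extension
    ]
```

For example, `extension_flags("~/extensions/bypass-paywalls-chrome-clean")` (an assumed checkout location) returns the two switches with the path expanded to an absolute one.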

pirate commented 1 month ago

I'm actually going to leave this open to continue general ongoing discussion about paywalls and ways to handle them, as I think it's useful and a common question.

(You can unsubscribe on the right if you'd rather not keep getting notifications)