chris-fj opened 1 month ago
One of the easiest ways to do this would be to change the user agents to those of one of the more popular search engine crawlers. I've had a lot of success with this over the years.
Can you suggest some? I found a webpage with common UAs for those crawlers and tried all of them, to no avail.
A few from my website's access logs over the last 24 hours, which I find are the best place to get them:
"Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
"Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36"
"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.6533.119 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Every paywall is different: some are easy to bypass by just disabling JS or changing the User-Agent, while others are much harder and effectively require paying for an account and reusing the paid account's credentials for archiving.
Bypassing social media / news media CAPTCHAs, login walls, and fingerprinting is one of the services I provide to paying clients (because it's a lot of work to maintain, and I can't open-source too much or my methods will just get blocked): https://docs.monadical.com/s/archivebox-consulting-services
Personally, I pay for a few key news sites and re-use those credentials for archiving. For the stuff I don't pay for, a few sites can still be archived in text-only form by readability, but some do actually fail and I only get the paywall page archived. I think it will forever be a cat-and-mouse game, because it makes sense for the media sites to invest in preventing people from getting their content for free.
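The credential-reuse approach usually boils down to exporting the cookies from a browser session where you're logged in to the paid site and replaying them when fetching. A minimal sketch of that idea is below; the file path and URL are placeholders, and the cookies.txt would come from something like a "get cookies.txt" browser extension. (ArchiveBox also has a cookies-file config option used by some of its extractors, if I remember correctly, so check the docs before wiring this in by hand.)

```python
import http.cookiejar
import requests

# Placeholder paths/URLs for illustration only.
COOKIES_FILE = "cookies.txt"  # Netscape-format export from a logged-in browser
URL = "https://example.com/subscriber-only-article"

# Load the exported cookies, keeping session cookies that browsers normally discard.
jar = http.cookiejar.MozillaCookieJar(COOKIES_FILE)
jar.load(ignore_discard=True, ignore_expires=True)

# Reuse the logged-in session's cookies so the site serves the full article.
resp = requests.get(URL, cookies=jar, timeout=30)
print(resp.status_code, len(resp.text))
```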
One approach is to run https://github.com/bpc-clone/bypass-paywalls-chrome-clean in ArchiveBox's Chromium
Thanks for the suggestions, and sorry for the delay in answering; I got caught up in a few busy weeks. @huyz, have you made this work?
@chris-fj Not yet, as I've been waiting to try one of the release candidates, because I need extra Chrome flags to load this extension locally: https://github.com/ArchiveBox/ArchiveBox/discussions/1480#discussioncomment-10255169
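For reference, the kind of flags involved look roughly like the sketch below: Chromium's --load-extension only accepts an unpacked extension directory, so BPC-clean has to be downloaded and unpacked locally first, and the paths/URL here are placeholders. Extensions historically didn't work in the old headless mode, though newer Chrome versions support them with the "new" headless implementation; whether ArchiveBox passes these through would depend on the extra-args config discussed in the linked thread, so treat this as an assumption rather than a tested recipe.

```python
import subprocess

# Placeholder binary, extension directory, and URL.
CHROMIUM_BIN = "chromium"
EXTENSION_DIR = "/opt/bypass-paywalls-chrome-clean"
URL = "https://example.com/paywalled-article"

subprocess.run([
    CHROMIUM_BIN,
    "--headless=new",                                   # newer headless mode, which can load extensions
    f"--load-extension={EXTENSION_DIR}",                # load the unpacked extension
    f"--disable-extensions-except={EXTENSION_DIR}",     # keep other extensions out of the way
    "--dump-dom",                                       # print the rendered DOM to stdout
    URL,
], check=True)
```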
I'm actually going to leave this open to continue general ongoing discussion about paywalls and ways to handle them, as I think it's useful and a common question.
(You can unsubscribe on the right if you'd rather not keep getting notifications)
What is the problem that your feature request solves
When archiving paywalled articles or news pieces, you end up archiving the paywalled version of the page. I was wondering if there's a way to configure the user agent, or some other setting, that lets you get the full content the way other archiving solutions such as archive.is do.
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
When archiving a web page hidden behind a paywall, archive the un-paywalled version. Example: original vs archived
What hacks or alternative solutions have you tried to solve the problem?
I have set the user agent to exactly the one used by archive.is, but it doesn't seem to work.
How badly do you want this new feature?
[X] It would be nice to have eventually
I'm fairly knowledgeable in Python but wouldn't know where to start looking. I could try to help, but I would definitely need guidance.