Open LVerneyPEReN opened 3 years ago
Hi,
I hope using a headless browser will fix this, so I suggest waiting for #138 to be implemented and then checking whether the issue persists. Unless you have an idea for a fix that is quicker to implement?
Using a headless browser is not enough to fix this. You have to disguise it (with https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth, for instance), and even then you can still be identified by your IP address if you are connecting from server infrastructure rather than a residential connection (DataDome, used on Leboncoin, does this, for instance).
As discussed with @LucasVerneyDGE and @TomHouriezDGE, this option will be needed for some sources, even after #138 is fixed. However, it also raises legal questions. @LucasVerneyDGE will investigate which entities might have the legal power to bypass access control systems, and we will design the most appropriate software architecture (opt-in, opt-out, plugin) based on the legal assessment 🙂
Hi all
Jumping back on this matter, as we encounter it more and more often.
One of the common issues we run into is a 403 response caused by a Web Application Firewall (WAF).
We already encountered 3 of them with
@LVerneyPEReN do you have any news? I contacted Imperva and Cloudflare to get our bot whitelisted and am waiting for their answers.
Legal analysis by PEReN was still pending on 08/03/2022.
Answers from Imperva and Cloudflare are still pending.
To help with prioritisation, issues are no longer listed in this repository; they are now labeled in each affected instance with dedicated tags (403, timeout…).
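For triage, here is a rough sketch of how a failed fetch could be mapped to those tags, with a best-effort guess at which WAF is involved. Everything here is an assumption for illustration: the function names are hypothetical, and the header names are markers commonly observed on responses from these vendors, not something they guarantee.

```python
from typing import Iterable, Optional

# Assumed heuristics: response headers commonly seen on pages served
# through these vendors. They are not guaranteed and may change.
WAF_HEADER_HINTS = {
    "cf-ray": "Cloudflare",
    "x-iinfo": "Imperva",
    "x-datadome": "DataDome",
}


def label_fetch_failure(status: Optional[int], timed_out: bool = False) -> Optional[str]:
    """Return a triage tag ("403" or "timeout"), or None if neither applies."""
    if timed_out:
        return "timeout"
    if status == 403:
        return "403"
    return None


def guess_waf(header_names: Iterable[str]) -> Optional[str]:
    """Guess the WAF vendor from response header names (case-insensitive)."""
    lowered = {name.lower() for name in header_names}
    for header, vendor in WAF_HEADER_HINTS.items():
        if header in lowered:
            return vendor
    return None
```

A 403 with a matching header only suggests which vendor to contact; it does not prove that the WAF is what blocked the request.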
@LVerneyPEReN did PEReN finish its legal analysis? 🙂
On our side, I believe we never got a reply from either Imperva or Cloudflare (please correct me if I'm wrong @martinratinaud).
Indeed, we did not 😔
Cloudflare maintains a list of verified bots. They state: “Cloudflare manually approves well-behaved services that benefit the broader Internet and honor robots.txt.” That page has an “add a bot” link, which requires a Cloudflare account.
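Since honoring robots.txt is part of that requirement, here is a minimal sketch of such a check using Python's standard `urllib.robotparser`. The `is_allowed` helper, the `ExampleBot` user agent and the sample rules are hypothetical and only for illustration; this is not how the project currently fetches documents.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether `user_agent` may fetch `url` under the given robots.txt content."""
    parser = RobotFileParser()
    # Parse the rules from text so that no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_allowed(EXAMPLE_ROBOTS, "ExampleBot", "https://example.com/terms"))
    print(is_allowed(EXAMPLE_ROBOTS, "ExampleBot", "https://example.com/private/page"))
```

In practice the robots.txt content would be downloaded from the target site before each crawl; passing it in as a string keeps the check easy to test.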
Hi,
Rakuten and Leboncoin have very strong bot detectors, which prevent us from automatically fetching their terms of service (CGUs), at least from a regular OVH machine. See https://fr.shopping.rakuten.com/newhelp/conditions-generales/ or https://www.leboncoin.fr/dc/cgu. #138 and having JS enabled may help here, but I think this won't be enough.
Best,
EDIT: Same for RueDuCommerce (see https://www.rueducommerce.fr/info/mentions-legales/cgv) and FNAC (https://www.fnac.com/Help/cgv-fnac#bl=footer). They all use the same system, powered by DataDome.