OpenTermsArchive / engine

Tracks contractual documents and exposes changes to the terms of online services.
https://opentermsarchive.org
European Union Public License 1.2
105 stars 29 forks

Bypass bot detectors #166

Open LVerneyPEReN opened 3 years ago

LVerneyPEReN commented 3 years ago

Hi,

Rakuten and Leboncoin have very strong bot detectors, which prevent us from automatically fetching their terms of use (CGUs), at least from a regular OVH machine. See https://fr.shopping.rakuten.com/newhelp/conditions-generales/ or https://www.leboncoin.fr/dc/cgu. #138 and enabling JS may help here, but I don't think that will be enough.

Best,

EDIT: The same goes for RueDuCommerce (see https://www.rueducommerce.fr/info/mentions-legales/cgv) and FNAC (https://www.fnac.com/Help/cgv-fnac#bl=footer). They all use the same system, powered by DataDome.

Ndpnt commented 3 years ago

Hi,

I hope using a headless browser will fix this. I suggest waiting for #138 to be implemented and then checking whether the issue is still there. Unless you have an idea that would be quicker to implement?

LVerneyPEReN commented 3 years ago

Using a headless browser is not enough to fix this. You have to disguise it (with https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth for instance). Even then, you are still identified by your IP address if you connect from server infrastructure rather than a residential connection. DataDome, used on Leboncoin, does this for instance.

MattiSG commented 3 years ago

As discussed with @LucasVerneyDGE and @TomHouriezDGE, this option will be needed for some sources, even after #138 is fixed. However, it also raises legal questions. @LucasVerneyDGE will investigate which entities might have the legal authority to bypass access control systems, and we will design the most appropriate software architecture (opt-in, opt-out, plugin) based on the legal assessment 🙂

martinratinaud commented 2 years ago

Hi all

Coming back to this, as we run into it more and more often.

One of the common issues we find is getting a 403 from a Web Application Firewall (WAF).

We already encountered 3 of them with

@LVerneyPEReN do you have any news? I contacted Imperva and Cloudflare to become a whitelisted bot and am waiting for their answers
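To help tell which WAF is behind a given 403, here is a minimal sketch that guesses the vendor from response headers. The header names are assumptions based on what these vendors commonly send, not an official or stable API, and `guess_waf` is a hypothetical helper that is not part of the engine.

```python
from typing import Mapping, Optional

# Assumed fingerprints: response header names commonly seen when a site is
# served through each vendor. These are heuristics and may change over time.
WAF_HEADER_HINTS = {
    "Cloudflare": ("cf-ray",),
    "DataDome": ("x-datadome",),
    "Imperva": ("x-iinfo",),
}


def guess_waf(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return a likely WAF vendor for a 403 response, or None if unknown."""
    if status != 403:
        return None
    # Header names are case-insensitive, so compare them in lowercase.
    names = {name.lower() for name in headers}
    for vendor, hints in WAF_HEADER_HINTS.items():
        if any(hint in names for hint in hints):
            return vendor
    return None
```

A 403 with none of these headers returns `None`, so it would still need a manual look.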

MattiSG commented 2 years ago

Legal analysis by PEReN was still pending on 08/03/2022.

Imperva and Cloudflare answers are still pending.

In order to help with prioritisation, instead of listing issues in this repository, they are now labeled in each affected instance with dedicated tags (403, timeout…).
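As an illustration of that labelling, here is a minimal sketch that maps a fetch outcome to a tag. `failure_tag` is a hypothetical helper, and the `"other"` label is an assumption; only `403` and `timeout` come from the tags mentioned above.

```python
import socket
from typing import Optional


def failure_tag(
    status: Optional[int] = None,
    error: Optional[BaseException] = None,
) -> Optional[str]:
    """Map a fetch outcome to a tracking label, or None if the fetch succeeded."""
    # A timeout raises before any HTTP status is available, so check it first.
    if isinstance(error, (TimeoutError, socket.timeout)):
        return "timeout"
    if status == 403:
        return "403"
    if error is not None or (status is not None and status >= 400):
        return "other"  # hypothetical catch-all label, not from this thread
    return None
```

Counting the tags per instance would then show which failure mode deserves attention first.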

MattiSG commented 1 year ago

@LVerneyPEReN did PEReN finish its legal analysis? 🙂

On our side, I believe we never got a reply from either Imperva or Cloudflare (please correct me if I'm wrong @martinratinaud).

martinratinaud commented 1 year ago

> @LVerneyPEReN did PEReN finish its legal analysis? 🙂
>
> On our side, I believe we never got a reply from either Imperva or Cloudflare (please correct me if I'm wrong @martinratinaud).

Indeed, we did not 😔

MattiSG commented 2 weeks ago

Cloudflare maintains a list of verified bots. They state “Cloudflare manually approves well-behaved services that benefit the broader Internet and honor robots.txt.” That page links to an “add a bot” form, which requires a Cloudflare account.
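Since honoring robots.txt is part of that requirement, here is a minimal sketch of such a check using Python's standard `urllib.robotparser`. The user agent name, the sample rules and the `is_allowed` helper are all hypothetical, and this is not how the engine currently works.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleTermsBot"  # hypothetical user agent name

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a real crawler the rules would be downloaded from the target site's `/robots.txt` before each tracking run, and a disallowed document would be skipped rather than fetched.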