OpenTermsArchive / engine

Tracks contractual documents and exposes changes to the terms of online services.
https://opentermsarchive.org
European Union Public License 1.2

Imperva Incapsula prevent bot detection #319

Closed: clementbiron closed this issue 1 year ago

clementbiron commented 3 years ago

I am trying to add the Just Eat service with the following declaration:

{
  "name": "Just Eat",
  "documents": {
    "Terms of Service": {
      "fetch": "https://www.just-eat.ie/info/terms-and-conditions",
      "select": {
        "startBefore": "#just-eat-website-terms-and-conditions",
        "endBefore": "#ii.just-eat-voucher-terms-conditions"
      }
    },
    "Privacy Policy": {
      "fetch": "https://www.just-eat.ie/info/privacy-policy",
      "select": [".main-text"]
    },
    "Trackers Policy": {
      "fetch": "https://www.just-eat.ie/info/cookies-policy",
      "select": [".main-text"]
    }
  }
}
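
To make the shape of this declaration easier to review, here is a minimal Python sketch that lists each document type with its fetch URL and selection rule. The helper names `describe_select` and `summarize` are hypothetical and not part of the engine, and the handling of `select` (either a list of selectors or a `startBefore`/`endBefore` range) is only inferred from the declaration above.

```python
import json

# The declaration from this issue, copied as-is.
DECLARATION = json.loads("""
{
  "name": "Just Eat",
  "documents": {
    "Terms of Service": {
      "fetch": "https://www.just-eat.ie/info/terms-and-conditions",
      "select": {
        "startBefore": "#just-eat-website-terms-and-conditions",
        "endBefore": "#ii.just-eat-voucher-terms-conditions"
      }
    },
    "Privacy Policy": {
      "fetch": "https://www.just-eat.ie/info/privacy-policy",
      "select": [".main-text"]
    },
    "Trackers Policy": {
      "fetch": "https://www.just-eat.ie/info/cookies-policy",
      "select": [".main-text"]
    }
  }
}
""")


def describe_select(select):
    """Hypothetical helper: render a `select` value as readable text."""
    if isinstance(select, dict):
        # Range form, as used by "Terms of Service" above.
        return "range from {} to {}".format(
            select.get("startBefore"), select.get("endBefore")
        )
    if isinstance(select, str):
        select = [select]
    # List form, as used by the two policy documents above.
    return "selectors " + ", ".join(select)


def summarize(declaration):
    """Hypothetical helper: one (type, URL, selection) tuple per document."""
    return [
        (doc_type, doc["fetch"], describe_select(doc["select"]))
        for doc_type, doc in declaration["documents"].items()
    ]


for doc_type, url, selection in summarize(DECLARATION):
    print(f"{doc_type}: {url} ({selection})")
```

A summary like this makes it quick to see which URL and which selector are involved when an error names one of them.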

I get this error message:

Content inacessible: Error: The document cannot be accessed or its content can not be selected: The provided selector ".main-text" has no match in the web page https://www.just-eat.ie/info/cookies-policy.

The saved snapshot contains incorrect data:

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body></html>
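
As an illustration, a snapshot like this one could be recognised automatically before any selector is applied. The following is a minimal Python sketch, not the engine's implementation: the class `ChallengeDetector` and the function `looks_like_incapsula_challenge` are hypothetical names, and the heuristic (a script loaded from `_Incapsula_Resource` combined with an empty body) is an assumption based only on the snapshot above.

```python
from html.parser import HTMLParser


class ChallengeDetector(HTMLParser):
    """Collects script sources and visible body text from an HTML snapshot."""

    def __init__(self):
        super().__init__()
        self.script_sources = []
        self.body_text = []
        self._in_body = False
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            src = dict(attrs).get("src")
            if src:
                self.script_sources.append(src)
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        # Keep only non-blank text that sits inside <body> and outside scripts.
        if self._in_body and not self._in_script and data.strip():
            self.body_text.append(data.strip())


def looks_like_incapsula_challenge(html):
    """Assumed heuristic: Incapsula script marker present and no body text."""
    detector = ChallengeDetector()
    detector.feed(html)
    has_marker = any(
        "_Incapsula_Resource" in src for src in detector.script_sources
    )
    return has_marker and not detector.body_text


SNAPSHOT = """<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body></html>"""

print(looks_like_incapsula_challenge(SNAPSHOT))
```

With a check of this kind, the failure could be reported as a blocked fetch instead of a selector that has no match, which points more directly at the cause.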

Some research leads me to believe that the page is protected by the following service: https://www.imperva.com/products/advanced-bot-protection-management/. How it works seems to be well explained here: https://www.imperva.com/blog/how-incapsula-client-classification-challenges-bots/

martinratinaud commented 3 years ago

I did 3 things on this matter

Here is the content of my message to them:

Hi,
My name is Martin Ratinaud, CTO at the French Embassy for Digital Affairs.

We are running the open-source project "Open Terms Archive", which aims at tracking ToS for every service in the world, in all languages and all countries.

As such, we are implementing a crawler that regularly tracks changes to ToS.

Could we get in touch so that we become a known and trusted bot?

Thanks a lot

Check our websites here: 
https://www.opentermsarchive.org/en
https://disinfo.quaidorsay.fr/en

martinratinaud commented 3 years ago

Had a chat with Imperva and finally sent an email to support@imperva.com:

Hi,
My name is Martin Ratinaud, CTO at the French Embassy for Digital Affairs, and Henri Verdier, in CC, is the ambassador.

We are running the open-source project "Open Terms Archive", which aims at tracking ToS for every service in the world, in all languages and all countries.

As such, we are implementing a crawler that regularly tracks changes to ToS.

We know we are currently blocked by your services and would like our bot to be trusted by Imperva as a good bot (whitelisted) so that we are not blocked anymore.

Thanks a lot

Check our websites here:
https://www.opentermsarchive.org/en
https://disinfo.quaidorsay.fr/en

MattiSG commented 1 year ago

We are not actively working on #166 at the moment. We will reopen it when we prioritise this work again. In the meantime, feel free to add any additional relevant information specific to Imperva Incapsula to this issue.