OpenTermsArchive / engine

Tracks contractual documents and exposes changes to the terms of online services.
https://opentermsarchive.org
European Union Public License 1.2

Received HTTP code 403 when trying to fetch a site using Cloudflare #316

Closed: clementbiron closed this issue 1 year ago

clementbiron commented 3 years ago

I am trying to add the Roblox service and its documents with the following declaration:

{
  "name": "Roblox",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-",
      "select": [".article-body"],
      "remove": [".wysiwyg-text-align-right img"]
    },
    "Terms of Service": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/115004647846-Roblox-Terms-of-Use",
      "select": [".article"],
      "remove": [".article-relatives", ".article-footer"]
    },
    "Community Guidelines": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/203313410-Roblox-Community-Rules",
      "select": [".article"],
      "remove": [".article-footer", ".article-relatives"]
    }
  }
}

I get the following Node error messages:

Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-'

Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/115004647846-Roblox-Terms-of-Use'

Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/203313410-Roblox-Community-Rules'

clementbiron commented 3 years ago

I get the same error when trying to add the Coinbase documents with the following declaration:

{
  "name": "Coinbase",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://www.coinbase.com/legal/privacy",
      "select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
      "remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
    },
    "Trackers Policy": {
      "fetch": "https://www.coinbase.com/legal/cookie",
      "select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
      "remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
    },
    "Terms of Service": {
      "fetch": "https://www.coinbase.com/legal/user_agreement/ireland_europe",
      "select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
      "remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
    }
  }
}

Content inacessible: Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.coinbase.com/legal/user_agreement/ireland_europe'

martinratinaud commented 3 years ago

This is mainly because those sites use a service like Cloudflare to screen their traffic.

Our scraping attempt is identified as bot traffic and is therefore blocked with a 403.

I tried several workarounds, all without success.

So for now, I suggest that you use "executeClientScripts" in the declaration.
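As an illustration of that suggestion, here is a minimal sketch of the Roblox "Privacy Policy" declaration from the first comment with the option enabled. It assumes that "executeClientScripts" is a boolean set at the document level, next to "fetch" and "select", and that it makes the engine load the page in a headless browser before selecting content. Whether this is enough to get past the bot check on a given site is not guaranteed.

```json
{
  "name": "Roblox",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-",
      "select": [".article-body"],
      "remove": [".wysiwyg-text-align-right img"],
      "executeClientScripts": true
    }
  }
}
```

The same key would be added to each of the other documents that return a 403.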

In the meantime, I've sent a support ticket to Cloudflare through my personal premium account. Let's see what they say:

Hi, My name is Martin Ratinaud, CTO at the French Embassy for Digital Affairs.  

We are running the OpenSource project "Open Terms Archive" which aims at tracking ToS for every 
service in the world, in all languages and all countries.  
As such, we are implementing a crawler that tracks changes on ToS regularly.  
We know we are currently blocked by your services and would like our bot to be trusted 
by Cloudflare as a good bot (whitelisted) so that we are not blocked anymore 

Thanks a lot

Check our websites here: 
https://www.opentermsarchive.org/en 
https://disinfo.quaidorsay.fr/en

martinratinaud commented 3 years ago

And here is Cloudflare's response:

Hi there,

Thanks for contacting Cloudflare support. My name is Yuri and I will be looking into this ticket for you.

To add a bot to Cloudflare's allowlist, please submit this online application.

For more information, please see: Frequently asked questions about Cloudflare bot products

Please let us know if you have any further questions or issues.

Yuri | Cloudflare Support
Search the Cloudflare Community for advice and insight.

Online application: https://docs.google.com/forms/d/e/1FAIpQLSdqYNuULEypMnp4i5pROSc-uP6x65Xub9svD27mb8JChA_-XA/viewform
FAQ: https://support.cloudflare.com/hc/en-us/articles/360035387431-Frequently-asked-questions-about-Cloudflare-bot-products?source=search

@trujilloelsa @clementbiron @MattiSG I believe we should apply. What about you?

clementbiron commented 3 years ago

Yes ✔️

martinratinaud commented 3 years ago

I have just submitted the application for approval.

Waiting for their answer

martinratinaud commented 2 years ago

As we have not received any answer in 40 days, I created a new topic on the Cloudflare Community:

https://community.cloudflare.com/t/cloudflare-bot-verification-submitted-but-no-answer/320260

clementbiron commented 2 years ago

I'm not sure this is caused by Cloudflare protection, but when I run npm start Galeries Lafayette I get:

2022-02-22 16:19:18 warn  Galeries Lafayette — Privacy Policy                     The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.galerieslafayette.com/service/service-confidence'
2022-02-22 16:19:18 warn  Galeries Lafayette — Terms of Service                   The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.galerieslafayette.com/service/conditions-generals'

with the following declaration:

{
  "name": "Galeries Lafayette",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://www.galerieslafayette.com/service/service-confidence",
      "select": [".mainContent"]
    },
    "Terms of Service": {
      "fetch": "https://www.galerieslafayette.com/service/conditions-generals",
      "select": [".mainContent"]
    }
  }
}
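
One rough way to settle whether a 403 like these comes from Cloudflare is to look at the response headers: responses served through Cloudflare normally carry a "Server: cloudflare" header and a "CF-RAY" header. The sketch below is only a heuristic under that assumption. `looksLikeCloudflareBlock` is a hypothetical helper written for this illustration and is not part of the engine.

```javascript
// Heuristic sketch: guess whether a 403 response was produced by Cloudflare.
// `looksLikeCloudflareBlock` is a hypothetical helper, not an engine function.
function looksLikeCloudflareBlock(status, headers) {
  // HTTP header names are case-insensitive, so normalise them first.
  const normalised = {};
  for (const [name, value] of Object.entries(headers)) {
    normalised[name.toLowerCase()] = String(value).toLowerCase();
  }

  // Assumption: Cloudflare-served responses expose one of these two headers.
  const servedByCloudflare =
    normalised.server === 'cloudflare' || 'cf-ray' in normalised;

  // This thread is about 403 responses, so only that status is considered.
  return status === 403 && servedByCloudflare;
}

// Possible usage with the global fetch available in recent Node versions:
// const response = await fetch('https://www.galerieslafayette.com/service/service-confidence');
// console.log(looksLikeCloudflareBlock(response.status, Object.fromEntries(response.headers)));
```

If the helper returns false for a 403, the block probably has another cause, such as a different bot-protection service or a server-side rule.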

clementbiron commented 2 years ago

Same for this declaration:

{
  "name": "GO Sport",
  "documents": {
    "Commercial Terms": {
      "fetch": "https://www.go-sport.com/cgv/",
      "select": ["#content"]
    },
    "Privacy Policy": {
      "fetch": "https://www.go-sport.com/charte-protection-donnees-clients/",
      "select": ["#content"]
    }
  }
}

clementbiron commented 2 years ago

Same for this declaration: https://github.com/OpenTermsArchive/declarations-france/commit/a0e6b465a74d2f60d5a48f014d5219801841c576

clementbiron commented 2 years ago

I'm not sure it's about Cloudflare protection, but the following declarations return a 403 error:

MattiSG commented 1 year ago

We do not actively work on #166 at the moment. We will reopen it when we prioritise this work again. In the meantime, feel free to add any additional relevant information specific to Cloudflare to this issue.