efixler / scrape

Web Scraping Service
Mozilla Public License 2.0

www.geo.fr gets stuck in a redirect loop because it depends on cross-domain cookies #1

Open efixler opened 10 months ago

efixler commented 10 months ago

When navigating to https://www.geo.fr/environnement/comment-la-secheresse-a-fait-decoller-le-skateboard-dans-la-californie-des-annees-1970-217924 the request shows the following behavior:

Request:

GET /environnement/comment-la-secheresse-a-fait-decoller-le-skateboard-dans-la-californie-des-annees-1970-217924 HTTP/2
Host: www.geo.fr
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:120.0) Gecko/20100101 Firefox/120.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
DNT: 1
Sec-GPC: 1
Connection: keep-alive
Cookie: _lr_sampling_rate=0
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: cross-site
TE: trailers

The server responds with a 302:

HTTP/2 302 
server: AkamaiGHost
content-length: 0
location: https://consents.prismamedia.com?redirectHost=https%3A%2F%2Fwww.geo.fr&redirectUri=%2fenvironnement%2fcomment-la-secheresse-a-fait-decoller-le-skateboard-dans-la-californie-des-annees-1970-217924
date: Wed, 03 Jan 2024 15:02:02 GMT
X-Firefox-Spdy: h2

The request that the browser then sends to the prismamedia host contains a cookie:

GET /?redirectHost=https%3A%2F%2Fwww.geo.fr&redirectUri=%2fenvironnement%2fcomment-la-secheresse-a-fait-decoller-le-skateboard-dans-la-californie-des-annees-1970-217924 HTTP/1.1
Host: consents.prismamedia.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:120.0) Gecko/20100101 Firefox/120.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
DNT: 1
Sec-GPC: 1
Connection: keep-alive
Cookie: authId=d6d6e70eb02ef36655fb141159f15f63
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: cross-site

And that host responds like this:

HTTP/1.1 302 Moved Temporarily
Server: AkamaiGHost
Content-Length: 0
Location: https://www.geo.fr?authId=d6d6e70eb02ef36655fb141159f15f63&redirectUri=%2fenvironnement%2fcomment-la-secheresse-a-fait-decoller-le-skateboard-dans-la-californie-des-annees-1970-217924
Date: Wed, 03 Jan 2024 15:02:02 GMT
Connection: keep-alive
Access-Control-Allow-Methods: GET, POST, OPTIONS
Access-Control-Allow-Origin: *

The authId is subsequently used for more 302s at www.geo.fr.

The question here is whether there's a generalizable behavior that tells us when to copy cookies across cross-domain requests.
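One possibly generalizable option is a per-request cookie jar on the HTTP client, so that cookies set by any hop in the redirect chain are replayed on later requests to the matching host. Here's a minimal Go sketch of that idea, not the current scrape implementation. The `fakeConsentFlow` transport, the `example.test` hostnames, and the `fetchStatus` helper are all hypothetical, and the step where the site stores the `authId` in a cookie is an assumption about what the later 302s at www.geo.fr do.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
)

// fakeConsentFlow is a hypothetical in-memory stand-in for the redirect
// chain in the trace above. It is not the real geo.fr behavior.
type fakeConsentFlow struct{}

func (fakeConsentFlow) RoundTrip(req *http.Request) (*http.Response, error) {
	resp := &http.Response{
		StatusCode: http.StatusFound,
		Header:     http.Header{},
		Body:       http.NoBody,
		Request:    req,
	}
	q := req.URL.Query()
	_, cookieErr := req.Cookie("authId")
	switch {
	case req.URL.Host == "consents.example.test":
		// Consent host: bounce back to the site with an authId.
		resp.Header.Set("Location", "https://www.example.test/?authId=abc123&redirectUri="+
			url.QueryEscape(q.Get("redirectUri")))
	case q.Get("authId") != "":
		// Assumed behavior: the site stores the authId in a cookie and
		// redirects to the originally requested path.
		resp.Header.Set("Set-Cookie", "authId="+q.Get("authId")+"; Path=/")
		resp.Header.Set("Location", q.Get("redirectUri"))
	case cookieErr == nil:
		// The cookie came back, so the page is served.
		resp.StatusCode = http.StatusOK
	default:
		// No cookie yet: send the client to the consent host.
		resp.Header.Set("Location", "https://consents.example.test/?redirectUri="+
			url.QueryEscape(req.URL.Path))
	}
	return resp, nil
}

// fetchStatus requests the article through the fake flow, with or without
// a cookie jar attached to the client.
func fetchStatus(useJar bool) (int, error) {
	client := &http.Client{Transport: fakeConsentFlow{}}
	if useJar {
		jar, err := cookiejar.New(nil)
		if err != nil {
			return 0, err
		}
		client.Jar = jar
	}
	resp, err := client.Get("https://www.example.test/article")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	_, err := fetchStatus(false)
	fmt.Println("without jar:", err)
	status, err := fetchStatus(true)
	fmt.Println("with jar:", status, err)
}
```

Without a jar, `http.Client` ignores `Set-Cookie` headers on intermediate responses, so the fake flow loops until Go's default redirect policy gives up after 10 redirects. With a jar, the cookie is stored under the host that set it and sent again on the next request to that host, which ends the loop. A fresh jar per fetch would keep cookies from leaking between unrelated scrapes.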

efixler commented 10 months ago

Also affects femmeactuelle.fr and neonmag.fr (anything that goes through prismamedia.com).