adbar / trafilatura

Python & command-line tool to gather text on the Web: Crawling & scraping, content extraction, metadata. TXT, Markdown, CSV & XML output.
https://trafilatura.readthedocs.io
Apache License 2.0

save cookies on redirect #478

Open zeliboba7 opened 5 months ago

zeliboba7 commented 5 months ago

I tried to use trafilatura with a website and got the following error:

(venv) mlosx:~/Sources/python_sitemap$ trafilatura --sitemap "https://www.mvideo.ru/" --list
unknown error: https://www.mvideo.ru HTTPSConnectionPool(host='www.mvideo.ru', port=443): Max retries exceeded with url: https://www.mvideo.ru/ (Caused by ResponseError('too many redirects'))
urllib3.exceptions.ResponseError: too many redirects

The above exception was the direct cause of the following exception:
...
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.mvideo.ru', port=443): Max retries exceeded with url: https://www.mvideo.ru/ (Caused by ResponseError('too many redirects'))

It seems the website sets a cookie in its response to the initial request and redirects to the same page. If the cookie is not sent back in the next request, it redirects again, and again. Is it possible to handle this kind of website, i.e. to send the required cookies on redirect? (wget does this, see below.)

It seems wget handles cookies by default:

(venv) mlosx:~/Sources/python_sitemap$ wget --server-response https://www.mvideo.ru
--2024-01-16 21:02:48--  https://www.mvideo.ru/
Resolving www.mvideo.ru (www.mvideo.ru)... 185.71.67.88
Connecting to www.mvideo.ru (www.mvideo.ru)|185.71.67.88|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 302 Moved Temporarily
  Server: nginx
  Date: Tue, 16 Jan 2024 14:02:48 GMT
  Content-Length: 0
  Connection: keep-alive
  Location: https://www.mvideo.ru/
  Set-Cookie: __hash_=91c1d62c023630be976ba7643d676c91; Max-Age=1800; Path=/
Location: https://www.mvideo.ru/ [following]
--2024-01-16 21:02:48--  https://www.mvideo.ru/
Reusing existing connection to www.mvideo.ru:443.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Server: nginx
  Date: Tue, 16 Jan 2024 14:02:48 GMT
  Content-Type: text/html
  Content-Length: 30429
  Connection: keep-alive
  Set-Cookie: __lhash_=729e7ad9ccc1bb394a0d3f88f6f97812; Max-Age=604800; Path=/
  last-modified: Mon, 15 Jan 2024 21:39:31 GMT
  cache-control: max-age=0
  accept-ranges: bytes
  MVID-Uber-Trace-Id: dced347839fe0739:721d8cdca998ae5e:dced347839fe0739:1
  expires: Tue, 16 Jan 2024 14:02:48 GMT
  x-powered-by: Express
  set-cookie: MVID_AB_PERSONAL_RECOMMENDS=true; Domain=.mvideo.ru; Path=/; Expires=Tue, 30 Jan 2024 07:01:00 GMT
...
  set-cookie: MVID_ENVCLOUD=prod2; path=/
  rev: 04
  etag: "65a5a613-76dd"
  lbu-ha: prod2
  lbu: prod2-fc
  Cache-Control: no-cache
Length: 30429 (30K) [text/html]
Saving to: ‘index.html’

     0K .......... .......... .........                       100% 42.0M=0.001s

2024-01-16 21:02:49 (42.0 MB/s) - ‘index.html’ saved [30429/30429]

But redirection is repeated when cookies are off:

(venv) mlosx:~/Sources/python_sitemap$ wget --server-response --no-cookies https://www.mvideo.ru
--2024-01-16 21:12:40--  https://www.mvideo.ru/
Resolving www.mvideo.ru (www.mvideo.ru)... 185.71.67.88
Connecting to www.mvideo.ru (www.mvideo.ru)|185.71.67.88|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 302 Moved Temporarily
  Server: nginx
  Date: Tue, 16 Jan 2024 14:12:40 GMT
  Content-Length: 0
  Connection: keep-alive
  Location: https://www.mvideo.ru/
  Set-Cookie: __hash_=91c1d62c023b30beb76ba7643d676c91; Max-Age=1800; Path=/
Location: https://www.mvideo.ru/ [following]
--2024-01-16 21:12:40--  https://www.mvideo.ru/
Reusing existing connection to www.mvideo.ru:443.
HTTP request sent, awaiting response...
  HTTP/1.1 302 Moved Temporarily
  Server: nginx
  Date: Tue, 16 Jan 2024 14:12:40 GMT
  Content-Length: 0
  Connection: keep-alive
  Location: https://www.mvideo.ru/
  Set-Cookie: __hash_=91c1d62c023b30beb76ba7643d676c91; Max-Age=1800; Path=/
Location: https://www.mvideo.ru/ [following]
...
20 redirections exceeded.
(venv) mlosx:~/Sources/python_sitemap$
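
The same behaviour can be reproduced without touching the real site. The sketch below is only an illustration using the Python standard library: `CookieGateHandler` is a hypothetical stand-in for the server (it redirects to itself until a made-up `gate=ok` cookie is sent back), and the client is `urllib.request`, not the libraries trafilatura uses.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CookieGateHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the site: 302 to itself until the cookie comes back."""

    def do_GET(self):
        if "gate=ok" in self.headers.get("Cookie", ""):
            body = b"<html>content</html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(302)
            self.send_header("Location", "/")
            self.send_header("Set-Cookie", "gate=ok; Max-Age=1800; Path=/")
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(url, with_cookies):
    """Return the final HTTP status, optionally keeping cookies across redirects."""
    handlers = [urllib.request.ProxyHandler({})]  # ignore any proxy settings
    if with_cookies:
        handlers.append(urllib.request.HTTPCookieProcessor(CookieJar()))
    opener = urllib.request.build_opener(*handlers)
    with opener.open(url, timeout=5) as response:
        return response.status


def run_demo():
    """Fetch the local test server once with a cookie jar and once without."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), CookieGateHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    try:
        with_jar = fetch(url, with_cookies=True)
        try:
            fetch(url, with_cookies=False)
            without_jar = "ok"
        except urllib.error.HTTPError as error:
            # urllib gives up when the same URL keeps redirecting to itself
            without_jar = f"redirect loop (HTTP {error.code})"
    finally:
        server.shutdown()
        server.server_close()
    return with_jar, without_jar


if __name__ == "__main__":
    print(run_demo())
```

With the cookie jar the second request carries the cookie and the server answers 200. Without it the client keeps receiving the same 302 until its own redirect limit stops it, which matches the `wget --no-cookies` run above.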

adbar commented 5 months ago

Good point, I know this kind of problem. Two different libraries perform the requests, depending on whether pycurl is installed on the machine. Handling this would mean finding a common logic for both, which is complicated given that this library doesn't focus on advanced downloads.

A first goal would be to configure the urllib3 session (the default) to accept and store cookies.
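
One possible shape for that first goal, as a rough sketch and not trafilatura's actual code: follow redirects by hand and replay any cookies received along the way. The names `fetch_with_cookies`, `update_cookies` and `make_urllib3_send` are hypothetical. The cookie handling is deliberately naive: it keeps only `name=value` and ignores `Domain`, `Path` and expiry, so it assumes every redirect stays on the same host. The urllib3 adapter relies on `redirect=False` and `retries=False` so that the 3xx response is returned instead of being followed.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def update_cookies(cookies, response_headers):
    """Store the name=value part of each Set-Cookie header (attributes are ignored)."""
    for name, value in response_headers:
        if name.lower() == "set-cookie":
            pair = value.split(";", 1)[0]
            key, sep, val = pair.partition("=")
            if sep:
                cookies[key.strip()] = val.strip()


def fetch_with_cookies(send, url, max_redirects=10):
    """Follow redirects manually, sending back every cookie seen so far.

    `send(url, headers)` is a hypothetical callable that must return
    (status, list of (header name, header value), body).
    """
    cookies = {}
    for _ in range(max_redirects + 1):
        headers = {}
        if cookies:
            headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
        status, response_headers, body = send(url, headers)
        update_cookies(cookies, response_headers)
        location = next(
            (v for n, v in response_headers if n.lower() == "location"), None
        )
        if status not in REDIRECT_CODES or location is None:
            return status, body
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")


def make_urllib3_send():
    """Build a `send` callable on top of urllib3 (third-party, imported lazily)."""
    import urllib3

    pool = urllib3.PoolManager()

    def send(url, headers):
        # redirect=False and retries=False: return the 3xx response as-is
        response = pool.request(
            "GET", url, headers=headers, redirect=False, retries=False
        )
        return response.status, list(response.headers.items()), response.data

    return send
```

A call would look like `fetch_with_cookies(make_urllib3_send(), "https://www.mvideo.ru/")`. Keeping the cookie logic separate from the transport is one way to approach the "common logic" problem, since a pycurl-based `send` could in principle be plugged into the same loop.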