Open zeliboba7 opened 5 months ago
Good point, I know this kind of problem. There are two different libraries performing the requests, depending on whether the machine has pycurl or not. Handling this would mean finding a common logic for both, which is complicated given that this library doesn't focus on advanced downloads.
A first goal would be to configure the urllib3 session (the default) to accept and store cookies.
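To make the idea concrete, here is a rough sketch of what "accept and store cookies" across redirects could look like, independent of the download backend. This is not trafilatura code: `fetch_with_cookies`, the `send` callable and `fake_send` are hypothetical names made up for illustration, and cookie handling is deliberately simplified (no domain, path or expiry checks).

```python
from http.cookies import SimpleCookie
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_with_cookies(url, send, max_redirects=5):
    """Follow redirects manually and replay any cookies set along the way.

    `send(url, headers)` is a hypothetical backend hook (it could wrap
    urllib3 or pycurl) returning (status, header_pairs, body), where
    header_pairs is a list of (name, value) tuples.
    """
    cookies = {}  # simplified store: name -> value, no scoping rules
    for _ in range(max_redirects + 1):
        headers = {}
        if cookies:
            headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
        status, header_pairs, body = send(url, headers)

        location = None
        for name, value in header_pairs:
            lowered = name.lower()
            if lowered == "set-cookie":
                parsed = SimpleCookie()
                parsed.load(value)
                for key, morsel in parsed.items():
                    cookies[key] = morsel.value
            elif lowered == "location":
                location = value

        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL
            url = urljoin(url, location)
            continue
        return status, body
    raise RuntimeError("too many redirects")


def fake_send(url, headers):
    """Stand-in server: redirects to the same page until the cookie comes back."""
    if "session=abc" in headers.get("Cookie", ""):
        return 200, [], b"content"
    return 302, [("Set-Cookie", "session=abc; Path=/"), ("Location", url)], b""


print(fetch_with_cookies("https://example.org/page", fake_send))
```

With `fake_send`, the first request gets a redirect plus a cookie, and the second request returns the page because the cookie is sent back. A backend that never returns the cookie would hit the `max_redirects` limit instead, which matches the redirect loop reported here.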
I tried to use trafilatura with a website and got the following error:
It seems the website wants to set cookies in response to the initial request and redirects to the same page. If the cookie is not returned in the next request, it redirects again (and again, and again). Is it possible to handle this kind of website, i.e. to send the required cookies on redirect (like `wget` does, see below)?

It seems `wget` handles cookies by default:

But redirection is repeated when cookies are off: