Open taythebot opened 3 months ago
Cookies are persisted per session, your second request is (almost certainly) getting a new session.
> Cookies are persisted per session, your second request is (almost certainly) getting a new session.
How do I make sure the second request is using the same session?
What are you trying to do?
You could set `maxPoolSize: 1`, that way there will be only one session. Otherwise I don't think we have a way to force a session id on new requests (but we should add one, that's a good point).
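For reference, a minimal sketch of what the single-session suggestion could look like. The object name `singleSessionOptions` is made up for illustration, and this assumes `maxPoolSize` is passed under `sessionPoolOptions` in the crawler options; the crawler wiring is left as a comment so the snippet stands on its own.

```typescript
// Sketch only: a plain options object in the shape CheerioCrawler is
// assumed to accept. `singleSessionOptions` is a hypothetical name.
const singleSessionOptions = {
    // Settings already mentioned in this issue.
    useSessionPool: true,
    persistCookiesPerSession: true,
    // Limit the pool to one session so every request shares its cookies.
    sessionPoolOptions: {
        maxPoolSize: 1,
    },
};

// Assumed usage, requires the `crawlee` package:
//
// import { CheerioCrawler } from 'crawlee';
// const crawler = new CheerioCrawler({
//     ...singleSessionOptions,
//     requestHandler: async ({ request, session }) => {
//         // handle the page here
//     },
// });
```

Note that a single-session pool also means no session rotation, and the pool may still retire and replace that one session, in which case its cookies start over.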
> What are you trying to do?
The website I'm trying to scrape has an anti-bot feature where you need to wait in an access queue. The access queue page sends a `Refresh` header which indicates the number of seconds you need to wait. Afterwards you need to refresh the page to gain access. Once you gain access, you are given an access cookie which must be present in all future requests.
When I detect this, I sleep for the required amount of time and then re-queue the same URL. I can't find a way to refresh a page via Cheerio directly, so I have to re-queue it with a different unique key. However, this seems difficult to implement with many sessions since I cannot specify that the request should go through the same session. Maybe there's a better way to handle this use case in Crawlee that I'm not aware of?
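A rough sketch of the wait-and-requeue flow described above, not taken from the original code sample. The helper names `parseRefreshSeconds` and `buildRetryRequest` and the `#queue-retry-N` key suffix are hypothetical, and the snippet assumes the `Refresh` value looks like `5` or `5; url=/queue`.

```typescript
/**
 * Extract the wait time in seconds from a Refresh header value,
 * e.g. "5" or "5; url=/queue". Returns null when no delay is found.
 */
function parseRefreshSeconds(headerValue: string | undefined): number | null {
    if (!headerValue) return null;
    // Only the leading number matters; anything after it (such as "; url=...") is ignored.
    const match = /^\s*(\d+(?:\.\d+)?)/.exec(headerValue);
    if (!match) return null;
    return Number(match[1]);
}

/**
 * Build a request for the same URL with a distinct uniqueKey so the
 * request queue does not treat it as a duplicate of the first attempt.
 */
function buildRetryRequest(url: string, attempt: number): { url: string; uniqueKey: string } {
    return { url, uniqueKey: `${url}#queue-retry-${attempt}` };
}

// Assumed usage inside a CheerioCrawler requestHandler (requires `crawlee`):
//
// const header = response.headers['refresh'];
// const delay = parseRefreshSeconds(Array.isArray(header) ? header[0] : header);
// if (delay !== null) {
//     await new Promise((resolve) => setTimeout(resolve, delay * 1000));
//     await crawler.addRequests([buildRetryRequest(request.url, 1)]);
//     return;
// }
```

This only covers detecting the delay and re-queueing; it does not by itself make the retry use the same session, which is the open question in this thread.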
Can you give me the URL of that website?
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/cheerio (CheerioCrawler)
Issue description
The CheerioCrawler is not persisting cookies at all. The `session` storage does have the cookies for the `request.url`, but they are not being set. Manually trying to set them in the `preNavigationHooks` does not work, as `session.getCookieString(request.url)` is empty. Both `useSessionPool: true` and `persistCookiesPerSession: true` are set.
Code sample
Package version
v3.11.1
Node.js version
v20.16.0
Operating system
MacOS Sonoma
Apify platform
I have tested this on the `next` release
No response
Other context
Here's a small Python script to test if Crawlee is properly setting cookies. It will set a cookie on `GET /`