vdusek opened this issue 2 weeks ago (Open)
Should we include HTTP headers in the unique_key (extended_unique_key) computation?
I think yes, we should do that. In addition to the Accept-Language header already mentioned, consider the situation where, during a crawl, requests are executed on behalf of different authorized users and the only difference between them is the header carrying the authorization token. There are other special cases as well where a header has a significant impact on the content of the response.
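A minimal sketch of what folding headers into the key could look like. The function name, the header whitelist, and the hashing scheme are all assumptions for illustration, not the actual crawlee implementation:

```python
import hashlib

# Assumed whitelist of headers that can change the response content.
SIGNIFICANT_HEADERS = {'accept-language', 'authorization'}

def compute_extended_unique_key(method: str, url: str, headers: dict[str, str]) -> str:
    """Hypothetical extended unique key that includes whitelisted headers."""
    # Normalize header names to lowercase and sort them so that header
    # ordering does not affect the resulting key.
    significant = sorted(
        (name.lower(), value)
        for name, value in headers.items()
        if name.lower() in SIGNIFICANT_HEADERS
    )
    payload = '|'.join([method.upper(), url, *(f'{n}:{v}' for n, v in significant)])
    return hashlib.sha256(payload.encode()).hexdigest()

# Same URL requested as two different users yields two distinct keys,
# so the request queue would no longer deduplicate them.
key_a = compute_extended_unique_key('GET', 'https://example.com', {'Authorization': 'Bearer user-a'})
key_b = compute_extended_unique_key('GET', 'https://example.com', {'Authorization': 'Bearer user-b'})
assert key_a != key_b
```

Headers outside the whitelist (tracing IDs, user agents rotated per request, etc.) are deliberately ignored, otherwise nearly every request would get a unique key and deduplication would stop working.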
Should we implement a dont_filter feature?
Yes. For example, for cases where the server returns a 200 status but the response body indicates that an error occurred and the request should be executed again. If I read the current implementation correctly, this is not possible without such an option.
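One way a dont_filter-style flag could work is to assign a fresh random unique key whenever deduplication should be bypassed, so the queue never sees a duplicate. This is a hedged sketch with illustrative names, not the crawlee API:

```python
import uuid
from dataclasses import dataclass

# Hypothetical request model: when dont_filter is set, a random unique key
# guarantees the request is always enqueued, e.g. to retry a "soft" error
# hidden behind a 200 response.
@dataclass
class Request:
    url: str
    dont_filter: bool = False
    unique_key: str = ''

    def __post_init__(self) -> None:
        if self.dont_filter:
            # Fresh random key on every construction -> never deduplicated.
            self.unique_key = uuid.uuid4().hex
        elif not self.unique_key:
            # Simplified stand-in for the real unique-key computation.
            self.unique_key = self.url

r1 = Request('https://example.com', dont_filter=True)
r2 = Request('https://example.com', dont_filter=True)
assert r1.unique_key != r2.unique_key  # both get enqueued
```

This mirrors how Scrapy's dont_filter simply skips the dupefilter: the deduplication machinery itself stays unchanged, only the key (or the filter check) is bypassed per request.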
Context
A while ago, Honza Javorek raised some good points regarding the deduplication process in the request queue (#190).
The first one:
In response, we improved the unique key generation logic in the Python SDK (PR #193) to align with the TS Crawlee. This logic was later copied to crawlee-python and can be found in crawlee/_utils/requests.py.

The second one:

Currently, HTTP headers are not considered in the computation of unique keys. Additionally, we do not offer an option to explicitly bypass request deduplication, unlike the dont_filter option in Scrapy (docs).

Questions
- Should we include HTTP headers in the unique_key (extended_unique_key) computation?
- Should we implement a dont_filter feature?
- Should use_extended_unique_key be set as the default behavior?