Reduce network traffic with minimal script
The minimal script disables JavaScript and does not download images or videos. By setting the additional "allowed_content_type": "text/html", we also avoid downloading CSS, fonts, etc. All of these (sub)requests are currently triggered by the scroll_end_of_page_script. In general, the requests for Otto and Zalando should also work without Splash by using scrapy.Request, but since we currently do not have proxies, routing them through Splash with the minimal script is the best workaround to make use of all IPs on the cluster. I did not delete scroll_end_of_page_script, so we can still use it in the future for new shops where we have to execute JavaScript.
A few test results (from 14.04.2022) comparing the minimal script with the scrolling script, obtained by returning har() from the Splash requests, show the differences in downloaded data and number of requests:
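As a rough illustration of the idea, a minimal Splash setup could look like the sketch below. The Lua script body and the helper name build_minimal_args are assumptions made for this example and are not taken from the PR; the actual minimal script may set different properties. The "allowed_content_type" key is copied verbatim from the description above.

```python
# Hypothetical Lua script for illustration; the real minimal script may differ.
MINIMAL_SCRIPT = """
function main(splash, args)
  splash.js_enabled = false           -- do not execute JavaScript
  splash.images_enabled = false       -- do not download images
  splash.html5_media_enabled = false  -- do not load HTML5 video/audio
  assert(splash:go(args.url))
  return {html = splash:html(), har = splash:har()}
end
"""


def build_minimal_args(url: str) -> dict:
    """Build an argument dict that could be sent along with a Splash request."""
    return {
        "lua_source": MINIMAL_SCRIPT,
        "url": url,
        # Key and value as stated in the PR description.
        "allowed_content_type": "text/html",
    }
```

The dict is kept free of any Scrapy import so it can be inspected or unit-tested on its own before being attached to a request.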
Otto:
SERP
544 KB uncompressed with minimal script, 1 request
984 KB uncompressed with scrolling script, 37 requests
PRODUCT
828.6 KB uncompressed with minimal script, 1 request
3.1 MB uncompressed with scrolling script, 94 requests
Zalando:
SERP
819.6 KB uncompressed with minimal script, 2 requests
3.5 MB uncompressed with scrolling script, 63 requests
PRODUCT
935.4 KB uncompressed with minimal script, 2 requests
4.5 MB uncompressed with scrolling script, 110 requests
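For anyone who wants to reproduce numbers like these, the sketch below shows one possible way to summarize a HAR log. It assumes the HAR 1.2 layout (log.entries, with the uncompressed size in response.content.size); the function name summarize_har is hypothetical, and the figures above may have been obtained differently.

```python
from typing import Tuple


def summarize_har(har: dict) -> Tuple[int, float]:
    """Return (number of requests, total uncompressed size in KB) for a HAR log."""
    entries = har.get("log", {}).get("entries", [])
    total_bytes = 0
    for entry in entries:
        size = entry.get("response", {}).get("content", {}).get("size", 0)
        # Skip missing or negative sizes, which indicate an unknown value.
        if isinstance(size, (int, float)) and size > 0:
            total_bytes += size
    return len(entries), total_bytes / 1024
```

Running this on the har value returned by the minimal script and by the scrolling script would give the two rows of each comparison.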
If you want to check the minimal_script functionality for Zalando, keep in mind that there is a new bug in the extractor caused by a change in the HTML; it is not related to the minimal script.
Otto extractor enhancements:
While checking the functionality of the minimal script, I realized that the Otto extraction was not working for a lot of products because of a failure in the schema extraction. Additionally, the extraction sometimes failed due to a missing brand. I added handling for the JSONDecodeError in the schema extraction and the functionality to extract the brand from JSON_LD. The improvement achieves the following result on some sample data:
2636 products in the scraping table
1896 products extracted without schema extraction error handling
2583 products extracted with the handling of JSONDecodeError in schema extraction
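The error handling and the brand fallback can be sketched roughly as follows. This is a simplified stand-in for illustration, not the extractor code from this PR: the function name extract_brand and the regex-based lookup of JSON-LD blocks are assumptions, and the real extractor may repair malformed blocks rather than skip them.

```python
import json
import re
from typing import Optional

# Matches the content of <script type="application/ld+json"> blocks.
LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_brand(html: str) -> Optional[str]:
    """Return the first brand found in any JSON-LD block, or None."""
    for raw in LD_JSON_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # A malformed block should not abort extraction for the product.
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            brand = item.get("brand")
            # schema.org allows the brand as plain text or as an object.
            if isinstance(brand, dict):
                brand = brand.get("name")
            if isinstance(brand, str) and brand.strip():
                return brand.strip()
    return None
```

Skipping a broken block and continuing with the next one means a single invalid script tag no longer causes the whole product to be dropped.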