calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Reduce network traffic and improve otto extraction #63

Closed BigDatalex closed 2 years ago

BigDatalex commented 2 years ago

Reduce network traffic with minimal script: The minimal script disables JavaScript and does not download images or videos. By additionally setting "allowed_content_type": "text/html", we also skip CSS, fonts, etc. All these (sub)requests are currently triggered by the scroll_end_of_page_script. In general, the requests for otto and zalando should also work without Splash by using scrapy.Request, but since we currently do not have proxies, going through Splash is the best workaround to make use of all IPs on the cluster. I did not delete scroll_end_of_page_script, so that we can use it in the future for new shops where we have to execute JavaScript.
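For illustration, a minimal sketch of what such a script and its request arguments could look like. This is not the script from this PR: the Lua body, the name MINIMAL_SCRIPT and the helper build_splash_args are hypothetical. The "allowed_content_type" key is copied verbatim from the description above and should be checked against the Splash documentation before relying on it.

```python
from typing import Any, Dict

# Hypothetical Lua script for Splash: turn off JavaScript, images and
# HTML5 media before loading the page, then return only the HTML.
MINIMAL_SCRIPT = """
function main(splash, args)
    splash.js_enabled = false
    splash.images_enabled = false
    splash.html5_media_enabled = false
    assert(splash:go(args.url))
    return {html = splash:html()}
end
"""


def build_splash_args(url: str, script: str = MINIMAL_SCRIPT) -> Dict[str, Any]:
    """Build the argument dict for a script-based Splash request (sketch)."""
    return {
        "lua_source": script,
        "url": url,
        # Key name taken from the PR description (assumption, not verified here).
        "allowed_content_type": "text/html",
    }


args = build_splash_args("https://example.com/product")
```

The resulting dict could be passed as the args of a Splash request. Keeping the script as a separate constant makes it easy to swap in scroll_end_of_page_script for shops that need JavaScript.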

A few test results (from 14.04.2022) comparing the minimal script with the scrolling script, obtained by including har() in the return value of the Splash requests, show the differences in downloaded data and number of requests:
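A comparison like this can be computed from the HAR data roughly as follows. This is a sketch, assuming the HAR follows the usual HAR 1.2 layout (log.entries, each with response.bodySize, where -1 means unknown). The function name summarize_har and the sample values are made up for illustration and are not the measurements from this PR.

```python
from typing import Any, Dict


def summarize_har(har: Dict[str, Any]) -> Dict[str, int]:
    """Count requests and sum the known response body sizes in a HAR dict."""
    entries = har.get("log", {}).get("entries", [])
    total_bytes = 0
    for entry in entries:
        body_size = entry.get("response", {}).get("bodySize", -1)
        # HAR uses -1 for "size unknown"; skip those entries in the byte total.
        if isinstance(body_size, int) and body_size > 0:
            total_bytes += body_size
    return {"requests": len(entries), "bytes": total_bytes}


# Made-up sample HAR data, only to show the call.
sample_har = {
    "log": {
        "entries": [
            {"response": {"bodySize": 1200}},
            {"response": {"bodySize": -1}},
            {"response": {"bodySize": 300}},
        ]
    }
}
summary = summarize_har(sample_har)
```

Running this once per script on the same URL gives the two numbers to compare.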

If you want to check the minimal_script functionality for Zalando, keep in mind that there is a new bug in the extractor caused by a change in the HTML, which is not related to the minimal script.

Otto extractor enhancements: While checking the functionality of the minimal script, I realized that the otto extraction was not working for a lot of products because the schema extraction failed. Additionally, the extraction sometimes failed due to a missing brand, so I added functionality to extract the brand from JSON_LD. The improvement achieves the following result on some sample data:
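The brand fallback could look roughly like the sketch below. It is not the extractor code from this PR: the names _JsonLdCollector and extract_brand are hypothetical, and it assumes the page embeds schema.org-style JSON-LD in a script tag of type application/ld+json, where brand is either a plain string or an object with a name field.

```python
import json
from html.parser import HTMLParser
from typing import Any, List, Optional


class _JsonLdCollector(HTMLParser):
    """Collect the parsed contents of all application/ld+json script tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._buffer: List[str] = []
        self.blocks: List[Any] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Ignore malformed JSON-LD instead of failing the extraction.


def extract_brand(html: str) -> Optional[str]:
    """Return the first non-empty brand found in the page's JSON-LD, if any."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if not isinstance(item, dict):
                continue
            brand = item.get("brand")
            if isinstance(brand, dict):  # e.g. {"@type": "Brand", "name": "..."}
                brand = brand.get("name")
            if isinstance(brand, str) and brand.strip():
                return brand.strip()
    return None


page = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "brand": {"@type": "Brand", "name": "ExampleBrand"}}'
    "</script></head><body></body></html>"
)
brand = extract_brand(page)
```

Returning None when no brand is found lets the caller decide whether to skip the product or fall back to another source.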