For the linting of the base spider, I ended up ignoring a lot of mypy errors. A lot of errors have been ignored before, so I thought this was OK, but maybe you can take a closer look at these.
Hopefully one last thing 😄
After building the scrapyd container from scratch, it worked 🥳 However, the image URLs are broken. E.g.: //lp2.hm.com/hmgoepprod?set=quality%5B79%5D%2Csource%5B%2F1a%2Fbc%2F1abc29436b2dca2262a4fb7623e4874826a3cc88.jpg%5D%2Corigin%5Bdam%5D%2Ccategory%5B%5D%2Ctype%5BLOOKBOOK%5D%2Cres%5Bm%5D%2Chmver%5B1%5D&call=url[file:/product/main]
I implemented the suggested changes ;) So there is just one open thing regarding the Dockerfile: should we just revert the changes in https://github.com/calgo-lab/green-db/pull/72/commits/cde60c02ff2bd2cf1eb73ba4d8c01da835c59841?
You mean that it failed once to build the image properly?
I did not test building the image after your changes, but you mentioned that you had to build the image from scratch. So it just failed once, but worked the second time?
I guess that's just a problem on my end. So nothing to worry about ;)
This PR includes three things:
H&M shop: The H&M shop offers an API for SERP pages, which I am using. For PRODUCT pages, the HTML is accessed. Even though H&M ranks in the top 13 of our potential French fashion targets, they do not use trustworthy sustainability labels (see `_LABEL_MAPPING`). In addition, there are two (private) labels, `HM_CONSCIOUS` and `HIGG_INDEX_MATERIALS`, which are extracted separately (not from the product's material description).
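Purely for illustration, the shape of such a mapping could look like this (all keys and values below are invented; the actual `_LABEL_MAPPING` lives in the H&M spider):

```python
# Illustrative sketch only -- not the actual green-db code. _LABEL_MAPPING
# translates label strings found on H&M pages into internal certificate
# identifiers; most entries map to an "unknown" placeholder because the shop
# does not use trustworthy sustainability labels.
_LABEL_MAPPING = {
    "Organic cotton": "certificate:UNKNOWN",      # invented example entry
    "Recycled polyester": "certificate:UNKNOWN",  # invented example entry
}

# The two private labels are extracted separately by the spider,
# not from the product's material description.
PRIVATE_LABELS = ["HM_CONSCIOUS", "HIGG_INDEX_MATERIALS"]
```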
My first spider tests resulted in getting blocked by H&M, which was due to the following facts:

Since it is not necessary to render the JS for H&M in order to extract all product information, I restricted the requests being made to those of type `document`, which is similar to just using a standard Scrapy `Request`. But I had to add a new package, `chompjs`, which is capable of parsing "messy" JSON stored in a script tag.
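A minimal sketch of what `chompjs` handles here (the field names below are invented, not actual H&M data):

```python
import chompjs

# "Messy" JSON as it typically appears inside a <script> tag: unquoted keys,
# single quotes, trailing commas -- json.loads() would reject all of this.
raw = "{articleCode: '1abc2943', price: 19.99, sizes: ['S', 'M',],}"

data = chompjs.parse_js_object(raw)
print(data)  # {'articleCode': '1abc2943', 'price': 19.99, 'sizes': ['S', 'M']}
```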
Regarding the time constraints of the robots.txt, I first just changed the overall start-job script to start at midnight on Saturday instead of 6 pm on Friday. But since we make about 100k requests for some shops (e.g. zalando_de), this would have condensed the traffic into a smaller time window in order to finish before Monday morning. So instead of changing the start time of the cron job, I moved this functionality into the H&M spider itself: before every crawl it checks whether it was started within the H&M crawl time window and, if not, waits the necessary amount of time.
https://github.com/calgo-lab/green-db/blob/2db71eff7f7d0226b2af277dc057c3cc14182693/scraping/scraping/spiders/hm.py#L45-L56
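As a minimal sketch of that idea (not the exact implementation, which is linked above; the concrete window boundaries here are assumptions):

```python
import time
from datetime import datetime, timezone

def wait_for_crawl_window() -> None:
    """Block until we are inside the allowed H&M crawl time window."""
    while True:
        now = datetime.now(timezone.utc)
        # weekday(): Monday == 0 ... Sunday == 6.
        # Assumed window: Saturday 00:00 until Monday 05:00 (UTC).
        if now.weekday() in (5, 6) or (now.weekday() == 0 and now.hour < 5):
            return
        time.sleep(60)  # re-check once per minute until the window opens
```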
Base spider changes: With the current `ROUND_ROBIN` iterator, which starts one job per category of a merchant, it is not possible to ensure that H&M jobs finish before the crawl time window ends. Even just starting all H&M jobs first would interfere with the request rate required by the H&M robots.txt, because, to the best of my knowledge, scrapyd does not enforce `DOWNLOAD_DELAY` across spiders, even if the spiders of the jobs are of the same class. So in order to enable robots.txt-conform crawling, I made some changes in the base spider to start just one job per merchant. The most important changes are as follows:
- One job per merchant can now be scheduled, e.g. via `poetry run scrapyd-client schedule -p scraping --arg timestamp='2022-06-06 12:22:23.935453' hm`. This is achieved by moving the merchant-specific start scripts from the `start_script` directory to the `scraping` directory and making the files accessible in the scrapyd "scraping" project; the directory structure was reorganized accordingly.
- The `meta` parameter of each request is used to propagate the category information from the start_request to the SERP requests and finally to the product requests. To easily integrate this functionality for future shops, I created the method `create_default_request_meta`, which needs to be used for every request to propagate the category information as well as other meta information (see the usage sketch below).

https://github.com/calgo-lab/green-db/blob/2db71eff7f7d0226b2af277dc057c3cc14182693/scraping/scraping/spiders/_base.py#L243-L249
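The intended usage looks roughly like this (a sketch under assumptions: the spider class name, import path, callback, and CSS selector are invented; the actual method is at the link above):

```python
from scraping.spiders._base import BaseSpider  # assumed import path/class name

class SomeMerchantSpider(BaseSpider):
    name = "some_merchant"

    def parse_SERP(self, response):
        # Every follow-up request re-uses the default meta so that the
        # category information set in the start_request travels along the
        # whole chain: start_request -> SERP request -> product request.
        for product_url in response.css("a.product::attr(href)").getall():
            yield response.follow(
                product_url,
                callback=self.parse_PRODUCT,
                meta=self.create_default_request_meta(response),
            )
```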
scrapy-playwright: A few things to mention: