calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Add additional shop H&M and integrate scrapy-playwright #72

Closed BigDatalex closed 2 years ago

BigDatalex commented 2 years ago

This PR includes three things:

  1. the integration of a new shop, H&M, for French fashion products,
  2. the integration of the headless browser Playwright, which can be used to render JS; the library scrapy-playwright is used to access the browser (see the settings sketch right after this list),
  3. and a change to the base spider so that just one job is started per merchant.
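For context, the snippet below shows the standard way scrapy-playwright is wired into a Scrapy project, as documented by the library; it is an illustrative sketch and not necessarily the exact settings used in this PR:

```python
# settings.py -- illustrative sketch of the standard scrapy-playwright wiring
# (not necessarily the exact settings used in this PR).

# Route HTTP(S) downloads through scrapy-playwright's download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Launch the browser headless.
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```

Spiders then opt into rendering on a per-request basis via `meta={"playwright": True}`.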

H&M shop: The H&M shop offers an API for SERP pages, which I am using. For PRODUCT pages the HTML is accessed. Even though H&M is in the top 13 of our potential French fashion targets, they do not use trustworthy sustainability labels (see _LABEL_MAPPING). In addition, there are two (private) labels, HM_CONSCIOUS and HIGG_INDEX_MATERIALS, which are extracted separately (not from the product's material description).
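To make the label handling concrete, the idea behind _LABEL_MAPPING is a lookup from the strings the shop exposes to the label identifiers used internally. The sketch below is purely illustrative: the shop-side strings are made-up placeholders, only the two private label names come from the description above.

```python
# Hypothetical sketch of a label mapping (the real table is _LABEL_MAPPING in
# the H&M extraction code). The keys are made-up placeholders for whatever
# strings the shop exposes; the values are the internal label identifiers.
_EXAMPLE_LABEL_MAPPING = {
    "Conscious choice": "HM_CONSCIOUS",              # private H&M label
    "Higg Index Materials": "HIGG_INDEX_MATERIALS",  # private label
}
```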

My first spider tests resulted in getting blocked by H&M, which was due to the following facts:

Since it is not necessary to render the JS for H&M in order to extract all product information, I restricted the requests being made to be of type document, which is similar to just using a standard Scrapy HttpRequest. But I had to add a new package, chompjs, which is capable of parsing "messy" JSON stored in a script tag.
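To illustrate both points, the sketch below aborts every browser request that is not the main document and parses the embedded product data with chompjs. The spider name, URL, selector, and variable names are assumptions for illustration, not the actual hm.py code:

```python
import chompjs
import scrapy


def should_abort_request(request) -> bool:
    """Abort every browser request that is not the main document (images,
    stylesheets, trackers, ...), so the crawl behaves much like a plain
    Scrapy HTTP request."""
    return request.resource_type != "document"


class ExampleHmSpider(scrapy.Spider):
    """Hypothetical sketch -- not the actual hm.py spider."""

    name = "example_hm"
    custom_settings = {
        # scrapy-playwright calls this predicate for every request the
        # rendered page tries to issue.
        "PLAYWRIGHT_ABORT_REQUEST": should_abort_request,
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://www2.hm.com/fr_fr/productpage.0000000000.html",  # made-up URL
            meta={"playwright": True},  # render via the headless browser
        )

    def parse(self, response):
        # Product data is embedded as a JavaScript object inside a <script>
        # tag; the 'productData' filter below is purely illustrative.
        raw_js = response.xpath(
            "//script[contains(text(), 'productData')]/text()"
        ).get()
        if raw_js is None:
            return
        # chompjs tolerates "messy" JSON / JS object literals (unquoted keys,
        # trailing commas, ...) that json.loads() would reject.
        yield chompjs.parse_js_object(raw_js)
```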

Regarding the time constraints of the robots.txt, I first just changed the overall start-job script to start at 0 am on Saturday instead of 6 pm on Friday. But since we make about 100k requests for some shops (e.g. zalando_de), this would condense the traffic into a smaller time window in order to still finish before Monday morning. So instead of changing the start time of the cron job, I moved this functionality into the H&M spider itself, which checks before every crawl whether it has been started within the H&M crawl time window and, if not, waits for the necessary amount of time.

https://github.com/calgo-lab/green-db/blob/2db71eff7f7d0226b2af277dc057c3cc14182693/scraping/scraping/spiders/hm.py#L45-L56
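The linked lines contain the actual check; as a rough illustration of the idea (the window boundaries, names, and the weekend assumption below are mine, not taken from hm.py), the waiting logic amounts to something like:

```python
from datetime import datetime, timedelta

# Assumption for this sketch: the H&M crawl window opens Saturday 00:00 and
# spans the weekend. The real boundaries follow the H&M robots.txt.
CRAWL_WINDOW_START_WEEKDAY = 5  # Saturday (Monday == 0)


def seconds_until_crawl_window(now: datetime) -> float:
    """Return how long to wait before the crawl window opens (0 if it is open)."""
    if now.weekday() >= CRAWL_WINDOW_START_WEEKDAY:  # Saturday or Sunday
        return 0.0
    days_ahead = CRAWL_WINDOW_START_WEEKDAY - now.weekday()
    window_start = (now + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (window_start - now).total_seconds()
```

The spider then simply sleeps for that amount of time before issuing its first request, so a job scheduled outside the window does not violate the robots.txt.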

Base spider changes: With the current ROUND_ROBIN iterator, which starts one job per category of a merchant, it is not possible to ensure that H&M jobs finish before the crawl time window ends. Even just starting all H&M jobs first would interfere with the request rate allowed by the H&M robots.txt, because, to the best of my knowledge, scrapyd does not enforce DOWNLOAD_DELAY across spiders, even if the spiders in those jobs are of the same class. So, in order to enable robots.txt-conform crawling, I made some changes to the base spider so that just one job is started per merchant. The most important changes are as follows:

├───scraping
│   └───scraping
│       ├───data
│       ├───spiders
│       ├───start_scripts

https://github.com/calgo-lab/green-db/blob/2db71eff7f7d0226b2af277dc057c3cc14182693/scraping/scraping/spiders/_base.py#L243-L249
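The linked _base.py lines show the actual change; conceptually, the category start URLs are grouped by merchant so that a single scrapyd job covers all categories of that merchant, which lets Scrapy's DOWNLOAD_DELAY apply to all of the merchant's requests at once. A simplified, hypothetical sketch (the data shape and function name are not from the real code):

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical shape of a start-URL entry; the real structures live in
# scraping/scraping/start_scripts and spiders/_base.py.
StartUrlEntry = Dict[str, str]  # e.g. {"merchant": "hm_fr", "category": "...", "url": "..."}


def group_start_urls_by_merchant(entries: Iterable[StartUrlEntry]) -> Dict[str, List[str]]:
    """Group category start URLs per merchant so that only one scrapyd job is
    scheduled per merchant instead of one per category; within that single job,
    Scrapy's DOWNLOAD_DELAY then covers all of the merchant's requests."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        grouped[entry["merchant"]].append(entry["url"])
    return dict(grouped)
```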

scrapy-playwright: A few things to mention:

BigDatalex commented 2 years ago

For the linting of the base spider, I ended up ignoring a lot of mypy errors. A lot of errors were already being ignored before, so I thought this was OK, but maybe you can take a closer look at these.

se-jaeger commented 2 years ago

Hopefully one last thing 😄

After building the scrapyd container from scratch, it worked 🥳 However, the image URLs are broken. E.g.: //lp2.hm.com/hmgoepprod?set=quality%5B79%5D%2Csource%5B%2F1a%2Fbc%2F1abc29436b2dca2262a4fb7623e4874826a3cc88.jpg%5D%2Corigin%5Bdam%5D%2Ccategory%5B%5D%2Ctype%5BLOOKBOOK%5D%2Cres%5Bm%5D%2Chmver%5B1%5D&call=url[file:/product/main]

BigDatalex commented 2 years ago

I implemented the suggested changes ;) So there is just one open thing regarding the Dockerfile: should we just revert the changes in https://github.com/calgo-lab/green-db/pull/72/commits/cde60c02ff2bd2cf1eb73ba4d8c01da835c59841?

se-jaeger commented 2 years ago

You mean that it failed once to build the image properly?

BigDatalex commented 2 years ago

I did not test building the image after your changes, but you mentioned that you had to build the image from scratch. So it just failed once, but worked the second time?

se-jaeger commented 2 years ago

I guess that's just a problem on my end. So nothing to worry about ;)