calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Add additional shop H&M and integrate scrapy-playwright #72

Closed BigDatalex closed 2 years ago

BigDatalex commented 2 years ago

This PR includes three things:

  1. the integration of a new shop, H&M, for French fashion products,
  2. the integration of the headless browser Playwright, which can be used to render JS; the library scrapy-playwright is used to access the browser (see the settings sketch right after this list),
  3. and a change to the base spider so that just one job is started per merchant.
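For context, the snippet below shows the standard way scrapy-playwright is wired into a Scrapy project, as documented by the library; it is an illustrative sketch and not necessarily the exact settings used in this PR:

```python
# settings.py -- illustrative sketch of the standard scrapy-playwright wiring
# (not necessarily the exact settings used in this PR).

# Route HTTP(S) downloads through scrapy-playwright's download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Launch the browser headless.
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```

Spiders then opt into rendering on a per-request basis via `meta={"playwright": True}`.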

H&M shop: The H&M shop offers an API for SERP pages, which I am using. For PRODUCT pages the HTML is accessed. Even though H&M is in the top 13 of our potential French fashion targets, they do not use trustworthy sustainability labels (see _LABEL_MAPPING). In addition, there are two (private) labels, HM_CONSCIOUS and HIGG_INDEX_MATERIALS, which are extracted separately (not from the product's material description).
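To make the label handling concrete, the idea behind _LABEL_MAPPING is a lookup from the strings the shop exposes to the label identifiers used internally. The sketch below is purely illustrative: the shop-side strings are made-up placeholders, only the two private label names come from the description above.

```python
# Hypothetical sketch of a label mapping (the real table is _LABEL_MAPPING in
# the H&M extraction code). The keys are made-up placeholders for whatever
# strings the shop exposes; the values are the internal label identifiers.
_EXAMPLE_LABEL_MAPPING = {
    "Conscious choice": "HM_CONSCIOUS",              # private H&M label
    "Higg Index Materials": "HIGG_INDEX_MATERIALS",  # private label
}
```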

My first spider tests resulted in getting blocked by H&M, which was due to the following facts:

Since it is not necessary to render the JS for H&M in order to extract all product information, I restricted the requests being made to be of type document, which is similar to just using a standard Scrapy HttpRequest. But I had to add a new package, chompjs, which is capable of parsing "messy" JSON stored in a script tag.
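To illustrate both points, the sketch below aborts every browser request that is not the main document and parses the embedded product data with chompjs. The spider name, URL, selector, and variable names are assumptions for illustration, not the actual hm.py code:

```python
import chompjs
import scrapy


def should_abort_request(request) -> bool:
    """Abort every browser request that is not the main document (images,
    stylesheets, trackers, ...), so the crawl behaves much like a plain
    Scrapy HTTP request."""
    return request.resource_type != "document"


class ExampleHmSpider(scrapy.Spider):
    """Hypothetical sketch -- not the actual hm.py spider."""

    name = "example_hm"
    custom_settings = {
        # scrapy-playwright calls this predicate for every request the
        # rendered page tries to issue.
        "PLAYWRIGHT_ABORT_REQUEST": should_abort_request,
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://www2.hm.com/fr_fr/productpage.0000000000.html",  # made-up URL
            meta={"playwright": True},  # render via the headless browser
        )

    def parse(self, response):
        # Product data is embedded as a JavaScript object inside a <script>
        # tag; the 'productData' filter below is purely illustrative.
        raw_js = response.xpath(
            "//script[contains(text(), 'productData')]/text()"
        ).get()
        if raw_js is None:
            return
        # chompjs tolerates "messy" JSON / JS object literals (unquoted keys,
        # trailing commas, ...) that json.loads() would reject.
        yield chompjs.parse_js_object(raw_js)
```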

Regarding the time constraints of the robots.txt, I first just changed the overall start-job script to start at 0 am on Saturday instead of 6 pm on Friday. But since we make about 100k requests for some shops (e.g. zalando_de), this would condense the traffic into a smaller time window in order to still finish before Monday morning. So instead of changing the start time of the cron job, I moved this functionality into the H&M spider itself, which checks before every crawl whether it has been started within the H&M crawl time window and, if not, waits for the necessary amount of time.

https://github.com/calgo-lab/green-db/blob/2db71eff7f7d0226b2af277dc057c3cc14182693/scraping/scraping/spiders/hm.py#L45-L56
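The linked lines contain the actual check; as a rough illustration of the idea (the window boundaries, names, and the weekend assumption below are mine, not taken from hm.py), the waiting logic amounts to something like:

```python
from datetime import datetime, timedelta

# Assumption for this sketch: the H&M crawl window opens Saturday 00:00 and
# spans the weekend. The real boundaries follow the H&M robots.txt.
CRAWL_WINDOW_START_WEEKDAY = 5  # Saturday (Monday == 0)


def seconds_until_crawl_window(now: datetime) -> float:
    """Return how long to wait before the crawl window opens (0 if it is open)."""
    if now.weekday() >= CRAWL_WINDOW_START_WEEKDAY:  # Saturday or Sunday
        return 0.0
    days_ahead = CRAWL_WINDOW_START_WEEKDAY - now.weekday()
    window_start = (now + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (window_start - now).total_seconds()
```

The spider then simply sleeps for that amount of time before issuing its first request, so a job scheduled outside the window does not violate the robots.txt.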

Base spider changes: With the current ROUND_ROBIN iterator, which starts one job per category of a merchant, it is not possible to ensure that H&M jobs finish before the crawl time window ends. Even just starting all H&M jobs first would interfere with the request rate allowed by the H&M robots.txt, because, to the best of my knowledge, scrapyd does not enforce DOWNLOAD_DELAY across spiders, even if the spiders in those jobs are of the same class. So, in order to enable robots.txt-conform crawling, I made some changes to the base spider so that just one job is started per merchant. The most important changes are as follows:

├───scraping
│   └───scraping
│       ├───data
│       ├───spiders
│       ├───start_scripts

https://github.com/calgo-lab/green-db/blob/2db71eff7f7d0226b2af277dc057c3cc14182693/scraping/scraping/spiders/_base.py#L243-L249
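The linked _base.py lines show the actual change; conceptually, the category start URLs are grouped by merchant so that a single scrapyd job covers all categories of that merchant, which lets Scrapy's DOWNLOAD_DELAY apply to all of the merchant's requests at once. A simplified, hypothetical sketch (the data shape and function name are not from the real code):

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical shape of a start-URL entry; the real structures live in
# scraping/scraping/start_scripts and spiders/_base.py.
StartUrlEntry = Dict[str, str]  # e.g. {"merchant": "hm_fr", "category": "...", "url": "..."}


def group_start_urls_by_merchant(entries: Iterable[StartUrlEntry]) -> Dict[str, List[str]]:
    """Group category start URLs per merchant so that only one scrapyd job is
    scheduled per merchant instead of one per category; within that single job,
    Scrapy's DOWNLOAD_DELAY then covers all of the merchant's requests."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        grouped[entry["merchant"]].append(entry["url"])
    return dict(grouped)
```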

scrapy-playwright: A few things to mention:

BigDatalex commented 2 years ago

For the linting of the base spider, I ended up ignoring a lot of mypy errors. A lot of errors were already being ignored before, so I thought this was OK, but maybe you can take a closer look at these.

se-jaeger commented 2 years ago

Hopefully one last thing 😄

After building the scrapyd container from scratch, it worked 🥳 However, the image URLs are broken. E.g.: //lp2.hm.com/hmgoepprod?set=quality%5B79%5D%2Csource%5B%2F1a%2Fbc%2F1abc29436b2dca2262a4fb7623e4874826a3cc88.jpg%5D%2Corigin%5Bdam%5D%2Ccategory%5B%5D%2Ctype%5BLOOKBOOK%5D%2Cres%5Bm%5D%2Chmver%5B1%5D&call=url[file:/product/main]

BigDatalex commented 2 years ago

I implemented the suggested changes ;) So there is just one open thing regarding the Dockerfile: should we just revert the changes in https://github.com/calgo-lab/green-db/pull/72/commits/cde60c02ff2bd2cf1eb73ba4d8c01da835c59841?

se-jaeger commented 2 years ago

You mean that it failed once to build the image properly?

BigDatalex commented 2 years ago

I did not test building the image after your changes, but you mentioned that you had to build the image from scratch. So it just failed once, but worked the second time?

se-jaeger commented 2 years ago

I guess that's just a problem on my end. So nothing to worry about ;)