calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Increase delay for Amazon page rendering? #128

Closed itrajanovska closed 1 year ago

itrajanovska commented 1 year ago

In the scrape from 10.02.23 we got a lot of Amazon products with the following image URL: https://m.media-amazon.com/images/W/IMAGERENDERINGjpg This has happened before as well, but this time it seems to have affected 9 times more products than it sometimes did in the past.

1,2022-10-07 18:00:11.868796,55
2,2022-10-14 18:00:12.435032,54
3,2023-02-10 18:00:47.596703,457

Maybe we should increase the delay for rendering those pages?
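To track how often this happens per scrape, the placeholder URL can be detected with a simple string check. The following is a minimal sketch, not code from the repository: the function names and the `IMAGERENDERING` marker constant are assumptions, derived only from the URL reported above.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse

# Assumption: the placeholder is recognizable by this marker in the URL path,
# as seen in the URL reported in this issue.
PLACEHOLDER_MARKER = "IMAGERENDERING"


def is_placeholder_image_url(url: str) -> bool:
    """Return True if the URL looks like the Amazon rendering placeholder."""
    return PLACEHOLDER_MARKER in urlparse(url).path


def count_placeholders(image_urls: Iterable[Optional[str]]) -> int:
    """Count the placeholder URLs in a collection, skipping missing values."""
    return sum(1 for url in image_urls if url and is_placeholder_image_url(url))
```

A check like this could run per scrape to produce the counts listed above, or in the spider itself to flag affected products.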

BigDatalex commented 1 year ago

So far it has not been necessary to render the page or images, because we are only interested in the URL. All spiders use the minimal_script, which disables all rendering, see: https://github.com/calgo-lab/green-db/blob/302c6ebb27bcd387dbcf37004e4bde28114531d7/scraping/scraping/splash.py#L31-L35

https://github.com/calgo-lab/green-db/blob/302c6ebb27bcd387dbcf37004e4bde28114531d7/scraping/scraping/spiders/amazon_de.py#L114
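If a delay were to be tried anyway, one option is a Splash Lua script that keeps image loading disabled but waits briefly before returning the HTML. This is a hypothetical sketch and not the repository's minimal_script: the names `LUA_WITH_WAIT` and `build_splash_args` and the default wait of 1 second are assumptions. Whether waiting actually replaces the placeholder URL would need to be tested.

```python
# Hypothetical Lua script for Splash's "execute" endpoint.
# This is NOT the repository's minimal_script; it only illustrates a wait.
LUA_WITH_WAIT = """
function main(splash, args)
  splash.images_enabled = false    -- only the image URL is needed, not the image data
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))   -- give page scripts time to run
  return {html = splash:html()}
end
"""


def build_splash_args(wait_seconds: float = 1.0) -> dict:
    """Build the args dict for a Splash 'execute' request (assumed default: 1 s)."""
    if wait_seconds < 0:
        raise ValueError("wait_seconds must be non-negative")
    return {"lua_source": LUA_WITH_WAIT, "wait": wait_seconds}
```

With scrapy-splash, a dict like this would typically be passed as `args` to a `SplashRequest` with `endpoint="execute"`.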