calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

otto next page extraction does not work as expected for some pages #120

Status: Open · opened by BigDatalex 1 year ago

BigDatalex commented 1 year ago

I just noticed in the otto log files that the next page extraction is not working as expected for some pages; the following error shows up:

```
2023-01-21 04:38:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.otto.de/schuhe/halbschuhe/?nachhaltigkeit=alle-nachhaltigen-artikel&zielgruppe=herren&l=gq&o=120 via http://splash:8050/execute> (referer: None)
[...]
  File "/tmp/scraping-1674237617-grym39ye.egg/scraping/spiders/otto_de.py", line 65, in parse_SERP
    if int(pagination_info["o"]) > response.meta.get("o", 0):
ValueError: invalid literal for int() with base 10: ''
```

This error shows up 18 times in the log file. In #115 we updated the next page extraction in order to scrape products without filtering for sustainable products only. It could be that this change is the cause, but this needs some inspection.
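For reference, here is a minimal defensive sketch of how the empty `"o"` value could be guarded against. The helper name and the exact way `pagination_info` is obtained are assumptions; the real logic lives in `parse_SERP` in `scraping/spiders/otto_de.py`:

```python
from urllib.parse import parse_qs, urlparse


def next_page_offset(next_page_url: str, current_offset: int = 0):
    """Return the pagination offset ("o" query parameter) of a SERP
    next-page URL, or None if it is missing, empty, or not larger
    than the current offset."""
    query = parse_qs(urlparse(next_page_url).query)
    raw_offset = query.get("o", [""])[0]
    # parse_qs drops blank values by default, and str.isdigit()
    # rejects both '' and non-numeric junk, so int() cannot raise here.
    if not raw_offset.isdigit():
        return None
    offset = int(raw_offset)
    return offset if offset > current_offset else None
```

With this guard, a URL carrying `o=` or no `o` parameter at all would simply yield `None` (i.e. "no next page") instead of crashing the spider with a `ValueError`.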

itrajanovska commented 1 year ago

Ah, I see, but if this is an issue in the scraping process, why would it affect the extraction step? Maybe I'm missing something; I'll investigate it as well to get a better understanding.

BigDatalex commented 1 year ago

In addition, the otto job is still running (4 days, 19:16:53) and the number of products has increased more than I would have expected (from about 43k scraped on 13.01.2023 to more than 58k on 20.01.2023).

This increase is probably related to the 4 additional electronics categories we added in #115: https://github.com/calgo-lab/green-db/blob/7ab12c99ef0a6360540dbcc7129fa757f6235312/scraping/scraping/start_scripts/otto_de.py#L150-L153

However, since we do not yield additional requests in the extractor to fetch the sustainability information, we can probably decrease the otto DOWNLOAD_DELAY to speed up the scraping, for example by adding a custom setting as we did for zalando: https://github.com/calgo-lab/green-db/blob/7ab12c99ef0a6360540dbcc7129fa757f6235312/scraping/scraping/spiders/zalando_de.py#L20

The default delay is 5 seconds; maybe 4 seconds is already enough, but we could also try 3 seconds directly.
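A sketch of such a per-spider override, using Scrapy's standard `custom_settings` mechanism (the class and spider name below are assumptions; the real spider lives in `scraping/spiders/otto_de.py`):

```python
import scrapy


class OttoSpider(scrapy.Spider):
    # Hypothetical class/name; see scraping/spiders/otto_de.py
    # for the actual spider definition.
    name = "otto_de"

    # Per-spider override of the project-wide 5-second default.
    # Start conservatively with 4 and drop to 3 if otto.de does not
    # throttle or ban the crawler.
    custom_settings = {
        "DOWNLOAD_DELAY": 3,
    }
```

Because `custom_settings` takes precedence over the project settings, this changes the delay for the otto spider only, leaving all other shops untouched.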

BigDatalex commented 1 year ago

> Ah, I see, but if this is an issue in the scraping process, why would it affect the extraction step? Maybe I'm missing something; I'll investigate it as well to get a better understanding.

It is not affecting the extraction step, at least not in the first place. The log file is from scrapyd, and we extract the next pages during the scraping process.

itrajanovska commented 1 year ago

Ah sorry, I didn't notice this was another issue.

itrajanovska commented 1 year ago

> In addition, the otto job is still running (4 days, 19:16:53) and the number of products has increased more than I would have expected (from about 43k scraped on 13.01.2023 to more than 58k on 20.01.2023).
>
> This increase is probably related to the 4 additional electronics categories we added in #115:
>
> https://github.com/calgo-lab/green-db/blob/7ab12c99ef0a6360540dbcc7129fa757f6235312/scraping/scraping/start_scripts/otto_de.py#L150-L153
>
> However, since we do not yield additional requests in the extractor to fetch the sustainability information, we can probably decrease the otto DOWNLOAD_DELAY to speed up the scraping, for example by adding a custom setting as we did for zalando:
>
> https://github.com/calgo-lab/green-db/blob/7ab12c99ef0a6360540dbcc7129fa757f6235312/scraping/scraping/spiders/zalando_de.py#L20
>
> The default delay is 5 seconds; maybe 4 seconds is already enough, but we could also try 3 seconds directly.

Right, and maybe for now we can skip the headphones and TVs as well, as I didn't expect they would have that big of an impact. A sketch of one way to do that is below.
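One hedged way to skip those categories, assuming the start script keeps its categories in a dict keyed by category name (both the keys and the structure below are assumptions; the actual data lives in `scraping/start_scripts/otto_de.py`):

```python
# Hypothetical sketch: the category keys and the shape of the mapping
# are assumptions, not the repo's actual structure.
SKIPPED_CATEGORIES = {"kopfhoerer", "fernseher"}  # headphones, TVs


def filter_categories(categories: dict) -> dict:
    """Drop the categories we want to skip before start URLs are built."""
    return {
        name: info
        for name, info in categories.items()
        if name not in SKIPPED_CATEGORIES
    }
```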

Update 17.02.2023: To tackle these comments, we created a new issue: https://github.com/calgo-lab/green-db/issues/126