calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Amazon CPF products html sometimes does not list any `sustainability_labels` #88

Closed BigDatalex closed 2 years ago

BigDatalex commented 2 years ago

In the last scraping run on 2022-08-19, 294 of 1227 Amazon products resulted in an UNKNOWN certificate. This happens across all countries. Re-scraping the 139 DE products among them on 2022-08-24 showed that 105 do not have a certificate at all, while the remaining 34 do have one.

I would suggest excluding products without any sustainability information from the extraction, because we normally use the certificate UNKNOWN only if a product has sustainability information but we have no rule defined to map it. For the small share of products that later get a sustainability label, I would hope to extract this information in another (later) scraping run.

WDYT?

se-jaeger commented 2 years ago

If the HTML does not contain any sustainability information at all, I would drop the product in the extraction step.

It would be good to know whether there is/was a problem on the Amazon side (filtering) or whether the scraping failed to fetch the HTML properly. I created an issue to make sure that products without sustainability information are dropped (#89). I would wait for another scraping run before we start investigating the root cause; maybe it was just an Amazon problem ;)
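
For illustration, the rule discussed in this thread (drop a product when the HTML lists no sustainability information, and use UNKNOWN only when a label is present but has no mapping rule) could be sketched roughly as follows. This is a minimal sketch, not the actual green-db extractor: `LABEL_MAPPING`, `UNKNOWN`, and `extract_sustainability_labels` are hypothetical names, and the single mapping entry is a made-up example.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from label text found in the product HTML to a
# certificate name. The real project defines its own rules.
LABEL_MAPPING: Dict[str, str] = {
    "Global Organic Textile Standard": "GOTS",
}

# Hypothetical placeholder for "label present, but no mapping rule defined".
UNKNOWN = "UNKNOWN"


def extract_sustainability_labels(raw_labels: List[str]) -> Optional[List[str]]:
    """Map raw label texts to certificates.

    Returns None when the product lists no sustainability information,
    which signals to the caller that the product should be dropped.
    """
    # Ignore empty or whitespace-only entries.
    cleaned = [label.strip() for label in raw_labels if label.strip()]
    if not cleaned:
        return None  # no sustainability information: drop the product

    # Labels without a mapping rule fall back to UNKNOWN.
    mapped = {LABEL_MAPPING.get(label, UNKNOWN) for label in cleaned}
    return sorted(mapped)
```

With this shape, a caller would skip any product for which the function returns `None`, so UNKNOWN keeps its intended meaning of "sustainability information exists, but no rule maps it".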