Closed BigDatalex closed 2 years ago
If the HTML does not contain any sustainability information at all, I would drop the product in the extraction step.
It would be good to know whether there is/was a problem on the Amazon side (filtering) or whether the scraping failed to get the HTML properly. I created an issue to address that products without sustainability information should be dropped (#89). I would wait for another scraping run before we start investigating the root cause; maybe it was just an Amazon problem ;)
In the last scraping run from 2022-08-19, 294 of 1227 Amazon products resulted in an UNKNOWN certificate. This issue happens across all countries. Scraping the 139 DE products among these again on 2022-08-24 showed that 105 of them do not actually have a certificate at all, while the remaining 34 do have one.
I would suggest excluding the products that have no sustainability information from the extraction, because we normally use the certificate UNKNOWN only if a product has sustainability information but we have no rule defined to map it. For the minority of products that gain a sustainability label later, I would hope to extract this information in a later scraping run.
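A minimal sketch of the proposed rule, not the project's actual extractor: products whose HTML has no sustainability section are dropped, and UNKNOWN is kept only for products that have such a section but no mapping rule. The section marker, the label mapping, and the function names `extract_certificate` and `extract_products` are all hypothetical, chosen only for illustration.

```python
import re
from typing import Dict, Optional

# Hypothetical marker for the sustainability section; the real selector is not given here.
SUSTAINABILITY_SECTION = re.compile(r'<div id="sustainability">(.*?)</div>', re.DOTALL)

# Hypothetical mapping rules from label text to certificate name.
KNOWN_LABELS = {"Global Organic Textile Standard": "GOTS"}

UNKNOWN = "UNKNOWN"


def extract_certificate(html: str) -> Optional[str]:
    """Return a certificate, UNKNOWN, or None if there is no sustainability section."""
    match = SUSTAINABILITY_SECTION.search(html)
    if match is None:
        # No sustainability information at all: signal that the product should be dropped.
        return None
    section_text = match.group(1)
    for label_text, certificate in KNOWN_LABELS.items():
        if label_text in section_text:
            return certificate
    # Sustainability information exists, but no rule maps it yet.
    return UNKNOWN


def extract_products(pages: Dict[str, str]) -> Dict[str, str]:
    """Map product IDs to certificates, skipping products without sustainability info."""
    results = {}
    for product_id, html in pages.items():
        certificate = extract_certificate(html)
        if certificate is None:
            continue  # excluded from the extraction
        results[product_id] = certificate
    return results
```

With this split, a product that gains a label later is simply picked up by a later scraping run, since it is never stored with a misleading UNKNOWN in the meantime.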
WDYT?