calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Zalando extractor fails to extract sustainability labels #74

Closed en-GB closed 1 year ago

en-GB commented 2 years ago

We used to extract labels directly from the rendered HTML. Since Splash is no longer able to render Zalando product pages, we now extract them from the JSON data instead (see https://github.com/calgo-lab/green-db/blob/bf77115617dc68cc91ba6c2cfdb3a79588ec0e26/extract/extract/extractors/zalando.py#L152), but occasionally some labels are missing. This only affects ~10 products in any given run, and I've only seen it happen on zalando.co.uk.

Switching the Zalando scraper to Playwright would probably fix it, though.
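For illustration only, here is a rough sketch of what pulling labels out of JSON embedded in a page could look like. It is not the code in `zalando.py`: the key name `sustainabilityLabels`, the helper names, and the assumption that the data sits in `<script type="application/json">` elements are all hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonScriptCollector(HTMLParser):
    """Collects the text of every <script type="application/json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            self.payloads.append("".join(self._buffer))


def find_values(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def extract_labels(html, key="sustainabilityLabels"):
    """Return the sorted, de-duplicated labels found under the (hypothetical) key."""
    collector = JsonScriptCollector()
    collector.feed(html)
    labels = []
    for payload in collector.payloads:
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip script blocks that are not valid JSON
        for value in find_values(data, key):
            if isinstance(value, list):
                labels.extend(str(v) for v in value)
    return sorted(set(labels))
```

Searching the whole structure rather than following one fixed path is a design choice that tolerates layout changes in the JSON, but it cannot recover labels that are simply absent from the payload, which may be what happens on the affected products.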

se-jaeger commented 2 years ago

With the latest changes from #79, the extractor can't find any sustainability labels, so no products are created.

BigDatalex commented 2 years ago

I just updated the Zalando extractor. It was only a minor change, needed because a class name in the HTML had changed. See: https://github.com/calgo-lab/green-db/commit/53601e8d5c9908e63c04559bdd5d5bc806753471

BigDatalex commented 2 years ago

There are two commits from @en-GB that might be more robust and improve the extraction of the Zalando sustainability labels, see:

We (@en-GB) should check whether these behave the same as the original approach (i.e. extract the same sustainability labels) or whether there are any implications. So far, in our Zalando tests, they achieve the same results.
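A check like that could be sketched roughly as follows. This is only an assumption about how one might compare the two approaches: `compare_extractors`, `extract_old`, and `extract_new` are hypothetical names and not functions from the repo.

```python
def compare_extractors(pages, extract_old, extract_new):
    """Return {page_id: (only_old, only_new)} for pages where the label sets differ.

    `pages` maps a page identifier to its raw content; both extractor
    callables are expected to return an iterable of label strings.
    """
    mismatches = {}
    for page_id, html in pages.items():
        old = set(extract_old(html))
        new = set(extract_new(html))
        if old != new:
            # Record which labels each approach found that the other did not.
            mismatches[page_id] = (sorted(old - new), sorted(new - old))
    return mismatches
```

An empty result over a set of saved product pages would mean both approaches extract the same labels on that sample; any entry shows exactly which labels one approach misses.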

se-jaeger commented 1 year ago

@en-GB what's the status of this one? Is this still an issue, or can we just close it? Especially after the latest changes in #83.