calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Zalando extractor fails #62

Closed BigDatalex closed 2 years ago

BigDatalex commented 2 years ago

The Zalando extractor fails for all products in the shoes category (most likely in all other categories as well):

14:44:27 extract: workers.extract.extract_and_write_to_green_db('zalando., 569) (fb132e4a-b5ff-4685-8ae3-7d80f0847ab0) 
2022-04-21 14:44:28,097 - INFO - extract.extractors.zalando: 1 validation error for Product sustainability_labels ensure this value has at least 1 items (type=value_error.list.min_items; limit_value=1) 

The received data is not in the format our extractor relies on. The schema.org extraction is not affected, but the extraction of the sustainability information is.
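A minimal sketch of why an empty label list trips the validation. It uses a plain dataclass as a stand-in for the actual pydantic `Product` model, so the class layout, the `name` field and the `try_build` helper are assumptions for illustration and not the GreenDB code:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Product:
    """Hypothetical stand-in for the real pydantic Product model."""

    name: str
    sustainability_labels: List[str]

    def __post_init__(self) -> None:
        # Mirror the "at least 1 items" constraint seen in the log message.
        if len(self.sustainability_labels) < 1:
            raise ValueError(
                "sustainability_labels: ensure this value has at least 1 items"
            )


def try_build(name: str, labels: List[str]) -> str:
    """Return 'ok', or the validation message if the product is rejected."""
    try:
        Product(name=name, sustainability_labels=labels)
        return "ok"
    except ValueError as err:
        return str(err)
```

If the sustainability part of the page cannot be parsed, the extractor ends up with an empty list, and a call like `try_build("shoe", [])` returns the validation message instead of `"ok"`.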

en-GB commented 2 years ago

I can't replicate this.

en-GB commented 2 years ago

When I go to this page: https://www.zalando.fr/adidas-originals-stan-smith-unisex-baskets-basses-white-ad115o181-a11.html and search for certificate__title in the browser dev tools, I get one hit. However, if I do it with JS disabled, I get nothing.

This tells me that the sustainability info is filled in by some script.
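The same check can be scripted roughly like this. It is only a sketch: it assumes `certificate__title` shows up as (part of) a `class` attribute, and the `ClassFinder` and `count_class_hits` names are made up for illustration:

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags whose class attribute contains a given substring."""

    def __init__(self, needle: str) -> None:
        super().__init__()
        self.needle = needle
        self.hits = 0

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            # value is None for attributes without a value, so guard for it.
            if key == "class" and value and self.needle in value:
                self.hits += 1


def count_class_hits(html: str, needle: str = "certificate__title") -> int:
    """Return how many elements in the HTML carry the needle in their class."""
    finder = ClassFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.hits
```

Running it once on the raw HTML (no JS) and once on the rendered DOM should give zero hits and at least one hit respectively, if the info really is injected by a script.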

BigDatalex commented 2 years ago

To replicate this issue you have to use Splash. You can either start a scraping job and check the results in the DB, or do it via scrapy shell and yield a request via Splash.
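For a quick look at what Splash returns without going through scrapy, one option is to call the Splash HTTP API (`render.html`) directly. This is a different route than the scrapy shell one described above, and the sketch assumes a Splash instance on `http://localhost:8050`; the `splash_render_url` helper and the 2 second wait are made up for illustration:

```python
from urllib.parse import urlencode


def splash_render_url(
    page_url: str,
    splash_base: str = "http://localhost:8050",  # assumed local Splash instance
    wait: float = 2.0,  # seconds to let scripts run before the HTML is returned
) -> str:
    """Build a render.html URL that asks Splash for the JS-rendered page."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"
```

Opening the resulting URL in a browser (or fetching it with any HTTP client) should return the HTML as Splash sees it, which can then be compared against what the extractor expects.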