calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

`certificate:UNAVAILABLE` appears in otto fashion categories #119

Closed BigDatalex closed 1 year ago

BigDatalex commented 1 year ago

I just noticed that due to the last changes the UNAVAILABLE certificate shows up also in fashion categories. Not sure if that is a problem, but I thought it's a good idea to discuss that and be aware of it.

For example, the following query:

    SELECT COUNT(id) AS "count", category
    FROM "green-db"
    WHERE "sustainability_labels" && '{certificate:UNAVAILABLE}'::TEXT[]
      AND "timestamp" = (SELECT MAX("timestamp") FROM "green-db")
    GROUP BY category
    ORDER BY "count" DESC;

returns (top 5):

| count | category |
|---|---|
| 11573 | SHOES |
| 5473 | UNDERWEAR |
| 3640 | HEADPHONES |
| 1496 | LAPTOP |
| 1093 | JACKET |

And regarding #118, we should discuss whether we want to export products with that "label" to Zenodo.

itrajanovska commented 1 year ago

Good point @BigDatalex, but I think this might be a byproduct of their HTML update, which happened in the same week. I'm going to inspect that manually on several HTML pages of fashion products. In that case it might be better to return UNAVAILABLE only if the product comes from a certain merchant and category. What do you think?

BigDatalex commented 1 year ago

@itrajanovska I think the best would be to have something working independent from the product category. Maybe we can access the request.meta within the extractor and check if it includes the SUSTAINABILITY_FILTER. We would then allow UNAVAILABLE only for products that were retrieved without using the SUSTAINABILITY_FILTER.

https://github.com/calgo-lab/green-db/blob/7ab12c99ef0a6360540dbcc7129fa757f6235312/scraping/scraping/spiders/_base.py#L307-L314
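A minimal sketch of that idea, not the actual GreenDB extractor code. It assumes the spider records whether the filter was applied under a hypothetical `used_sustainability_filter` key in the request's meta dict; `resolve_labels` and the example label are illustrative names as well.

```python
from typing import Any, Dict, List

UNAVAILABLE = "certificate:UNAVAILABLE"


def resolve_labels(found_labels: List[str], meta: Dict[str, Any]) -> List[str]:
    """Decide which sustainability labels a product ends up with.

    `meta` stands in for the meta dict of the request that fetched the product.
    The key "used_sustainability_filter" is an assumption made for this sketch.
    """
    if found_labels:
        # Real labels were extracted, so keep them unchanged.
        return list(found_labels)

    if meta.get("used_sustainability_filter", False):
        # The page was requested with the filter, yet the product has no labels:
        # return nothing, so a non-empty-labels constraint can drop the product.
        return []

    # No filter was used, so an unlabelled product is expected here.
    return [UNAVAILABLE]


print(resolve_labels([], {"used_sustainability_filter": True}))
print(resolve_labels([], {}))
```

With this shape the decision does not depend on merchant or category, only on how the page was requested.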

itrajanovska commented 1 year ago

> @itrajanovska I think the best would be to have something working independent from the product category. Maybe we can access the request.meta within the extractor and check if it includes the SUSTAINABILITY_FILTER. We would then allow UNAVAILABLE only for products that were retrieved without using the SUSTAINABILITY_FILTER.
>
> https://github.com/calgo-lab/green-db/blob/7ab12c99ef0a6360540dbcc7129fa757f6235312/scraping/scraping/spiders/_base.py#L307-L314

Thanks Alex,

But my assumption is that even before the UNAVAILABLE label was added, we scraped (paginated) pages that contained products without any sustainability information, although our filter "?nachhaltigkeit=alle-nachhaltigen-artikel" was already set. In the past, those products failed the constraint that requires a non-empty sustainability_labels field, so they were always scraped but never extracted into the GreenDB.

So, in my opinion it would be better to resolve that (the pagination?) issue, and maybe avoid scraping unnecessary pages in the first place. What do you think?

Also, I might be wrong if that was the initial system design and we scrape those pages intentionally. If that's the case, I'll deal with it by inspecting the sustainability filter in the URL.
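Inspecting the filter in the URL could look roughly like the sketch below. It only uses the query parameter quoted above; the function name `has_sustainability_filter` is made up for illustration and is not taken from the repository.

```python
from urllib.parse import parse_qs, urlparse

# Parameter and value as they appear in the filter string quoted above.
FILTER_PARAM = "nachhaltigkeit"
FILTER_VALUE = "alle-nachhaltigen-artikel"


def has_sustainability_filter(url: str) -> bool:
    """Return True if the URL's query string carries the sustainability filter."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each parameter name to a list of its values.
    return FILTER_VALUE in query.get(FILTER_PARAM, [])


print(has_sustainability_filter(
    "https://www.otto.de/schuhe/hausschuhe/?nachhaltigkeit=alle-nachhaltigen-artikel"
))
```

Parsing the query string instead of doing a substring match keeps the check robust if other parameters are added to the URL.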

Update 17.02.2023

After manually inspecting OTTO's website, we concluded the following: when there is no pagination, i.e., the category has only a few products, OTTO shows a section at the bottom of the page called Ähnliche Artikel ("Similar articles"). An example can be seen here: https://www.otto.de/schuhe/hausschuhe/?nachhaltigkeit=alle-nachhaltigen-artikel

In this Ähnliche Artikel section, OTTO also lists products from other categories, and most of the time they aren't even sustainable. In the past, these products were scraped but not extracted. But since we added the feature of extracting unsustainable products from the ELECTRONICS category, the change also affects the unsustainable FASHION products that sometimes appear under Ähnliche Artikel, which are therefore exported with an UNAVAILABLE label.

So the issue for the fashion products is handled here: https://github.com/calgo-lab/green-db/pull/121