calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

otto filter '?nachhaltigkeit=alle-nachhaltigen-artikel' not available anymore #149

Closed BigDatalex closed 10 months ago

BigDatalex commented 10 months ago

Otto no longer supports the URL option ?nachhaltigkeit=alle-nachhaltigen-artikel for filtering all sustainable products. Our requests are therefore redirected to the standard category page with all products, so we scrape products that are not sustainable and scraping takes longer.
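A minimal sketch of how such a silently dropped filter could be detected, by comparing the query string of the requested URL with that of the URL we end up on after redirects. The helper name `filter_survived` and the example category path are hypothetical, and the sketch assumes the redirect removes the `nachhaltigkeit` parameter from the final URL, which has not been verified against Otto's actual behaviour:

```python
from urllib.parse import parse_qs, urlsplit


def filter_survived(requested_url: str, final_url: str, param: str = "nachhaltigkeit") -> bool:
    """Return True if `param` is present in both URLs with the same value.

    Hypothetical helper: `final_url` would be the URL of the response after
    any redirects have been followed.
    """
    requested = parse_qs(urlsplit(requested_url).query).get(param)
    final = parse_qs(urlsplit(final_url).query).get(param)
    return requested is not None and requested == final


# Example with a made-up category path (assumption, not taken from the scraper):
requested = "https://www.otto.de/example-category/?nachhaltigkeit=alle-nachhaltigen-artikel"
redirected = "https://www.otto.de/example-category/"
filter_was_dropped = not filter_survived(requested, redirected)
```

A check like this could log a warning or skip the page instead of scraping the unfiltered category.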

We need to change the filter variable here: https://github.com/calgo-lab/green-db/blob/9534d767f3edcfc78a0390949072e01b20be86e2/scraping/scraping/start_scripts/otto_de.py#L6

so that it lists all sustainability options available on Otto. For example, these are the options for the blouse category:

nachhaltigkeit=foerderung-sozialer-initiativen,kreislauffaehiges-design,materialien-aus-biologischem-anbau,naturkosmetik,recycelte-materialien,verbesserte-herstellung,verbesserte-rohstoffbeschaffung

There are probably additional options in other categories, though; this needs to be investigated. Maybe @en-GB or @AdriaSG can have a look at this?

en-GB commented 10 months ago

looks like this endpoint lists all available filters https://www.otto.de/leafcutter/filters?rule=(und.(ist.nachhaltigkeit._).(~.(v.1)))&fc=

so SUSTAINABILITY_FILTER = "?nachhaltigkeit=beruecksichtigt-tierwohl,energieeffiziente-nutzung,foerderung-sozialer-initiativen,kreislauffaehiges-design,materialien-aus-biologischem-anbau,naturkosmetik,recycelte-materialien,verbesserte-herstellung,verbesserte-rohstoffbeschaffung"?
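One way to keep the proposed value maintainable would be to build it from a list, so that options can be added or removed one per line. This is only a sketch: the list name `SUSTAINABILITY_OPTIONS` is hypothetical, and the values are simply the ones quoted above, not a verified complete set:

```python
# Option slugs as quoted in this thread (completeness not verified).
SUSTAINABILITY_OPTIONS = [
    "beruecksichtigt-tierwohl",
    "energieeffiziente-nutzung",
    "foerderung-sozialer-initiativen",
    "kreislauffaehiges-design",
    "materialien-aus-biologischem-anbau",
    "naturkosmetik",
    "recycelte-materialien",
    "verbesserte-herstellung",
    "verbesserte-rohstoffbeschaffung",
]

# Otto expects the options as one comma-separated query parameter value.
SUSTAINABILITY_FILTER = "?nachhaltigkeit=" + ",".join(SUSTAINABILITY_OPTIONS)
```

The resulting string is identical to the literal suggested above, so it could be dropped into the start script without changing behaviour.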