calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Amazon 'description' can contain an empty string #81

Closed BigDatalex closed 1 year ago

BigDatalex commented 2 years ago

I just noticed that the description for a lot of Amazon products contains an empty string. Is this intentional? A quick analysis showed that we have an empty description for 2437 of 10393 unique URLs/products.

The empty string is set in this return method: https://github.com/calgo-lab/green-db/blob/258403381ee5916527a770f6f7e74a540f9d5cdb/extract/extract/extractors/amazon.py#L288-L296

This is one example product where this occurred. Maybe it is possible to extend the description extractor so that, when no description is found, it extracts the bullet points ("À propos de cet article" / "About this product") instead?
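A minimal sketch of the fallback suggested above, assuming the description and the bullet points have already been pulled out of the page as plain strings. The function name `description_with_fallback` and its parameters are hypothetical and are not taken from `amazon.py`:

```python
from typing import Optional, Sequence


def description_with_fallback(
    description: Optional[str], bullet_points: Sequence[str]
) -> str:
    """Return the description, or the joined bullet points if it is empty.

    Hypothetical helper: the real extractor may structure this differently.
    """
    cleaned = (description or "").strip()
    if cleaned:
        return cleaned

    # Drop empty or whitespace-only bullets before joining.
    bullets = [b.strip() for b in bullet_points if b and b.strip()]

    # May still be an empty string if the page has no bullet points either.
    return ". ".join(bullets)
```

Note that this can still return an empty string when a product has neither a description nor bullet points.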

felixbiessmann commented 2 years ago

Yes, this is often the case - the relevant information is always the union of title, description and bullet points (which is the case in the above example).

BigDatalex commented 2 years ago

In addition, the above product includes the indice de réparabilité. So far we do not have this in our sustainability-labels.json, but maybe it's worth adding? It seems to be similar to the iFixit Reparierbarkeitsindex. We probably have not seen this one yet because it is not listed in the climate pledge section.

felixbiessmann commented 2 years ago

That's amazing, it would be great to have that in our database. I wasn't aware this is already available ... but maybe it's a bit more complicated to express that in our current label schema?

BigDatalex commented 2 years ago

I just sent it to Lena so that she can add it to our list ;) But this label provides scores as doubles; so far we have only had integers with a few levels. So it might be a little tricky to integrate, and also to evaluate these scores for our sustainability-labels-evaluation.csv if we want to use the same precision. Let's wait for Lena's opinion on how credible this one is, and then we will see :)

BigDatalex commented 1 year ago

Sometimes Amazon products have neither bullet points nor a product description. For example see here. So it is actually possible to have an empty description.

If we do not want empty descriptions, we could just reuse the product name as the description, or concatenate, for example, the name, price and currency.

felixbiessmann commented 1 year ago

Often the title is descriptive enough. I'd suggest keeping the option of having an empty string as description in the GreenDB, and when using the description, for example for ML prediction tasks, I'd concatenate the title and description.
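A rough sketch of that concatenation for downstream ML use, assuming each product exposes a title (name) and a possibly empty description as strings. The helper name `text_for_model` and the separator default are illustrative assumptions, not part of the GreenDB code:

```python
def text_for_model(name: str, description: str, separator: str = " ") -> str:
    """Join title and description, skipping whichever part is empty.

    Hypothetical helper for building model input text; the stored
    description itself is left untouched (it may remain an empty string).
    """
    parts = [p.strip() for p in (name, description) if p and p.strip()]
    return separator.join(parts)
```

With an empty description this simply returns the title, so the empty string can stay in the database without affecting the model input.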