calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Add documentation about categories #87

Closed BigDatalex closed 1 year ago

BigDatalex commented 1 year ago

If the GS1 GPC migration takes some more time, it might be good to document our categories and the corresponding hierarchy already. For example, we have the more general category BAG and its "subcategory" BACKPACKS.

If we agreed on a standard for which category to use when multiple categories are assigned to a product, this would simplify the assignment of categories. For example, if a product is assigned to BAG and BACKPACK, one should use only BACKPACK. For reference see: https://github.com/calgo-lab/green-db/pull/83/files#r938658607
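The rule proposed above could be sketched roughly as follows. This is only an illustration: the `PARENT` table and the function names are hypothetical and not taken from the GreenDB code base, and only the BAG/BACKPACK relation comes from this issue.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical child -> parent table. Only BACKPACK -> BAG is taken from this
# issue; the remaining entries are placeholders for illustration.
PARENT: Dict[str, Optional[str]] = {
    "BAG": None,
    "BACKPACK": "BAG",
    "NIGHTWEAR": None,
    "SWIMWEAR": None,
}


def ancestors(category: str) -> Set[str]:
    """Return all parent categories of `category`, walking up the hierarchy."""
    result: Set[str] = set()
    parent = PARENT.get(category)
    while parent is not None:
        result.add(parent)
        parent = PARENT.get(parent)
    return result


def most_specific(categories: Iterable[str]) -> Optional[str]:
    """Return the single most specific category, or None if still ambiguous."""
    candidates = set(categories)
    # Remove every category that is an ancestor of another assigned category.
    for category in list(candidates):
        candidates -= ancestors(category)
    return candidates.pop() if len(candidates) == 1 else None
```

With this table, a product assigned to both BAG and BACKPACK resolves to BACKPACK, while two unrelated categories stay unresolved and would still need a documented special case.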

en-GB commented 1 year ago

If we agreed on a standard for which category to use when multiple categories are assigned to a product, this would simplify the assignment of categories. For example, if a product is assigned to BAG and BACKPACK, one should use only BACKPACK. For reference see: https://github.com/calgo-lab/green-db/pull/83/files#r938658607

I originally suggested this to fix some problems with otto.de, and that would still be the only shop where it's useful. There is a much simpler solution for otto where each product remembers the length of the original SERP URL that led to it being scraped. Then if you scrape the same product twice you simply keep the one with the shortest SERP URL.

This could be used in future shops too, although you would probably call the field something like mapping_weight instead of length_of_serp_url.

There are many cases where merging based on category does not work, like Bathrobes being listed under Swimwear. We map Bathrobes to NIGHTWEAR, but having NIGHTWEAR be a subcategory of SWIMWEAR on our end would not make sense. In fact, it would probably break things.

BigDatalex commented 1 year ago

You are right, there are cases where this would not be possible, but I think making use of the hierarchy would simplify the process for a lot of categories. And it would be great to actually document the special cases where this does not work and one would need to specify all subcategories, for example the Bathrobes/NIGHTWEAR case you gave, which occurs in some shops.

In addition, if we have overlooked something and did not assign proper mappings for merchant categories, or if the merchant does not provide clean categories and BACKPACKS are nonetheless part of the BAG category, we would still be able to retrieve the actual (more specific) category. I think all products with multiple categories are currently excluded in the export step because we have not defined anything yet.

If we want to keep track of the original SERP page/category, I would actually prefer to store the name of the merchant's category. That way we would be able to reassign product categories at some point if we want to switch to, or add, more fine-grained categories. And of course, this could also be used for assigning the most specific category. In my eyes this would be a nice feature. As far as I remember, I already suggested this in a meeting some time ago, but we decided to focus more on the sustainability information than on categories. I don't know if this has changed in the meantime with the decision to move towards GPC.
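A minimal sketch of the idea of storing the merchant's category and deriving our category from a mapping table, so that products can be reassigned later without re-scraping. The `ScrapedProduct` class, its field names, and the example mappings are assumptions made for illustration and do not reflect the actual GreenDB schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ScrapedProduct:
    # Hypothetical record layout, not the real GreenDB table.
    product_id: str
    merchant: str
    merchant_category: str  # the shop's own category name or path


# Mapping key: (merchant, merchant_category) -> our category.
CategoryMapping = Dict[Tuple[str, str], str]


def assign_category(product: ScrapedProduct, mapping: CategoryMapping) -> Optional[str]:
    """Look up our category for a product; None if the shop category is unmapped."""
    return mapping.get((product.merchant, product.merchant_category))


# Switching to a finer taxonomy then only means swapping the mapping table.
coarse: CategoryMapping = {("otto", "mode/shirts/t-shirts/"): "SHIRT"}
fine: CategoryMapping = {("otto", "mode/shirts/t-shirts/"): "TSHIRT"}
product = ScrapedProduct("p1", "otto", "mode/shirts/t-shirts/")
```

Because the merchant category is kept on the product, applying `fine` instead of `coarse` reassigns existing rows without touching the scraper.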

se-jaeger commented 1 year ago

There is a much simpler solution for otto where each product remembers the length of the original SERP URL that led to it being scraped. Then if you scrape the same product twice you simply keep the one with the shortest SERP URL.

Hui, this sounds really hacky.

Agree with @BigDatalex's initial suggestion. We should definitely make use of the hierarchy and assign the most specific category, which implicitly means all parent categories are assigned as well. I'm more than happy to re-assign those weird cases (e.g., Bathrobes - SWIMWEAR - NIGHTWEAR) and prefer a clean structure. Further, from my point of view, we should use GPC's hierarchy instead of the shops' and aim for simplicity whenever possible. Maintaining many special cases should be avoided.

en-GB commented 1 year ago

I think both of these were my suggestions initially. But to clarify what I mean, I don't actually think either is necessary right now. Just accepting that we won't get sneakers/tshirts from otto is perfectly reasonable.

If we do end up needing tie-breaking, we should let each startjob decide individually how important each mapping is, e.g. with a mapping_weight: float or whatever. This is actually easy on the authoring side. For all shops we have implemented right now except otto, the weight would always be zero.

On otto (and any other shop where categories are nested paths), setting the mapping weight to the length of the path is enough. If A is a subpath of B then B is a proper prefix of A so A is longer than B. So merging based on path length will always preserve the most specific mapping.
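The path-length argument above can be sketched as follows. The function names and the tuple layout are hypothetical, the rule that the higher weight wins is an assumption derived from the prefix reasoning, and the TSHIRT mapping for `mode/shirts/t-shirts/` is assumed for illustration.

```python
from typing import Dict, Iterable, Tuple


def path_weight(category_path: str) -> float:
    """Weight a nested shop category by the length of its path string.

    A nested path has its parent as a proper prefix, so it is always longer.
    """
    return float(len(category_path))


def merge_by_weight(scraped: Iterable[Tuple[str, str, float]]) -> Dict[str, str]:
    """Keep, per product id, the category whose mapping has the highest weight.

    `scraped` yields (product_id, category, mapping_weight) tuples. On equal
    weights (e.g. all zero for shops without nested paths) the first one wins.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for product_id, category, weight in scraped:
        if product_id not in best or weight > best[product_id][1]:
            best[product_id] = (category, weight)
    return {pid: category for pid, (category, _) in best.items()}


rows = [
    ("p1", "SHIRT", path_weight("mode/shirts/")),
    ("p1", "TSHIRT", path_weight("mode/shirts/t-shirts/")),
    ("p2", "SHIRT", path_weight("mode/shirts/")),
]
merged = merge_by_weight(rows)
```

Here `p1` was found under both paths and keeps the category of the longer, more specific one, while `p2` only appeared under `mode/shirts/` and stays SHIRT.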

BigDatalex commented 1 year ago

But to clarify what I mean, I don't actually think either is necessary right now. Just accepting that we won't get sneakers/tshirts from otto is perfectly reasonable.

Hmm ... I am a little confused now 😖 In your current PR #83 we actually do get SNEAKERS and T-SHIRT from otto: https://github.com/calgo-lab/green-db/blob/b6af49b1c88354162048b1438c5634b95504d266/scraping/scraping/start_scripts/otto_de.py#L39 https://github.com/calgo-lab/green-db/blob/b6af49b1c88354162048b1438c5634b95504d266/scraping/scraping/start_scripts/otto_de.py#L55

and by "get" I mean that we scrape these products and store them in the database. Do you mean that we do not "get" them, because in the export step we are excluding all products that are assigned to multiple categories?

And the multiple categories are assigned because we use SHIRT for the general category "mode/shirts/", which basically includes "mode/shirts/t-shirts/", right? https://github.com/calgo-lab/green-db/blob/b6af49b1c88354162048b1438c5634b95504d266/scraping/scraping/start_scripts/otto_de.py#L54-L55

en-GB commented 1 year ago

Do you mean that we do not "get" them, because in the export step we are excluding all products that are assigned to multiple categories?

And the multiple categories are assigned because we use SHIRT for the general category "mode/shirts/", which basically includes "mode/shirts/t-shirts/", right?

That's correct.

se-jaeger commented 1 year ago

Ok now I understand 😅

Probably the easiest solution here is to only map the most specific categories of the shops. So we don't use mode/shirts but map all mode/shirts/whatever. We will have many mappings to SHIRT but this does not hurt. Is this feasible or are there too many categories?
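To illustrate mapping only the most specific shop categories, a small sketch that filters a list of category paths down to its leaves. The `leaf_paths` helper is hypothetical, and `mode/shirts/poloshirts/` is an invented example path, not one confirmed for any shop.

```python
from typing import List


def leaf_paths(paths: List[str]) -> List[str]:
    """Keep only paths that are not a prefix of another path in the list."""
    # Normalize to a trailing slash so "mode/shirts" does not match "mode/shirtsleeves/".
    normalized = [p if p.endswith("/") else p + "/" for p in paths]
    return [
        p
        for p in normalized
        if not any(other != p and other.startswith(p) for other in normalized)
    ]


shop_paths = ["mode/shirts", "mode/shirts/t-shirts/", "mode/shirts/poloshirts/"]
leaves = leaf_paths(shop_paths)
```

Only the two nested paths remain, so each could be mapped (possibly both to SHIRT) without the general `mode/shirts/` page assigning a second category to the same product.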

se-jaeger commented 1 year ago

As discussed offline, we will move to GS1 taxonomy soon. This requires documentation, so this will be part of it.

Regarding mode/shirts and mode/shirts/whatever: for all scrapes but Otto, we only use the most specific categories, i.e., mode/shirts/whatever. However, Otto does not distinguish between those, which is why we stick to scraping both, even though this will end up in "duplicated" products that have both categories, SHIRT and TSHIRT.