Closed BigDatalex closed 1 year ago
If we agreed on a standard for which category to use when multiple categories are assigned to a product, this would simplify the assignment of categories. For example, if a product is assigned to both BAG and BACKPACK, one should use only BACKPACK. For reference see: https://github.com/calgo-lab/green-db/pull/83/files#r938658607
I originally suggested this to fix some problems with otto.de, and that would still be the only shop where it's useful. There is a much simpler solution for otto where each product remembers the length of the original SERP URL that led to it being scraped. Then if you scrape the same product twice you simply keep the one with the shortest SERP URL.
This could be used in future shops too, although you would probably call the field something like mapping_weight instead of length_of_serp_url.
There are many cases where merging based on category does not work, like Bathrobes being listed under Swimwear. We map Bathrobes to NIGHTWEAR, but having NIGHTWEAR be a subcategory of SWIMWEAR on our end would not make sense. In fact, it would probably break things.
You are right, there are cases where this would not be possible, but I think making use of the hierarchy would simplify the process for a lot of categories. It would also be great to actually document the special cases where this does not work and one would need to specify all subcategories, for example the Bathrobe/NIGHTWEAR case you gave, which occurs in some shops.
In addition, if we have overlooked something and did not assign proper mappings for merchant categories, or if the merchant does not provide clean categories and BACKPACKS are nonetheless part of the BAG category, we would still be able to retrieve the actual (more specific) category. I think currently all products with multiple categories are excluded in the export step because we have not defined anything yet.
If we want to keep track of the original SERP page/category, I would actually prefer to store the name of the merchant's category. That way we would be able to reassign product categories at some point, if we want to switch to or add more fine-grained categories. And of course, this could also be used for assigning the most specific category. In my eyes, this would be a nice feature. As far as I remember, I already suggested this in a meeting some time ago, but we decided that we want to focus more on the sustainability information than on categories. I don't know if this has changed in the meantime with the decision to move towards GPC.
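To illustrate the idea of storing the merchant's category, here is a minimal sketch. It assumes a hypothetical `Product` record with a `merchant_category` field and two made-up mapping tables; none of these names come from green-db. The point is only that, once the raw merchant category is kept, switching to a finer-grained taxonomy is a lookup rather than a re-scrape.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional

# Hypothetical mapping tables (assumptions for illustration only).
COARSE_MAPPING: Dict[str, str] = {
    "mode/shirts/": "SHIRT",
    "mode/shirts/t-shirts/": "SHIRT",
}
FINE_MAPPING: Dict[str, str] = {
    "mode/shirts/": "SHIRT",
    "mode/shirts/t-shirts/": "TSHIRT",
}


@dataclass(frozen=True)
class Product:
    product_id: str
    merchant_category: str  # raw category as named by the shop
    category: Optional[str] = None  # our own category, derived later


def assign_category(product: Product, mapping: Dict[str, str]) -> Product:
    """Return a copy of the product whose category is looked up from
    the stored merchant category; unknown categories map to None."""
    return replace(product, category=mapping.get(product.merchant_category))


if __name__ == "__main__":
    scraped = Product("123", "mode/shirts/t-shirts/")
    print(assign_category(scraped, COARSE_MAPPING).category)
    print(assign_category(scraped, FINE_MAPPING).category)
```

Because the merchant category is stored unchanged, the same product can be remapped whenever the mapping table changes.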
> There is a much simpler solution for otto where each product remembers the length of the original SERP url that led to it being scraped. Then if you scrape the same product twice you simply keep the one with the shortest SERP url.
Hui, this sounds really hacky.
I agree with @BigDatalex's initial suggestion. We should definitely make use of the hierarchy and assign the most specific category, which implicitly means all parent categories are assigned as well. I'm more than happy to re-assign those weird cases (e.g., Bathrobes - SWIMWEAR - NIGHTWEAR) and prefer a clean structure. Further, from my point of view, we should use GPC's hierarchy instead of the shops' hierarchies and aim for simplicity whenever possible. Maintaining many special cases should be avoided.
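A minimal sketch of "assign the most specific category" could look like the following. The `PARENT` table is a hypothetical child-to-parent hierarchy made up for illustration (it is neither the green-db category list nor GPC); the function drops every assigned category that is an ancestor of another assigned one.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical hierarchy: each category points to its parent (None = top level).
PARENT: Dict[str, Optional[str]] = {
    "BAG": None,
    "BACKPACK": "BAG",
    "SHIRT": None,
    "TSHIRT": "SHIRT",
}


def ancestors(category: str) -> Set[str]:
    """Collect all parents of a category by walking up the hierarchy."""
    found: Set[str] = set()
    parent = PARENT.get(category)
    # The `not in found` check guards against accidental cycles in the table.
    while parent is not None and parent not in found:
        found.add(parent)
        parent = PARENT.get(parent)
    return found


def most_specific(categories: Iterable[str]) -> Set[str]:
    """Remove every category that is implied by a more specific one."""
    assigned = set(categories)
    implied: Set[str] = set()
    for category in assigned:
        implied |= ancestors(category)
    return assigned - implied


if __name__ == "__main__":
    print(most_specific({"BAG", "BACKPACK"}))
```

If the result still contains more than one category, the product sits in unrelated branches (the Bathrobe case), which is exactly the kind of special case that would need to be documented or re-assigned by hand.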
I think both of these were my suggestions initially. But to clarify what I mean, I don't actually think either is necessary right now. Just accepting that we won't get sneakers/tshirts from otto is perfectly reasonable.
If we do end up needing tie-breaking, we should let each start job decide individually how important each mapping is, e.g. with a mapping_weight: float or whatever. This is actually easy on the authoring side. For all shops we have implemented right now except otto, the weight would always be zero.
On otto (and any other shop where categories are nested paths), setting the mapping weight to the length of the path is enough. If A is a subpath of B then B is a proper prefix of A so A is longer than B. So merging based on path length will always preserve the most specific mapping.
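The tie-breaking described above can be sketched roughly as follows. `ScrapedProduct`, `path_weight`, and `merge_by_weight` are hypothetical names, and the sketch assumes the weight is simply the character length of the shop's category path, so that the longer (more specific) path wins when the same product was scraped twice.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ScrapedProduct:
    product_id: str
    category: str
    mapping_weight: float = 0.0  # zero for shops without nested paths


def path_weight(category_path: str) -> float:
    """Weight for nested-path shops: a sub-path is always longer
    than the path it extends, so length orders by specificity."""
    return float(len(category_path))


def merge_by_weight(products: Iterable[ScrapedProduct]) -> List[ScrapedProduct]:
    """Keep one entry per product_id, preferring the highest weight.
    On equal weights the first scraped entry is kept."""
    best: Dict[str, ScrapedProduct] = {}
    for product in products:
        current = best.get(product.product_id)
        if current is None or product.mapping_weight > current.mapping_weight:
            best[product.product_id] = product
    return list(best.values())


if __name__ == "__main__":
    scraped = [
        ScrapedProduct("42", "SHIRT", path_weight("mode/shirts/")),
        ScrapedProduct("42", "TSHIRT", path_weight("mode/shirts/t-shirts/")),
    ]
    print(merge_by_weight(scraped))
```

With all weights at zero, as for the other shops, the merge simply keeps the first occurrence and changes nothing else.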
> but to clarify what i mean, i dont actually think either is necessary right now. Just accepting that we wont get sneakers/tshirts from otto is perfectly reasonable.
Hmm ... I am a little confused now 😖 In your current PR #83 we actually do get SNEAKERS and T-SHIRT from otto:
https://github.com/calgo-lab/green-db/blob/b6af49b1c88354162048b1438c5634b95504d266/scraping/scraping/start_scripts/otto_de.py#L39
https://github.com/calgo-lab/green-db/blob/b6af49b1c88354162048b1438c5634b95504d266/scraping/scraping/start_scripts/otto_de.py#L55
and by "get" I mean that we scrape these products and store them in the database. Do you mean that we do not "get" them, because in the export step we are excluding all products that are assigned to multiple categories?
And the multiple categories are assigned because we use SHIRT for the general category "mode/shirts/", which basically includes "mode/shirts/t-shirts/", right?
https://github.com/calgo-lab/green-db/blob/b6af49b1c88354162048b1438c5634b95504d266/scraping/scraping/start_scripts/otto_de.py#L54-L55
> Do you mean that we do not "get" them, because in the export step we are excluding all products that are assigned to multiple categories?
> And the multiple categories are assigned because we use SHIRT for the general category "mode/shirts/", which basically includes "mode/shirts/t-shirts/", right?
That's correct.
Ok now I understand 😅
Probably the easiest solution here is to only map the most specific categories of the shops. So we don't use mode/shirts but map all mode/shirts/whatever. We will have many mappings to SHIRT, but this does not hurt. Is this feasible, or are there too many categories?
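Whether this is feasible depends on how many leaf categories a shop has. As a rough sketch, the most specific paths can be picked out of a list of category paths like this; `leaf_paths` is a hypothetical helper, and "mode/shirts/polos/" is a made-up path used only as a second example next to "mode/shirts/t-shirts/".

```python
from typing import Iterable, List


def leaf_paths(paths: Iterable[str]) -> List[str]:
    """Return only the most specific category paths, i.e. those that
    are not a prefix of any other path in the input."""
    # Normalize to exactly one trailing slash so that "mode/shirts"
    # and "mode/shirts/" are treated as the same category.
    normalized = sorted({path.rstrip("/") + "/" for path in paths})
    return [
        path
        for path in normalized
        if not any(other != path and other.startswith(path) for other in normalized)
    ]


if __name__ == "__main__":
    shop_paths = [
        "mode/shirts",
        "mode/shirts/t-shirts/",
        "mode/shirts/polos/",  # hypothetical sibling category
    ]
    print(leaf_paths(shop_paths))
```

Each remaining leaf path would then get its own mapping entry, with several of them pointing to SHIRT.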
As discussed offline, we will move to GS1 taxonomy soon. This requires documentation, so this will be part of it.
Regarding mode/shirts and mode/shirts/whatever: for all scrapes but Otto, we only use the most specific categories, i.e., mode/shirts/whatever. However, Otto does not distinguish between those, which is why we stick to scraping both, even though this will result in "duplicated" products that have both categories, SHIRT and TSHIRT.
If the GS1 GPC migration is going to take some more time, it might be good to already document our categories and the corresponding hierarchy. For example, we have the more general category BAG and its "subcategory" BACKPACKS. If we agreed on a standard for which category to use when multiple categories are assigned to a product, this would simplify the assignment of categories. For example, if a product is assigned to both BAG and BACKPACK, one should use only BACKPACK. For reference see: https://github.com/calgo-lab/green-db/pull/83/files#r938658607