calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

How to handle size information if we retrieve more than 1 size per offer/product? #55

Closed BigDatalex closed 2 years ago

BigDatalex commented 2 years ago

Originally the "size" column was intended to hold just one size per product, but in the case of ASOS the product URL does not change when a user chooses a different size. So currently all available sizes for a product are joined into a single comma-separated string and stored in the "size" column of the database.

https://github.com/calgo-lab/green-db/blob/e3c89d6e2453472759edc371bbbc359fa2503025/extract/extract/extractors/asos.py#L47-L48

Comma separation breaks down as soon as the values themselves contain commas, e.g. shoe sizes with commas, because we can no longer recover the individual sizes from the column. So we should think of a better method. Maybe we also need to change the column type to a list here:

https://github.com/calgo-lab/green-db/blob/e3c89d6e2453472759edc371bbbc359fa2503025/database/database/tables.py#L149

like we did for the images:

https://github.com/calgo-lab/green-db/blob/e3c89d6e2453472759edc371bbbc359fa2503025/database/database/tables.py#L146
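To make the ambiguity concrete, here is a minimal sketch. The size values and the exact separator are assumptions for illustration and are not taken from the ASOS extractor. It shows that splitting a comma-joined string does not give back the original sizes, while keeping them as a list does.

```python
import json

# Hypothetical EU shoe sizes that use a decimal comma (illustrative values only).
sizes = ["38,5", "39", "40,5"]

# Joining with a bare comma, then splitting again, loses the original boundaries.
joined = ",".join(sizes)
recovered = joined.split(",")

# Keeping the values as a list (here round-tripped through JSON as a stand-in
# for a list-typed column) preserves each size exactly.
round_tripped = json.loads(json.dumps(sizes))

print(recovered)      # five fragments instead of three sizes
print(round_tripped)  # the original three sizes
```

A list-typed column avoids having to pick a separator that is guaranteed never to appear in a size label.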

se-jaeger commented 2 years ago

Offline discussions came to this:

A single row in the GreenDB then represents a product for a given market. This means the attributes show which values are available, and the URL points to one (potentially random) offer among these variations.

se-jaeger commented 2 years ago

Also add a "country" column of type string. It should follow the ISO 3166-1 alpha-2 country codes: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements
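A quick format check for such a column could look like the sketch below. The function name `looks_like_alpha2` is hypothetical, and the check only verifies the shape of the code (two uppercase ASCII letters). It does not confirm that the code is officially assigned, which would need the full list from the linked page.

```python
import re

# Two uppercase ASCII letters, e.g. "DE" or "FR". Shape check only.
_ALPHA2_PATTERN = re.compile(r"[A-Z]{2}")


def looks_like_alpha2(country: str) -> bool:
    """Return True if the value has the shape of an ISO 3166-1 alpha-2 code."""
    return _ALPHA2_PATTERN.fullmatch(country) is not None


print(looks_like_alpha2("DE"))
print(looks_like_alpha2("Germany"))
```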

BigDatalex commented 2 years ago

When changing the table structure we have to redeploy the DB or add the new table manually (without redeployment). How do we handle the data stored in the old table structure? Will we just "archive" it, e.g. by renaming the old table, or (try to) transform it into the new table structure? Transforming the old format into the new one should only be an issue for Asos and Amazon, because they are the only ones that use the size column. Transforming the color column should be possible for all merchants, because we do not use string concatenation for that one.

se-jaeger commented 2 years ago

Good to bring this up!

I would definitely not archive them. Let's try as well as we can to transform them. As you said, there are just a few scenarios we should take care of. It should be possible to alter the column type and "cast" strings into array[string] as a first step to change the DB schema. In a second step, we could read all data into a Python script, transform the color strings into lists, and write them back.
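The second step could be sketched roughly as follows. The helper name `legacy_value_to_list` and the `", "` separator are assumptions, and the real legacy rows would need to be inspected first to see how the sizes were actually joined. Reading from and writing back to the database is left out.

```python
from typing import List, Optional

# Assumption: legacy size strings were joined with a comma followed by a space.
LEGACY_SIZE_SEPARATOR = ", "


def legacy_value_to_list(
    value: Optional[str], separator: Optional[str] = None
) -> Optional[List[str]]:
    """Convert one legacy string cell into a list of strings.

    Without a separator the whole string becomes a one-element list (the color
    case). With a separator the string is split and empty parts are dropped
    (the size case). Missing values stay missing.
    """
    if value is None:
        return None
    if separator is None:
        return [value]
    return [part.strip() for part in value.split(separator) if part.strip()]


print(legacy_value_to_list("black"))
print(legacy_value_to_list("S, M, L", LEGACY_SIZE_SEPARATOR))
```

If the sizes were joined with a bare comma instead, values such as "38,5" cannot be recovered reliably, and those rows may need to be re-scraped rather than transformed.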

BigDatalex commented 2 years ago

Ok, sounds reasonable. What about the gender/age attribute? Has there already been a final decision on implementing it and on what the format should look like?