falcontechnologies / wex-2024

Work experience 2024
BSD 2-Clause "Simplified" License

Review the Rebrickable data set - Danvir #3

Open psftc opened 4 weeks ago

psftc commented 4 weeks ago

The data set used is from the Rebrickable website (see Rebrickable Downloads). This issue has several steps to complete:

1. Go to the downloads page above and identify the data files. We are interested only in the relational data and not the images. There should be 12 files, each ending with the file extension '.gz'. Note the date of the files.
2. Download each of these files manually to your local machine.
3. Expand each file and review the data. You should look for the following things: header, number of columns, number of rows, types of data in the different columns, and which columns are missing data.
4. The download page has a model shown in a diagram. Review this given the data you've observed. When reviewing, consider mentally connecting theme, set, inventory, parts.
5. In this ticket, add several paragraphs about your observations, which may include additional questions.
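The checks listed above (header, number of columns, number of rows, missing data) could also be run without expanding each file by hand. The following is a minimal sketch, assuming each download is a gzipped CSV whose first line is the header; the function name `profile_csv_gz` and the file name in the usage note are illustrative assumptions, not part of the Rebrickable site.

```python
import csv
import gzip
from collections import Counter


def profile_csv_gz(path):
    """Summarise one gzipped CSV file: header, shape, and empty cells per column."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # assumption: the first line is the header
        missing = Counter()
        n_rows = 0
        for row in reader:
            n_rows += 1
            for name, value in zip(header, row):
                if value == "":
                    missing[name] += 1  # count empty cells per column
    return {
        "header": header,
        "n_columns": len(header),
        "n_rows": n_rows,
        "missing": dict(missing),
    }
```

Calling `profile_csv_gz("themes.csv.gz")` (an assumed local file name) would return a dictionary with the four summary fields. Because the file is read as a stream, the row count is not capped by a spreadsheet's row limit.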

Danvir-j commented 3 weeks ago

Elements Data:

number of rows: 89089, number of columns: 4, types of data in the different columns: element id/ part number/ color id/ design id

Elements in Lego refer to a specific piece/part. For example, an element id of 4566309 refers to a black boat anchor, with the data 2564/0/2564. Initially, I was unsure of what design id meant, but you explained it as different variations/versions of the same part. Some design ids are missing, indicating some parts never received another variation/version.

Themes:

number of rows: 461, number of columns: 3, types of data in the different columns: theme id/ name/ parent id

Themes are used to help categorize sets. The id and name are self-explanatory, but the parent id is used if there is already a broader theme category that the theme could be included in. For example, Racers and Ferrari are two separate themes, but Ferrari has Racers as its parent id because Ferrari could also be listed under Racers. Racers has no parent id, as there is no broader existing theme it could be listed under.
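The parent id relationship can be followed upward until a theme with no parent is reached. This is a small sketch of that walk; the numeric ids below are made up for illustration, and only the Racers/Ferrari relationship comes from the observation above.

```python
def theme_path(theme_id, themes):
    """Return theme names from the top-level ancestor down to theme_id."""
    names = []
    seen = set()  # guards against an accidental cycle in the data
    current = theme_id
    while current is not None and current not in seen:
        seen.add(current)
        name, parent_id = themes[current]
        names.append(name)
        current = parent_id
    return " > ".join(reversed(names))


# Hypothetical ids: theme id -> (name, parent id)
themes = {1: ("Racers", None), 2: ("Ferrari", 1)}
print(theme_path(2, themes))  # Racers > Ferrari
```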

Color:

number of rows: 268, number of columns: 4, types of data in the different columns: color id/ color name/ RGB value/ translucent

Lego colors are identified by an id. In the case of black, the data would look like 0/Black/05131D/false.
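The RGB column is a six-digit hexadecimal string, so it can be split into red, green, and blue components. A minimal sketch, using the black value quoted above (the helper name `hex_to_rgb` is illustrative):

```python
def hex_to_rgb(hex_code):
    """Convert a six-digit hex string such as '05131D' to an (r, g, b) tuple."""
    value = int(hex_code, 16)
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


print(hex_to_rgb("05131D"))  # (5, 19, 29)
```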

Parts categories:

number of rows: 69, number of columns: 2, types of data in the different columns: id/ name

Part categories are used to categorize parts based on their properties. For example, all bricks fall under category 11 no matter their size or color, so the data in this case would be 11/brick.

Parts:

number of rows: 54498, number of columns: 4, types of data in the different columns: part number/ name/ part category/ part material

Similar to elements, each part has its own unique data. For example, a 1 by 2 brick has the data 11211pr0001/1 x 2 brick/11/plastic.

Parts relationships:

number of rows: 30989, number of columns: 3, types of data in the different columns: relationship type/ child part/ parent part

Parts are grouped together based on relationships and similarities that are not close enough for them to be considered part of the same category. For example, we can look at R/98653pr0003/98086pr0003, the parent part being the head of a pterodactyl and the child part being the body. The two parts are in different categories but share similarities. However, I was unable to determine how the relationship type is chosen.
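One way to start investigating the relationship type is to count how often each type occurs and keep one example pair per type for manual inspection. This is only a sketch: the first row is the one quoted above, while the "P" rows and their part numbers are made-up placeholders.

```python
from collections import Counter


def summarise_relationships(rows):
    """Count rows per relationship type and keep the first (child, parent) pair seen."""
    counts = Counter(rel_type for rel_type, _child, _parent in rows)
    examples = {}
    for rel_type, child, parent in rows:
        examples.setdefault(rel_type, (child, parent))
    return counts, examples


rows = [
    ("R", "98653pr0003", "98086pr0003"),        # row quoted in the comment above
    ("P", "demo-child-1", "demo-parent-1"),     # hypothetical sample row
    ("P", "demo-child-2", "demo-parent-1"),     # hypothetical sample row
]
counts, examples = summarise_relationships(rows)
print(counts["P"], examples["R"])
```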

Sets:

number of rows: 22680, number of columns: 6, types of data in the different columns: set number/ name/ year/ theme/ number of parts/ URL of image

Minifigs:

number of rows: 14356, number of columns: 4, types of data in the different columns: fig number/ name/ number of parts/ URL

Inventories:

number of rows: 38738, number of columns: 3, types of data in the different columns: id/ version/ set number

Inventories are the versions of a Lego set.

Inventory minifigs:

number of rows: 21804, number of columns: 4, types of data in the different columns: inventory/ set by inventory/ fig number/ quantity of figs

Inventory minifigs are used to determine the number of minifigures in the set.

Inventory parts:

number of rows: 1048576 (this equals the maximum row count of an Excel worksheet, so the file may contain more rows than were displayed), number of columns: 6, types of data in the different columns: inventory/ part number/ color id/ quantity/ spare/ URL

The inventory parts file gives a detailed overview of each piece in an inventory. A few image URLs are missing, which I find odd, as no other data sheet has missing URLs.
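The count of missing image URLs can be made exact by streaming the rows rather than viewing them in a spreadsheet, which also avoids the 1048576-row worksheet limit in Excel. A minimal sketch follows; the column names (including `img_url`) and the sample rows are assumptions for illustration.

```python
import csv
import io


def count_missing(lines, column):
    """Return (total rows, rows where `column` is empty) for a CSV with a header."""
    reader = csv.DictReader(lines)
    total = 0
    missing = 0
    for row in reader:
        total += 1
        if not row[column]:
            missing += 1
    return total, missing


# Hypothetical sample with assumed column names; not real Rebrickable rows.
sample = io.StringIO(
    "inventory_id,part_num,color_id,quantity,is_spare,img_url\n"
    "1,demo-a,0,4,f,https://example.invalid/a.jpg\n"
    "1,demo-b,0,2,f,\n"
)
print(count_missing(sample, "img_url"))  # (2, 1)
```

The same function accepts any iterable of text lines, so an open file handle for the expanded CSV could be passed in place of the sample.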

Inventory sets:

number of rows: 4432, number of columns: 3, types of data in the different columns: inventory/ set number/ quantity

Inventory sets shows the total number of versions each set has.

Reviewing model:

Reviewing the model/diagram after analyzing the data, I realize it resembles a node graph in which the data is interconnected. What I mean is that each node has data from another node: for example, the set node uses data from themes, while the inventory and inventory sets nodes both use data from the set node. Additionally, parts and inventory are connected through the intermediary of inventory parts, which uses both inventory and parts in its data. Another observation is that the diagram is a bit outdated, as the Element node does not have a field for design id, unlike the data file. Finally, I noticed that some nodes have their own id (Inventories, Minifigs, Sets, Parts, Parts categories, Color, Elements, Themes) while others do not (Inventory sets, Inventory parts, Inventory minifigs, Parts relationships). The correlation is that the nodes with their own ids are used as data in other nodes, except for Elements, while the nodes without ids stand alone and are not referenced by other nodes.
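The chain theme, set, inventory, inventory parts, parts can be traced in code as a series of lookups. This is a rough sketch under stated assumptions: every id, name, and quantity below is a made-up sample row, and the field names are guesses at how the columns might be labelled.

```python
# All rows below are hypothetical samples, not Rebrickable data.
themes = {10: {"name": "Sample theme", "parent_id": None}}
sets = {"demo-1": {"name": "Sample set", "theme_id": 10}}
inventories = {500: {"version": 1, "set_num": "demo-1"}}
parts = {"p-a": "Sample brick", "p-b": "Sample plate"}
inventory_parts = [
    {"inventory_id": 500, "part_num": "p-a", "color_id": 0, "quantity": 4},
    {"inventory_id": 500, "part_num": "p-b", "color_id": 0, "quantity": 2},
]


def parts_for_set(set_num):
    """Follow set -> inventory -> inventory parts -> parts for one set."""
    set_row = sets[set_num]
    theme_name = themes[set_row["theme_id"]]["name"]
    # A set can have several inventories (versions), so collect all of their ids.
    inventory_ids = {
        inv_id for inv_id, inv in inventories.items() if inv["set_num"] == set_num
    }
    lines = []
    for row in inventory_parts:
        if row["inventory_id"] in inventory_ids:
            lines.append(
                (theme_name, set_row["name"], parts[row["part_num"]], row["quantity"])
            )
    return lines


for line in parts_for_set("demo-1"):
    print(line)
```

Inventory parts acts as the link table here: it carries no id of its own and only points at an inventory and a part, which matches the observation that the nodes without ids are not referenced by other nodes.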