falcontechnologies / wex-2024

Work experience 2024
BSD 2-Clause "Simplified" License

Review the Rebrickable data set - Bolu #1

Closed by psftc 2 weeks ago

psftc commented 2 weeks ago

The data set used comes from the Rebrickable website, on the Rebrickable Downloads page. This issue has several steps to complete:

  1. Go to the downloads page above and identify the data files - we are interested only in the relational data and not the images. There should be 12 files, each of which ends with the file extension of '.gz'. Note the date of the files.
  2. Download each of these files manually to your local machine.
  3. Expand each file and review the data. You should look for the following things: header, number of columns, number of rows, types of data in the different columns, which columns are missing data
  4. The download page has a model shown in a diagram. Review this given the data you've observed. When reviewing, consider mentally connecting theme, set, inventory, parts.
  5. In this ticket, add several paragraphs about your observations which may include additional questions.

Boom244 commented 2 weeks ago

Analysis of Rebrickable dataset

Data Layout

As indicated on Rebrickable's downloads page, the data in these files is laid out as a node graph, with each step down the graph giving more specific data.

For example, to look up the LEGO set that inventory 199628 belongs to, you would search the id column of inventories.csv and find that the corresponding set_num for 199628 is 42605-1.

To find out more about this set, you'd head down the graph to sets.csv. Looking up 42605-1 in the set_num column shows that its name is "Mars Space Base and Rocket" and that it has 981 parts, and gives a link to an image of the built set on Rebrickable's CDN.

We can also see the individual parts in this set by going deeper into the graph, to inventory_parts.csv, and searching for inventory 199628 in the inventory_id column. (For any other set, you could also go back up to inventories.csv and search for the set number in the set_num column to see whether more than one inventory belongs to it, but 42605-1 has only one.) A total of 382 parts belong to this inventory. Let's zero in on part_num 10106055.

Traversing the graph further, we can drop into parts.csv and search for part_num 10106055 to find its name, "Sticker Sheet for Set 42605-1". Lastly, we can drop into the final node, part_categories.csv, and look up the part_cat_id of 10106055 in that file's id column to find the part category, "Stickers".
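The walk described above can be sketched in Python. This is only a minimal illustration, not the method used for the review: the helper names (load, find, trace_inventory) are made up for this sketch, the inventory file is assumed to be named inventories.csv, and the in-memory SAMPLE stands in for the real files with just the values quoted above. The part_cat_id value is a placeholder, since the real id is not given here.

```python
import csv
import io

# Tiny stand-ins for the real files, holding only the values quoted above.
# The part_cat_id "0" is a placeholder, not the real category id.
SAMPLE = {
    "inventories.csv": "id,set_num\n199628,42605-1\n",
    "sets.csv": "set_num,name,num_parts\n42605-1,Mars Space Base and Rocket,981\n",
    "inventory_parts.csv": "inventory_id,part_num\n199628,10106055\n",
    "parts.csv": "part_num,name,part_cat_id\n10106055,Sticker Sheet for Set 42605-1,0\n",
    "part_categories.csv": "id,name\n0,Stickers\n",
}


def load(name):
    """Parse one CSV (here taken from SAMPLE) into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(SAMPLE[name])))


def find(rows, column, value):
    """Return every row whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]


def trace_inventory(inventory_id):
    """Follow inventory -> set -> inventory parts -> part -> category."""
    inventory = find(load("inventories.csv"), "id", inventory_id)[0]
    lego_set = find(load("sets.csv"), "set_num", inventory["set_num"])[0]
    parts = load("parts.csv")
    categories = load("part_categories.csv")
    result = []
    for line in find(load("inventory_parts.csv"), "inventory_id", inventory_id):
        part = find(parts, "part_num", line["part_num"])[0]
        category = find(categories, "id", part["part_cat_id"])[0]
        result.append((part["name"], category["name"]))
    return lego_set["name"], result


set_name, parts_found = trace_inventory("199628")
print(set_name)
for part_name, category_name in parts_found:
    print(f"  {part_name} [{category_name}]")
```

To run this against the real data, load would need to read the expanded CSV files from disk instead of SAMPLE; everything else stays the same.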

Empty Columns

There are several instances of empty columns within the dataset. I wrote a quick analyzer in Python to index these empty columns.
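The analyzer itself is not reproduced in this comment. A minimal sketch of how such a check might look, assuming each file has a header row and that a cell counts as empty when it contains nothing but whitespace, is below. The function names and the demo column names (col_a, col_b, col_c) are invented for illustration and are not taken from the dataset.

```python
import csv
import gzip
import io


def empty_cell_counts(handle):
    """Return (row_count, {column: empty_cell_count}) for one CSV stream."""
    reader = csv.DictReader(handle)
    counts = {name: 0 for name in reader.fieldnames or []}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            # Short rows yield None for missing cells; treat those as empty too.
            if not (row.get(name) or "").strip():
                counts[name] += 1
    return total, counts


def scan_gz(path):
    """Run the same count directly on a downloaded .gz file (path is up to you)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return empty_cell_counts(handle)


# Made-up demo data, only to show the shape of the result.
demo = io.StringIO("col_a,col_b,col_c\n1,x,\n2,y,z\n3,,\n")
total, counts = empty_cell_counts(demo)
for name, empty in counts.items():
    print(f"{name}: {empty} of {total} rows empty")
```

A column that is empty in every row shows up with a count equal to the row total, which separates fully empty columns from ones that are only sparsely filled.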

Data categorization

The data in each of these CSV files falls into one of four kinds: strings (primarily for names and set IDs), enum-likes, numbers, or booleans. One that caught my attention was the rel_type column in part_relationships.csv, where P stands for Print, R for Pair, B for Subpart, M for Mold, T for Pattern, and A for Alternate. This kind of abbreviation is not used in parts.csv, where materials are listed in full, despite there being more materials than possible part relationships.
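When reading part_relationships.csv in code, the single-letter codes can be expanded with a small lookup table. The sketch below uses only the six codes listed above; the names REL_TYPE_NAMES and describe_rel_type are invented for this example, and any code not listed is reported as unknown instead of being guessed.

```python
# Codes and meanings as listed in the paragraph above.
REL_TYPE_NAMES = {
    "P": "Print",
    "R": "Pair",
    "B": "Subpart",
    "M": "Mold",
    "T": "Pattern",
    "A": "Alternate",
}


def describe_rel_type(code):
    """Expand a rel_type code, flagging anything outside the known six."""
    return REL_TYPE_NAMES.get(code, f"Unknown ({code})")


for code in ["P", "M", "Z"]:
    print(code, "->", describe_rel_type(code))
```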

Miscellaneous Observations

Lastly, Questions