Analysis of Rebrickable dataset

Data Layout

As indicated on Rebrickable's downloads page, the data in these files is laid out as a node graph, with each node holding more specific data than the one before it.

For example, if you wanted to look up the LEGO set that inventory 199628 belongs to, you would search the id column of inventory.csv and find that 199628's corresponding set_num is 42605-1.

Supposing you wanted to find out more about this set, you'd head down the graph to sets.csv. Searching the set_num column for 42605-1 shows that the set is named "Mars Space Base and Rocket" and has 981 parts, and the row also includes a link to an image of the built set on Rebrickable's CDN.

Going deeper into the graph, we can see the individual parts in this set by opening inventory_parts.csv and searching for inventory 199628 in the inventory_id column. (For another set, you could first go back up to inventory.csv and search for its set number in the set_num column to check whether more than one inventory belongs to it, but 42605-1 has only one.) A total of 382 part entries belong to this inventory. Let's zero in on part_num 10106055.

Traversing the graph further, we can drop into parts.csv and search for part_num 10106055 to find its name, "Sticker Sheet for Set 42605-1". Finally, we can drop into the last node, part_categories.csv, and look up 10106055's part_cat_id in that file's id column to find the part category, "Stickers".
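The walk above can be sketched in Python. This is a minimal illustration, not the analyzer mentioned later: the helper names (load_table, find_row, trace_part) are hypothetical, the file names follow the ones used in this write-up, and the "name" column in sets.csv, parts.csv, and part_categories.csv is assumed.

```python
import csv


def load_table(path, encoding="utf-8"):
    """Read one CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


def find_row(rows, column, value):
    """Return the first row whose `column` equals `value`, or None."""
    return next((row for row in rows if row[column] == value), None)


def trace_part(tables, inventory_id, part_num):
    """Follow inventory -> set -> inventory part -> part -> category.

    `tables` maps a table name to its rows. All values are strings,
    because the csv module does not convert types.
    """
    inventory = find_row(tables["inventory"], "id", inventory_id)
    if inventory is None:
        return None
    set_row = find_row(tables["sets"], "set_num", inventory["set_num"])
    part = find_row(tables["parts"], "part_num", part_num)
    category = None
    if part is not None:
        category = find_row(tables["part_categories"], "id", part["part_cat_id"])
    # Check that the part is actually listed under this inventory.
    listed = any(
        row["inventory_id"] == inventory_id and row["part_num"] == part_num
        for row in tables["inventory_parts"]
    )
    return {
        "set_num": inventory["set_num"],
        "set_name": set_row["name"] if set_row else None,
        "part_in_inventory": listed,
        "part_name": part["name"] if part else None,
        "category": category["name"] if category else None,
    }


# Example call, assuming the CSVs sit in the working directory:
# names = ("inventory", "sets", "inventory_parts", "parts", "part_categories")
# tables = {name: load_table(f"{name}.csv") for name in names}
# print(trace_part(tables, "199628", "10106055"))
```

Loading every table into memory keeps the sketch short; for the ~1.2 million rows of inventory_parts.csv, streaming the file or building a dict index on inventory_id would likely be faster.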

Empty Columns

There are several instances of empty columns within the dataset. I wrote a quick analyzer in Python to index these empty columns.

Of the ~1.2 million rows in inventory_parts.csv, ~7,800 of them are missing img_url entries, likely due to gaps in the sourcing for images of every catalogued part.
Of the ~89,000 rows in elements.csv, ~18,600 are missing design_id entries.
themes.csv is missing 137 out of 460 parent_id entries.
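The original analyzer is not shown here, so the following is only a sketch of how such a count could be done. The function name count_empty_cells is hypothetical, and a cell is treated as "empty" when it is missing or contains only whitespace.

```python
import csv
from collections import Counter


def count_empty_cells(lines):
    """Count rows and, per column, how many cells are empty.

    `lines` is any iterable of text lines, such as an open file.
    Returns (row_count, {column_name: empty_cell_count}).
    """
    reader = csv.DictReader(lines)
    empty = Counter()
    total = 0
    for row in reader:
        total += 1
        for column, value in row.items():
            if column is None:
                # Extra cells beyond the header; not attributable to a column.
                continue
            if value is None or value.strip() == "":
                empty[column] += 1
    return total, dict(empty)


# Example call, assuming themes.csv is in the working directory:
# with open("themes.csv", newline="", encoding="utf-8") as handle:
#     print(count_empty_cells(handle))
```

Columns with no empty cells are simply absent from the returned dict.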

Data Categorization

The data in each of these CSV files falls into one of four kinds: strings (primarily for names and set IDs), enum-likes, numbers, or booleans. One that caught my attention was the encoding of rel_type values in part_relationships.csv, where P stands for Print, R for Pair, B for Subpart, M for Mold, T for Pattern, and A for Alternate. This abbreviation scheme is not used in parts.csv, where materials are listed in full, even though there are more materials than possible part relationships.
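When reading part_relationships.csv in code, the single-letter codes can be mapped back to readable names. A minimal sketch, with the mapping taken from the list above; the RelType class and describe_rel_type helper are hypothetical names.

```python
from enum import Enum


class RelType(Enum):
    """Single-letter relationship codes as described in this write-up."""
    PRINT = "P"
    PAIR = "R"
    SUBPART = "B"
    MOLD = "M"
    PATTERN = "T"
    ALTERNATE = "A"


def describe_rel_type(code):
    """Return a readable name for a code, or flag it as unknown."""
    try:
        return RelType(code).name.title()
    except ValueError:
        return f"Unknown ({code})"
```

An Enum makes an unexpected code fail loudly at lookup time, which helps if the dataset ever gains a new relationship type.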

Miscellaneous Observations

The dataset consists of 12 CSV files. The longest is inventory_parts.csv, with ~1.2 million lines, and the shortest is colors.csv, with 268 lines.
The files were last modified on May 3rd, 2024, at 8:24 AM.
There are ~33,500 part inventories catalogued in inventory_parts.csv, ~8,700 in inventory_minifigs.csv, and ~1,200 in inventory_sets.csv.
There are seven types of materials listed in parts.csv: Plastic, Rubber, Cardboard/Paper, Cloth, Flexible Plastic, Metal, and Foam.
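Counts like the ones above come down to collecting the distinct values in one column. A small sketch of that idea follows; distinct_values is a hypothetical helper, and the column name part_material for the materials in parts.csv is an assumption, since this write-up does not name it.

```python
import csv


def distinct_values(lines, column):
    """Return the set of non-empty values found in `column`.

    `lines` is any iterable of text lines, such as an open file.
    """
    return {row[column] for row in csv.DictReader(lines) if row[column]}


# Example calls, assuming the CSVs are in the working directory:
# with open("inventory_parts.csv", newline="", encoding="utf-8") as handle:
#     print(len(distinct_values(handle, "inventory_id")))
# with open("parts.csv", newline="", encoding="utf-8") as handle:
#     print(sorted(distinct_values(handle, "part_material")))  # assumed column
```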

Questions

I have no hypothesis as to why elements.csv is missing that many design_ids or why themes.csv is missing that many parent_ids. If I could find out, I would be able to either work around this in the future or interpolate the data so that the gaps do not matter.
Are these CSVs saved in an encoding other than ASCII or UTF-8? I had to switch to cp850 while writing my analyzer in order for Python to read the CSVs correctly.
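One way to investigate the encoding question is to try decoding the raw bytes with a few candidates in order. The sketch below uses a hypothetical helper, first_working_encoding, and an assumed candidate list. Note that a single-byte code page such as cp850 accepts any byte sequence, so a successful cp850 decode does not prove the file was written in cp850. It may also be relevant that Python's open() falls back to the platform's locale encoding when no encoding argument is passed, so an explicit encoding="utf-8" is worth testing first; whether that explains the issue here is unconfirmed.

```python
def first_working_encoding(data, candidates=("ascii", "utf-8", "utf-8-sig", "cp850")):
    """Return the first candidate that decodes `data` without error.

    `data` is the raw file content as bytes. Candidates are tried in
    order, strictest first, because permissive single-byte encodings
    such as cp850 will decode almost anything.
    """
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


# Example call, assuming sets.csv is in the working directory:
# with open("sets.csv", "rb") as handle:
#     print(first_working_encoding(handle.read()))
```

If this reports "utf-8" for every file, the original read errors were more likely caused by the default encoding than by the files themselves.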

Review the Rebrickable data set - Bolu #1