Closed psftc closed 2 weeks ago
As indicated on Rebrickable's downloads page, the data in these files is laid out as a node graph, ordered by increasing data specificity.
For example, if you wanted to look up the LEGO set that inventory 199628 belongs to, you would consult column id of inventory.csv, and find that 199628's corresponding set_num is 42605-1.
Supposing you wanted to find out more about this set, you'd head down the graph to sets.csv. Consulting the set_num column for 42605-1 shows that its name is "Mars Space Base and Rocket" and that it has 981 parts, and gives a link to an image of the built set on Rebrickable's CDN.
Going deeper into the graph, we can also see the individual parts in this set by opening inventory_parts.csv and searching for inventory 199628 in the inventory_id column. (If this were any other set, you could also go back up to inventory.csv and search for the set number in the set_num column to see whether more than one inventory belongs to it, but 42605-1 has only one.) There are a total of 382 parts that belong to this inventory. Let's zero in on part_num 10106055.
Traversing the graph further, we can drop into parts.csv, where we can search for part_num 10106055, finding out its name, "Sticker Sheet for Set 42605-1". Lastly, we can drop into our last node, part_categories.csv, and use 10106055's part_cat_id in the node's id column to find the part category, "Stickers".
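The lookup chain above can be sketched in a few lines of Python. This is only an illustration, not the analyzer mentioned in this issue: the tables are tiny in-memory stand-ins that contain just the columns named in the walkthrough, the helper names (`rows`, `find`, `trace_inventory`) are hypothetical, and the `part_cat_id` value 58 is a placeholder rather than a value taken from the dataset.

```python
import csv
import io

# In-memory stand-ins for the CSV files, reduced to the columns named above.
# The part_cat_id value (58) is a placeholder, not a value from the dataset.
INVENTORY = "id,set_num\n199628,42605-1\n"
SETS = "set_num,name,num_parts\n42605-1,Mars Space Base and Rocket,981\n"
INVENTORY_PARTS = "inventory_id,part_num\n199628,10106055\n"
PARTS = "part_num,name,part_cat_id\n10106055,Sticker Sheet for Set 42605-1,58\n"
PART_CATEGORIES = "id,name\n58,Stickers\n"


def rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def find(table, column, value):
    """Return every row whose `column` equals `value` (values are strings)."""
    return [row for row in table if row[column] == value]


def trace_inventory(inventory_id):
    """Follow one inventory down the graph to its set, parts and categories."""
    inventory = find(rows(INVENTORY), "id", inventory_id)[0]
    lego_set = find(rows(SETS), "set_num", inventory["set_num"])[0]
    parts = rows(PARTS)
    categories = rows(PART_CATEGORIES)
    result = []
    for entry in find(rows(INVENTORY_PARTS), "inventory_id", inventory_id):
        part = find(parts, "part_num", entry["part_num"])[0]
        category = find(categories, "id", part["part_cat_id"])[0]
        result.append((lego_set["name"], part["name"], category["name"]))
    return result


if __name__ == "__main__":
    for set_name, part_name, category_name in trace_inventory("199628"):
        print(set_name, "|", part_name, "|", category_name)
```

Every hop is the same operation: take a key from the current row and search for it in the key column of the next file. On the full files, building a dict per key column would avoid rescanning each table for every lookup.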
There are several instances of empty columns within the dataset. I wrote a quick analyzer in Python to index these empty columns.
- inventory_parts.csv: ~7,800 rows are missing img_url entries, likely due to gaps in the sourcing for images of every catalogued part.
- elements.csv: of its ~89,000 rows, ~18,600 are missing design_id entries.
- themes.csv: 137 out of 460 parent_id entries are missing.

The data in each of these CSV files is categorized into either strings (primarily for names and Set IDs), enum-likes, numbers, or booleans. One of these that caught my attention was the categorization of rel_types in part_relationships.csv, where P stands for Print, R for Pair, B for Subpart, M for Mold, T for Pattern, and A for Alternate. This abbreviation scheme is not used in parts.csv, where materials are listed in full, despite there being more materials than possible part relationships.
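As a small illustration of the enum-like codes, the single-letter relationship types can be expanded with a plain lookup table. The mapping below is copied from the description above; the function names and the handling of unknown codes are assumptions made for this sketch.

```python
from collections import Counter

# Single-letter relationship codes as described above.
REL_TYPE_NAMES = {
    "P": "Print",
    "R": "Pair",
    "B": "Subpart",
    "M": "Mold",
    "T": "Pattern",
    "A": "Alternate",
}


def describe_rel_type(code):
    """Expand a one-letter code; unknown codes are flagged, not guessed."""
    return REL_TYPE_NAMES.get(code, f"Unknown ({code})")


def count_rel_types(codes):
    """Tally an iterable of codes by their expanded names."""
    return Counter(describe_rel_type(code) for code in codes)


if __name__ == "__main__":
    # Made-up sample codes, not rows from part_relationships.csv.
    print(count_rel_types(["P", "P", "A", "M"]))
```

Keeping the mapping in one dict means a code that is not listed shows up as "Unknown (...)" in the tally instead of being silently dropped.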
- The longest file is inventory_parts.csv with 1.2 million lines, and the shortest is colors.csv with 268 lines.
- […] inventory_parts.csv, ~8,700 in inventory_minifigs.csv, and ~1,200 in inventory_sets.csv.
- The materials listed in parts.csv: Plastic, Rubber, Cardboard/Paper, Cloth, Flexible Plastic, Metal, and Foam.
- I do not know why elements.csv is missing that many design_ids or why themes.csv is missing that many parent_ids. If I could find out, I would be able to either work around this in the future or interpolate the data so that the gaps do not matter.
- I had to use the cp850 encoding while writing my analyzer in order for Python to read the CSVs correctly.
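For reference, counting empty cells per column can be sketched as follows. This is not the analyzer described in this issue, only a minimal version of the same idea: the function names are hypothetical, the sample rows are made up, and the `cp850` default simply mirrors the encoding mentioned above.

```python
import csv
import io
from collections import Counter


def count_empty_cells(lines):
    """Return (row_count, {column: empty_cell_count}) for CSV lines."""
    reader = csv.DictReader(lines)
    empty = Counter({name: 0 for name in reader.fieldnames or []})
    total = 0
    for row in reader:
        total += 1
        for name, value in row.items():
            # Skip surplus cells (keyed None); treat missing cells as empty.
            if name is not None and not (value or "").strip():
                empty[name] += 1
    return total, dict(empty)


def count_empty_cells_in_file(path, encoding="cp850"):
    """Open a CSV file with an explicit encoding and count its empty cells."""
    # newline="" follows the csv module's advice for opening files.
    with open(path, newline="", encoding=encoding) as handle:
        return count_empty_cells(handle)


# Made-up rows shaped like themes.csv (id, name, parent_id).
SAMPLE = "id,name,parent_id\n1,Theme A,\n2,Theme B,1\n3,Theme C,\n"

if __name__ == "__main__":
    total, empty = count_empty_cells(io.StringIO(SAMPLE))
    print(total, empty)
```

Passing the encoding as a parameter makes it easy to try another codec on the same file and compare the results.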
The data set used is from the Rebrickable website (Rebrickable Downloads). This issue has several steps to complete