Analyzing the size of Homebrew formulae bottles
The Homebrew formula JSON API does not provide package size information for bottles[^def_bottle].
I aim to retrieve package sizes regularly in order to build a database of (package, version, bottle_arch) -> size
pairs for future analysis.
This analysis could capture:
This is currently mostly an experiment in using simple CLI tools like Make and curl (along with -j, xargs, parallel, fd, ripgrep, etc.) to do some data engineering and data science with the useful implications above.

    make formula.json # get the data file
    make urls # split it out
    make sizes # get the sizes
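
The "split it out" step could look roughly like the sketch below. It assumes jq is installed (jq is not one of the tools listed above) and that formula.json is an array of formula objects carrying `name`, `versions.stable`, and `bottle.stable.files.<arch>.url`, which is my understanding of the API layout. The function name `extract_triplets` and the TSV output format are made up for illustration.

```shell
#!/bin/sh
# Sketch only: turn formula.json (on stdin) into tab-separated lines of
#   package <TAB> version <TAB> bottle_arch <TAB> url
# Assumed field names: name, versions.stable, bottle.stable.files.<arch>.url
extract_triplets() {
  jq -r '
    .[]
    | .name as $name
    | .versions.stable as $version
    | (.bottle.stable.files // {})   # formulae without bottles yield nothing
    | to_entries[]
    | [$name, $version, .key, .value.url]
    | @tsv
  '
}

# Usage (not run here):
#   extract_triplets < formula.json > triplets.tsv
```

Each output line maps to one URL file in the next step, so a formula with five bottle architectures produces five lines.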

    stateDiagram-v2
        Formulajson : Homebrew API \n formula.json
        Urls : One URL file per URL
        Database : Database (Unspecified Format)
        [*] --> Formulajson : retrieve latest database
        Formulajson --> Urls : extract triplets, write URL files
        state fork_state <<fork>>
        state join_state <<join>>
        Urls --> fork_state : list URL files
        fork_state --> HTTPRequest1 : retrieve package size
        fork_state --> HTTPRequest2 : through HEAD requests
        fork_state --> HTTPRequestN : to all URLs in files
        HTTPRequest1 --> Sizes1 : write size file
        HTTPRequest2 --> Sizes2 : write size file
        HTTPRequestN --> SizesN : write size file
        Sizes1 --> join_state
        Sizes2 --> join_state
        SizesN --> join_state
        join_state --> Database
        Database --> [*]
        note left of fork_state
            One size file per retrieved URL
        end note

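
The "retrieve package size through HEAD requests" boxes in the diagram boil down to reading a Content-Length header. Here is a minimal sketch of that idea, not necessarily what the Makefile does. The function names `parse_content_length` and `head_size` are hypothetical, and any authentication headers the registry may require are left out.

```shell
#!/bin/sh
# Print the last Content-Length value found in HTTP response headers on stdin.
# When curl follows redirects it prints one header block per hop, so the last
# value seen belongs to the final response.
parse_content_length() {
  tr -d '\r' |
    awk -F': *' '
      tolower($1) == "content-length" { size = $2 }
      END { if (size != "") print size }
    '
}

# Hypothetical wrapper: HEAD request for one URL, size in bytes on stdout.
# Prints nothing if the server sends no Content-Length header.
head_size() {
  curl --silent --show-error --location --head "$1" | parse_content_length
}

# Usage (not run here):
#   head_size "$(cat some-file.url)" > some-file.size
```

Splitting the parsing from the request keeps the header handling checkable offline with canned headers.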
A full run takes around 80 minutes for me, running two requests at a time so as not to trigger some kind of rate limit at my ISP level[^not_ghcr].
You can check the counts of URL and size files by running something like this:

    fd .url data | wc -l
    fd .size data | wc -l

If the numbers are the same, you've got the data for the current formula.json.
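
If the numbers differ, it helps to see which URLs are still missing a size. Here is a small sketch, assuming (hypothetically) that each foo.url file gets a sibling foo.size file in the same directory. It uses find rather than fd so it runs on a stock system, and `missing_sizes` is a made-up name.

```shell
#!/bin/sh
# List every .url file under directory $1 that has no matching .size file.
# Assumed layout: foo.url and foo.size live side by side.
missing_sizes() {
  find "$1" -type f -name '*.url' | while IFS= read -r url_file; do
    size_file="${url_file%.url}.size"
    [ -f "$size_file" ] || printf '%s\n' "$url_file"
  done
}

# Usage (not run here):
#   missing_sizes data | wc -l    # 0 means every URL has a size
```

An empty result means the same thing as matching counts, with the bonus that a non-empty one tells you exactly which requests to retry.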
[^def_bottle]: A bottle is a pre-packaged archive of a formula available in Homebrew. See https://docs.brew.sh/Bottles for more information.
[^not_ghcr]: It's not ghcr.io rate-limiting me. My gateway is working fine, but my ISP drops the upstream connection. It's probably some kind of DDoS protection at the DNS level. See notes.txt for ways I might get around this, since curl does a DNS lookup every time it launches.