colindean / homebrew-size-analysis

Analyzing the size of Homebrew formulae bottles
The Unlicense

Homebrew Bottle Size Analysis


The Homebrew formula JSON API does not provide package size information for bottles[^def_bottle]. I aim to retrieve package sizes regularly in order to build a database of (package, version, bottle_arch) -> size pairs for future analysis. This analysis could capture:

For now, this is mostly an experiment in using simple CLI tools like Make and curl to do some data engineering and data science with the useful implications above.
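
To make the (package, version, bottle_arch) triplets concrete, here is a minimal sketch that lists them, along with each bottle URL, from a local formula.json. It assumes jq is installed and that each entry carries `name`, `versions.stable`, and `bottle.stable.files.<arch>.url`. That layout is an assumption about the API output, and `bottle_triplets` is a made-up name rather than anything in this repository.

```sh
# bottle_triplets FILE
# Print one tab-separated line per bottle: name, version, arch, url.
# Assumed layout: .[].bottle.stable.files.<arch>.url (not confirmed here).
# Formulae without bottles produce no output.
bottle_triplets() {
  jq -r '
    .[]
    | .name as $name
    | .versions.stable as $version
    | (.bottle.stable.files // {})
    | to_entries[]
    | [$name, $version, .key, .value.url]
    | @tsv
  ' "$1"
}
```

Running `bottle_triplets formula.json | head` should give a quick look at the rows, if the layout assumption holds.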

Current principles


make formula.json  # get the data file
make urls          # split it out
make sizes         # get the sizes


   Formulajson : Homebrew API \n formula.json
   Urls : One URL file per URL
   Database : Database (Unspecified Format)

    [*] --> Formulajson : retrieve latest database
    Formulajson --> Urls : extract triplets, write URL files
    state fork_state <<fork>>

    state join_state <<join>>
    Urls --> fork_state : list URL files
    fork_state --> HTTPRequest1 : retrieve package size
    fork_state --> HTTPRequest2 : through HEAD requests
    fork_state --> HTTPRequestN : to all URLs in files
    HTTPRequest1 --> Sizes1 : write size file
    HTTPRequest2 --> Sizes2 : write size file
    HTTPRequestN --> SizesN : write size file
    Sizes1 --> join_state
    Sizes2 --> join_state
    SizesN --> join_state
    join_state --> Database
    Database --> [*]

    note left of fork_state
      One size file per retrieved URL
    end note
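
The "retrieve package size through HEAD requests" step in the diagram could be sketched roughly as below. This assumes the server reports the archive size in a `Content-Length` header on the final response after redirects. `content_length` and `bottle_size` are illustrative names, not targets or scripts from this repository.

```sh
# content_length
# Read HTTP response headers on stdin and print the last Content-Length
# value seen. When redirects are followed, the last header block belongs
# to the final response, so the last value wins.
content_length() {
  tr -d '\r' |
    awk -F': *' '
      tolower($1) == "content-length" { size = $2 }
      END { if (size != "") print size }
    '
}

# bottle_size URL
# Hypothetical helper: send a HEAD request, follow redirects, print the size.
# Some hosts may require extra request headers; that is left out here.
bottle_size() {
  curl --silent --head --location "$1" | content_length
}
```

Keeping the header parsing separate from the network call makes it easy to check the parsing against saved headers without sending any requests.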

Performance notes

A full run takes around 80 minutes for me at two requests at a time. I keep the concurrency that low so as not to trigger some kind of rate limit at my ISP level[^not_ghcr].
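
One way to cap the work at two requests at a time is `xargs -P`. The sketch below assumes one URL per `.url` file, with the result written to a sibling `.size` file. `fetch_sizes` is a hypothetical name, and the command that turns a URL into a size is passed in by the caller rather than assumed.

```sh
# fetch_sizes DIR JOBS CMD [ARGS...]
# For each DIR/**/*.url file, run CMD [ARGS...] URL and store its output
# in the matching .size file, with at most JOBS commands running at once.
fetch_sizes() {
  dir=$1 jobs=$2
  shift 2
  find "$dir" -type f -name '*.url' -print0 |
    xargs -0 -P "$jobs" -I '{}' sh -c '
      url_file=$1; shift
      size_file=${url_file%.url}.size
      # Skip files that already have a non-empty size, so reruns resume.
      [ -s "$size_file" ] && exit 0
      "$@" "$(cat "$url_file")" > "$size_file"
    ' sh '{}' "$@"
}
```

For example, `fetch_sizes data 2 ./head-size` would process the files under `data` two at a time, where `./head-size` stands in for any executable that prints the size for a URL. Because finished files are skipped, an interrupted run can be restarted without repeating requests.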

You can check the counts of URL and size files by running something like this:

fd -e url . data | wc -l
fd -e size . data | wc -l

If the numbers are the same, you've got the data for the current formula.json.
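
When the counts differ, it helps to see which URLs still lack a size. Here is a minimal sketch using plain `find`, assuming each `.size` file sits next to its `.url` file under the same base name. `missing_sizes` is a made-up helper name.

```sh
# missing_sizes DIR
# Print every .url file under DIR that has no matching .size file.
missing_sizes() {
  find "$1" -type f -name '*.url' | sort |
    while IFS= read -r url_file; do
      [ -e "${url_file%.url}.size" ] || printf '%s\n' "$url_file"
    done
}
```

`missing_sizes data | wc -l` should then equal the difference between the two counts above.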

[^def_bottle]: A bottle is a pre-built binary archive of a formula, distributed by Homebrew. See the Homebrew documentation on bottles for more information.

[^not_ghcr]: It's not ghcr.io rate-limiting me. My gateway keeps working fine, but my ISP drops the upstream connection. It's probably some kind of DDoS protection at the DNS level. See notes.txt for ways I might get around this, since curl does a DNS lookup every time it launches.