Grist-Data-Desk / STLoR

Code and methodology to produce the dataset in Grist and High Country News' investigation into state trust lands on reservations
Creative Commons Zero v1.0 Universal

fix: Remove GeoJSON files and zip archives from Git LFS. #14

Closed parkerziegler closed 1 month ago

parkerziegler commented 1 month ago

This PR takes the first step in our quest to reclaim space from Git LFS. All .geojson files and .zip archives in the codebase are under GitHub's 100 MB individual file size limit and thus do not need to be stored using Git LFS. Moving them out of LFS wins us back 778 MB (GeoJSON) + 43 MB (zip) = 821 MB that is no longer stored in LFS and no longer counts against bandwidth limits.

Unfortunately, we still have a battle to wage to reclaim space for .shp (1.5 GB) and .dbf (4.1 GB) files. A handful of files with these extensions have to stay in Git LFS due to their size (>100 MB), and the Git LFS end runs we'll need for these files are a bit wonky. But I do believe there is still an opportunity to reduce our total storage to ~1 GB, as opposed to the ~5.4 GB before this PR.

As a reference, here are the steps you can take to remove files previously tracked by Git LFS and replace their pointers with the true file contents.

git lfs untrack "*.<ext>"
git rm --cached "*.<ext>"
git add .
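
A small sketch for checking the result of the steps above: `git lfs track` records its patterns in `.gitattributes` as lines containing `filter=lfs`, so listing those lines shows which patterns are still routed through Git LFS. The helper name `lfs_patterns` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not in the repo): print every pattern in a
# .gitattributes file that is still handled by the Git LFS filter.
# Defaults to ./.gitattributes when no path is given.
lfs_patterns() {
  awk '/filter=lfs/ { print $1 }' "${1:-.gitattributes}"
}
```

Running `lfs_patterns` from the repository root after untracking `*.geojson` and `*.zip` should no longer print those two patterns, assuming they were only tracked by wildcard.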

parkerziegler commented 1 month ago

FYI @clayton-aldern I'm going to merge this to avoid blocking current work!

clayton-aldern commented 1 month ago

Thanks much!

clayton-aldern commented 1 month ago

Oh, quick question @parkerziegler: when untracking LFS files by extension, should we hold off on untracking .shp / .dbf files right now, given the discussion above, and just untrack .geojson and .zip files?

parkerziegler commented 4 weeks ago

@clayton-aldern Yep! The issue is we have a handful of files of both extension types (.shp, .dbf) that exceed the 100 MB limit. We'll need to untrack the subset of files with those extension types that are <100 MB in size, then ensure that the remaining files stay tracked in Git LFS. Unfortunately, the workflow above doesn't quite work for this edge case, but I wanted to shift to higher-priority things before sinking more time into it.
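
One possible shape for that edge case, offered as an untested sketch rather than the approach the maintainers settled on: replace the wildcard LFS pattern with one explicit pattern per oversized file, then re-add everything. The helper names, the `LIMIT_BYTES` variable, and the choice to read the limit as 100 MiB are all assumptions. The script only prints the Git commands so they can be reviewed before anything is run.

```shell
# Sketch only. Assumption: GitHub's per-file limit is treated as 100 MiB.
LIMIT_BYTES="${LIMIT_BYTES:-$((100 * 1024 * 1024))}"

# List files under directory $1 with extension $2 that are larger than
# LIMIT_BYTES, skipping Git's own metadata directory.
large_files() {
  find "$1" -type f -name "*.$2" -size +"${LIMIT_BYTES}c" ! -path '*/.git/*'
}

# Print (do not execute) the commands that would drop the wildcard LFS
# pattern for extension $2 and keep only the oversized files in LFS.
plan_partial_untrack() {
  local dir="$1" ext="$2" path
  echo "git lfs untrack \"*.${ext}\""
  large_files "$dir" "$ext" | while IFS= read -r path; do
    # Strip a leading "./" so the pattern is relative to the repo root.
    echo "git lfs track \"${path#./}\""
  done
  echo "git rm --cached \"*.${ext}\""
  echo "git add ."
}
```

From the repository root, `plan_partial_untrack . shp` would print the plan for .shp files. The idea is that once the oversized files have their own entries in `.gitattributes`, the final `git add .` stores them as LFS pointers and everything else as regular Git objects.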