Grist-Data-Desk / STLoR

Code and methodology to produce the dataset in Grist and High Country News' investigation into state trust lands on reservations
Creative Commons Zero v1.0 Universal

fix: Resolve #3. Reduce STLoR codebase to the minimum dependencies and code needed to reproduce the public dataset. #8

Closed parkerziegler closed 1 month ago

parkerziegler commented 1 month ago

Buckle in, this is a chunky one. On the plus side, this reduces the amount of code we need to maintain from 9500 LOC in the land_grab_2 directory to 2142 LOC in the stlor directory, a whopping 77.5% decrease! And since this PR also adds formatting and doc strings, both of which tend to add lines, the reduction in actual logic is likely even larger.

Strategy

My core strategy with this PR was to strike a balance between more straightforward changes (dead code elimination, paring down dependencies) and more aggressive, semantics-preserving refactoring. There's undoubtedly more we could do here, but I think this gets us to a substantially more maintainable place.

Checking for Correctness

How do we know these changes preserve the core of the fact-checked dataset? For this, I relied on our new CSV semantic compare functionality, introduced in stlor/compare.py. I generated the output datasets using code from 3ebf200dcf8db2fcf789ed234658feee8da1d503, with 02_SendToActivityMatch.geojson as input. I then checked the output generated by that commit against the output generated by these changes for semantic equivalence by running:

python stlor/compare.py "public_data/04_All States/3ebf200d.csv" "public_data/04_All States/03_ActivityMatch-Original.csv"

Note that I brought in 3ebf200d.csv locally to run the comparison. Additionally, I patched 3ebf200d locally with a bug fix that will be headed upstream to land-grab-2 shortly.
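For readers wondering what "semantic equivalence" between two CSVs can mean in practice, here is a minimal sketch. It is not the code in stlor/compare.py, which isn't reproduced in this description; it only illustrates one plausible definition, assuming two files count as equivalent when they hold the same rows regardless of row order or column order. The names load_rows and semantically_equal are hypothetical.

```python
import csv
from collections import Counter
from typing import Iterable


def load_rows(lines: Iterable[str]) -> Counter:
    """Turn CSV lines into a multiset of rows.

    Each row becomes a frozenset of (column, value) pairs, so neither the
    order of the columns nor the order of the rows affects the result.
    Assumes the first line is a header and that column names are unique.
    """
    reader = csv.reader(lines)
    header = next(reader, [])
    return Counter(
        frozenset(zip(header, (value.strip() for value in row)))
        for row in reader
    )


def semantically_equal(left_path: str, right_path: str) -> bool:
    """Return True if both CSV files contain the same multiset of rows."""
    with open(left_path, newline="", encoding="utf-8") as left, open(
        right_path, newline="", encoding="utf-8"
    ) as right:
        return load_rows(left) == load_rows(right)
```

Using a multiset (Counter) rather than a plain set means duplicated rows still have to match in count, which matters if a refactor accidentally drops or doubles a parcel.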

How to Read this PR

I think it's worthwhile to focus on the code changes in the stlor directory, but not much else. The bulk of the remaining changes are related to data directory restructuring.

Key Changes

Core Code

All core code lives in the stlor directory, which has the following structure:

Function Collapsing and Simplification

land-grab-2 had a very large API surface, with a lot of utility functions overloaded with kwargs to support different use cases. Investigating the call sites of the functions used in the activity matching process, I found that a tremendous amount of this functionality was never used.

Additionally, the "cut points" for functions didn't always make sense; functions would often either do way too much or too little. I strove for fewer functions with tidier encapsulation of behavior. This is largely a matter of style and opinion, and at certain points I decided to just let the existing function boundaries stay as they were. I know, real principled.

Linting and Formatting

This PR integrates Ruff for linting and formatting, which is the state of the art in the Python ecosystem. You can lint and format the codebase like so:

ruff check
ruff format

Alternatively, install the Ruff VSCode extension to enable linting and formatting on save.

Type Hints

This PR also adds Python type hints! This felt like a nice middle ground between full dynamic typing and integrating a more aggressive static type checker like mypy. Particularly in the deep reeds of refactoring, having some sense of the types being passed around by our functions was extremely useful. We can have a larger discussion about integrating a static type checker at another time, but I think this is good enough for now.

Doc Strings

This PR also adds doc strings to our functions, documenting arguments, return values, and side effects.
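As a rough illustration of the style described in the two sections above, here is a small function with type hints and a doc string covering arguments, return value, and side effects. The function itself (filter_min_acreage) and the "acres" key are hypothetical and are not taken from the stlor codebase.

```python
from typing import Any


def filter_min_acreage(
    parcels: list[dict[str, Any]],
    min_acres: float = 10.0,
    acreage_key: str = "acres",  # hypothetical column name
) -> list[dict[str, Any]]:
    """Keep only the parcels larger than a minimum acreage.

    Args:
        parcels: One dict per parcel, each with a numeric acreage value.
        min_acres: Threshold in acres; parcels must be strictly larger.
        acreage_key: Name of the key holding the acreage.

    Returns:
        A new list containing the parcels that pass the threshold.

    Side effects:
        None. The input list and its dicts are not modified.
    """
    return [p for p in parcels if float(p[acreage_key]) > min_acres]
```

Even without a static checker running in CI, annotations like these are picked up by editors for autocomplete and inline warnings.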

Standardizing on pip, venv, and pyproject.toml

Previously, this project had configuration set up for Poetry, but we weren't actually using it. This PR scraps Poetry in favor of a lower-friction standard setup with pip, venv, and pyproject.toml. Users can get our exact dependency versions by running:

pip install .

Additionally, we had a plethora of unnecessary dependencies for this step. I've pared down our dependencies to just the necessary ones for the activity matching process.

Data Directory Restructuring

A large portion of this diff is connected to repository restructuring. Specifically, files previously located at data/stl_dataset/step_2/input/stl_activity_layers are now located at data/stl_activity_layers. I wanted to remove any reference to our "step" architecture from the land-grab-2 repo.

Future Work

This PR started to get big, and now feels like the right time to cut it. But here are a few ideas for next steps:

  1. Integrate the clipping discussed here
  2. Add CI to build the dataset on every PR and run our semantic comparison script to alert us to data changes
  3. Update README.md and METHODOLOGY.md once these changes are approved

Additionally, I have a few concerns about the datasets in the public_data folder.

  1. 03_ActivityMatch.csv and 03_ActivityMatch.geojson currently in source do not have the same column names, which suggests that 03_ActivityMatch.csv was manually cleaned up in an undocumented step. The cleanup appears to consist of just two changes:

    • Values from the activity_info and activity_info_2 columns were concatenated into a single activity_info column.
    • The set of columns was trimmed down.

    We can certainly replicate this in code, but I don't have true proof that those were the only steps taken to produce that CSV.

  2. 04_Clipped.geojson and 05_AcreageGreaterThan10.geojson have the same set of columns as 03_ActivityMatch.geojson (good!) but different columns than 06_All-STLs-on-Reservations-Final.geojson (bad!) and 02_All-STLs-on-Reservations.geojson (bad!). Again, it looks like the same set of changes as described above, but I don't have computational proof yet.
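If we do decide to replicate the suspected clean-up in code, the two steps listed under item 1 (concatenating activity_info and activity_info_2, then trimming the column set) could look roughly like the sketch below. The separator and the list of columns to keep are assumptions, since the manual step was never documented, and clean_row is a hypothetical name.

```python
def clean_row(
    row: dict[str, str],
    keep: list[str],
    sep: str = "; ",  # assumed separator; the real one is undocumented
) -> dict[str, str]:
    """Merge the two activity columns and keep only the requested columns.

    Args:
        row: One record, keyed by column name.
        keep: Columns to retain, in output order (an assumption here).
        sep: String placed between the two activity values.

    Returns:
        A new dict; the input row is not modified.
    """
    parts = [row.get("activity_info", ""), row.get("activity_info_2", "")]
    # Skip empty or whitespace-only values so we never emit a dangling separator.
    merged = sep.join(p.strip() for p in parts if p and p.strip())
    merged_row = {**row, "activity_info": merged}
    return {col: merged_row.get(col, "") for col in keep}
```

Running this over 03_ActivityMatch.geojson's properties and feeding the result to the semantic comparison against the checked-in 03_ActivityMatch.csv would give us the computational proof that is currently missing.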

Given the above, and the activity translation bug I found in this refactor, I'm pretty concerned about the integrity of all datasets after the activity matching process. If we wanted to have everything derived from code as the source of truth, I think we'd need to undertake migrating the entirety of the Methodology post-03-ActivityMatch into code.

clayton-aldern commented 1 month ago

Yayyyy amazing. Looking forward to reviewing. 2d10f3a is looking hefty already. :)