fix: Resolve #3. Reduce STLoR codebase to the minimum dependencies and code needed to reproduce the public dataset.

Buckle in, this is a chunky one. On the plus side, this reduces the amount of code we need to maintain from 9500 LOC in the land_grab_2 directory to 2142 LOC in the stlor directory—a whopping 77.5% decrease! And given the addition of formatting and doc strings that tend to add lines, this is likely an even larger reduction.

Strategy

My core strategy with this PR was to strike a balance between more straightforward changes (dead code elimination, paring down dependencies) and more aggressive, semantics-preserving refactoring. There's undoubtedly more we could do here, but I think this gets us to a substantially more maintainable place.

Checking for Correctness

How do we know these changes preserve the core of the factchecked dataset? For this, I relied on our new CSV semantic compare functionality, introduced in stlor/compare.py. I generated the output datasets using code from 3ebf200dcf8db2fcf789ed234658feee8da1d503, with 02_SendToActivityMatch.geojson as input. I then compared the output generated by that commit and the output generated by these changes for semantic equivalence by running:

python stlor/compare.py "public_data/04_All States/3ebf200d.csv" "public_data/04_All States/03_ActivityMatch-Original.csv"

Note that I brought in 3ebf200d.csv locally to run the comparison. Additionally, I patched 3ebf200d locally with a bug fix that will be headed in upstream to land-grab-2 shortly.

How to Read this PR

I think it's worthwhile to focus in on the code changes in the stlor directory, but not much else. A huge bulk of the changes here are related to data directory restructuring.

Key Changes

Core Code

All core code lives in the stlor directory, which has the following structure:

__init__.py → Necessary to direct Python to treat the stlor directory as a package.
- Note: I'm not convinced we actually want a package here, but this keeps with the convention of land-grab-2.
activity.py → Contains all functions related to capturing activity and activity_info information.
compare.py → Our script for semantically comparing CSVs. This accounts for inconsistent row and column ordering between CSVs, in addition to variable ordering of substrings in the activity and activity_info columns. This will be what we run in CI to alert us when a code change results in an update to the dataset.
config.py → All configuration for the activity matching process. This is primarily composed of dictionaries mapping state codes to StateActivityDataSources, state codes to activity codes, state codes to activity rewrite rules, and activity strings to rights types.
constants.py → What it says on the tin!
entities.py → Our small handful of classes, including StateActivityDataSource, StateForActivity, and the RightsType enum.
main.py → The primary functions for running our activity matching process.
overlap.py → All functions for our R-tree spatial index construction and probing, along with spatial predicates for determining relationships between STL parcels and state activity layers.
utils.py → The dreaded (but necessary) utils file, primarily housing functions for parallelism with dask.

Function Collapsing and Simplification

land-grab-2 had a very large API surface, with a lot of utility functions overloaded with kwargs to support different use cases. Investigating the call sites of functions that are used in the activity matching process, a tremendous amount of this diverse functionality was never used.

Additionally, the "cut points" for functions didn't always make sense; functions would often either do way too much or too little. I strove for fewer functions with tidier encapsulation of behavior. This is largely a matter of style and opinion, and at certain points I decided to just let the existing function boundaries stay as they were. I know, real principled.

Linting and Formatting

This PR integrates Ruff for linting and formatting, which is the state of the art in the ecosystem. You can lint and format the codebase like so:

ruff check

Alternatively, install the Ruff VSCode extension to enable linting and formatting on save.

Type Hints

This PR also adds Python type hints! This felt like a nice middle ground between full dynamic tying and integrating a more aggressive static type checker like mypy. Particularly in the deep reeds of refactoring, having some sense of the types being assed around by our functions was extremely useful. We can have a larger discussion about integrating a static type checker at another time, but I think this is a good enough middle ground for now.

Doc Strings

This PR also adds doc strings to our functions, documenting arguments, return values, and side effects.

Standardizing on `pip`, `venv`, and `pyproject.toml`

Previously, this project had configuration set up for Poetry, but we weren't actually using it. This PR scraps poetry for a lower friction standard setup with pip, venv, and pyproject.toml. Users can get our exact dependency versions by running:

pip install .

Additionally, we had a plethora of unnecessary dependencies for this step. I've pared down our dependencies to just the necessary ones for the activity matching process.

Data Directory Restructuring

A large portion of this diff is connected to repository restructuring. Specifically, files previously located at data/stl_dataset/step_2/input/stl_activity_layers are now located at data/stl_activity_layers. I wanted to remove any reference to our "step" architecture from the land-grab-2 repo.

Future Work

This PR started to get big, and now feels like the right time to cut it. But here are a few ideas for next steps:

Integrate the clipping discussed here
Add CI to build the dataset on every PR and run our semantic comparison script to alert data changes
Update README.md and METHODOLOGY.md once these changes are approved

Additionally, I have a few concerns about the datasets in the public_data folder.

03_ActivityMatch.csv and 03_ActivityMatch.geojson currently in source do not have the same column names, which suggests that 03_ActivityMatch.csv was manually cleaned up in an undocumented step. It appears to just be:
- Values from the activity_info and activity_info_2 columns were concatenated into a single activity_info column.
- The set of columns was trimmed down.
We can certainly replicate this in code, but I don't have true proof that those were the only steps taken to produce that CSV.
04_Clipped.geojson and 05_AcreageGreaterThan10.geojson have the same set of columns as 03_ActivityMatch.geojson (good!) but different columns than 06_All-STLs-on-Reservations-Final.geojson (bad!) and 02_All-STLs-on-Reservations.geojson (bad!). Again, it looks like the same set of changes as described above, but I don't have computational proof yet.

Given the above, and the activity translation bug I found in this refactor, I'm pretty concerned about the integrity of all datasets after the activity matching process. If we wanted to have everything derived from code as the source of truth, I think we'd need to undertake migrating the entirety of the Methodology post-03-ActivityMatch into code.

Grist-Data-Desk / STLoR