dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems
Other
182 stars 61 forks source link

Remove support for non-entity_id grouping for features #887

Closed shaycrk closed 2 years ago

shaycrk commented 2 years ago

Specifying groups other than entity_id has been error-prone because it requires (but doesn't enforce/check for) a one-to-one relationship between the entities and other grouping levels (discussed in #874). As such, this PR removes support for specifying the groups key in the feature configuration (as well as removing instances where this is done/discussed in the tests and docs), bumping the config version to v8 as result.

Notably, this doesn't change the underlying collate code itself -- there doesn't seem to be anything special there about the entity-level grouping as groups is simply passed as a list which is iterated over in a few places. Restricting the list to just ["entity_id"] means a few loops are just traversed once but doesn't result in a bunch of cruft/untouched code, and it hardly seems worth simply removing the for loops.

A couple of incidental/unrelated changes as well, I updated documentation to reflect the python 3.7+ requirement as well as the python version used by ./develop to 3.9.10.

Also, @ecsalomon -- once we're ready to merge both this and #883, we should be sure to update the upgrade-to-v8.md documentation to reflect both this change and specifying SQL files for the labels/cohort configs.