dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Make using groups other than `entity_id` in collate less error-prone #874

Closed shaycrk closed 2 years ago

shaycrk commented 2 years ago

Mostly for discussion, but it seems like using anything other than `entity_id` when specifying groups in a feature aggregation (e.g., zipcode) is currently pretty error-prone. For instance, if there isn't a 1-to-1 relationship between `entity_id` and these other columns, you'll end up with multiple records in the matrix with the same (`entity_id`, `as_of_date`) key, which causes many problems downstream.
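To make the failure mode concrete, here is a minimal sketch in plain Python. It is not triage's collate code: the data, the `events_count` feature, and all variable names are hypothetical, and the join is only an assumed stand-in for how group-level features get attached back to entities.

```python
from collections import Counter

# Hypothetical mapping: entity 1 is linked to two zipcodes (not 1-to-1).
entity_zipcodes = [(1, "60601"), (1, "60602"), (2, "60601")]

# Hypothetical zipcode-level feature values as of a single date.
zipcode_features = {
    ("60601", "2020-01-01"): 5,
    ("60602", "2020-01-01"): 3,
}

# Attach the zipcode-level feature to each entity through its zipcode(s).
matrix_rows = [
    {"entity_id": entity_id, "as_of_date": as_of_date, "events_count": value}
    for entity_id, zipcode in entity_zipcodes
    for (feature_zip, as_of_date), value in zipcode_features.items()
    if feature_zip == zipcode
]

# Count how many rows share each (entity_id, as_of_date) key.
key_counts = Counter((row["entity_id"], row["as_of_date"]) for row in matrix_rows)
duplicate_keys = {key: n for key, n in key_counts.items() if n > 1}

print(duplicate_keys)  # {(1, '2020-01-01'): 2}
```

Entity 1 ends up with two rows for the same date, one per zipcode, so (`entity_id`, `as_of_date`) no longer uniquely identifies a matrix row.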

Thoughts on how to improve the functionality here? Some options:

shaycrk commented 2 years ago

Note that we decided to remove support for collate groups other than `entity_id` for the time being via #887, especially pending further discussion on what direction we want to take with feature engineering generally. Will go ahead and close this issue for now, though it's possible someone might want to revisit this question in the future.