dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems
Other
182 stars 61 forks source link

Allow users to specify a single cohort + label query in the config #868

Closed shaycrk closed 2 years ago

shaycrk commented 2 years ago

The separate cohort and label queries in the triage configuration file often lead to some confusion and users duplicating logic between the two to ensure labels are calculated for the right set of entities at a given as_of_date. Although there are good use cases for separate queries and we should maintain the option, in general, it would be nice to be able to specify a single query that provides both the set of entity_ids and outcome values at a given point in time.

A quick and dirty way to do this might be the same way we handle specifying a single label_timespans parameter, by duplicating the value and associating it with both train_label_timespans and test_label_timespans in the config that actually gets used by triage downstream. One thing to explore with this approach is whether the cohort query has access to the label timespan parameter when it gets run? If not, it should be easy enough to grab the first value in this list to use (making the assumption that the cohort shouldn't depend on the length of the label window, as we already assume).

thcrock commented 2 years ago

@shaycrk I just sat down to start on this, and the first step was going to be the interface. I was looking at the example config file and trying to think of what a clear way to message this to the user would be. Looking at it, I had an idea. What do you think of simply making the cohort_config optional, instead of adding a new config key? If the cohort_config is missing, treat the label config as the source of the cohort.

thcrock commented 2 years ago

@shaycrk pinging you again on the above question

shaycrk commented 2 years ago

@thcrock -- sorry, I missed this question before! I think that generally makes sense from a parsimony standpoint, though from a user standpoint, I wonder if it's more intuitive for the label config to be optional (in which case the cohort config would also need to specify the label)?

That naming might make it more explicit that this query will be used to identify the cohort as well as the label that you're specifying as a column (whereas naming it label_config seems to make the fact that it's also specifying the cohort a little hidden behind the scenes). But that also seems like it might be a little more messy in terms of the code, and I might be overthinking it, so I don't feel too strongly here.

shaycrk commented 2 years ago

closed via #877