dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Utility for writing predictions of selected model groups #836

Closed kasunamare closed 3 years ago

kasunamare commented 3 years ago

This PR contains a script for saving the predictions of a set of selected model groups. This is meant for the following use case:

The config file for this utility takes the experiment hash, a list of model groups, the project path, and, optionally, a range of train_end_times.
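A minimal sketch of what such a config file might look like, based only on the fields described above (the exact key names and values here are illustrative assumptions, not the utility's actual schema):

```yaml
# Hypothetical config for the prediction-writing utility
experiment_hash: "abcd1234"
model_group_ids:
  - 10
  - 42
project_path: "s3://my-bucket/triage-output"
# Optional: limit which train_end_times get scored
train_end_times:
  range_start: "2016-01-01"
  range_end: "2018-01-01"
```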

shaycrk commented 3 years ago

Looks good overall, but I put a couple of quick notes inline. It would probably also be good to include a brief README.md here with usage instructions.

kasunamare commented 3 years ago

Thanks for taking a look. I made the changes and added a README.

thcrock commented 3 years ago

Should this be in the triage CLI (at `src/triage/cli.py`)? That would make it easier to discover. Audition and other postmodeling work already have interfaces there.

kasunamare commented 3 years ago

@thcrock that makes sense to me. @shaycrk what do you think?

Making it more visible would help users plan their experiments better, I think, since it would let them run a larger model grid with `save_predictions=False` without worrying too much about the DB space and speed issues caused by bloated prediction tables.

shaycrk commented 3 years ago

@thcrock -- this actually touches on a conversation @KasunAmare and I have been having separately about how some of these pieces fit together. It feels like adding predictions to existing models, adding individual explanations, and predicting forward are all cases where we might want to take one (or a handful) of existing model_group_ids and generate new outputs, and might benefit from re-using some of the experiment machinery and tracking to do so. I feel like this PR gives us a good starting point, but might change a fair bit if we couple the code more closely with those other use cases as well as existing experiments. If that direction makes sense to you, do you think it makes sense to add a CLI interface now or wait until that's a bit more built out?

thcrock commented 3 years ago

This code is adding a command line interface already. My suggestion is just that, if we are adding a CLI for a new piece of code, we make it accessible from the `triage` command instead of fragmenting the CLIs.

I have no problem with adding new code sans CLIs if the goal is to iterate on it first.

kasunamare commented 3 years ago

@thcrock @shaycrk -- What do you think the next step should be: remove the CLI and keep the code in postmodeling, or keep the CLI and move it into the triage CLI? While this could become part of something larger and will probably change in the future, as Kit mentioned, I feel it is helpful in its current form too, at least to the two active projects. In that sense, keeping the CLI would make it easier to use, so I'm slightly partial to the latter. What do you all think?

thcrock commented 3 years ago

I'm partial to the latter. The code in `__main__` can very easily be ported to the triage CLI, since it's already using argparse. I think the main reason not to add a CLI is if you're specifically trying to hide it for now.
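As a generic illustration of the kind of port being suggested (this is plain argparse, not triage's actual CLI framework, and the subcommand and argument names are made up for the example), a standalone script's `__main__` arguments can be folded into a parent parser as a subcommand:

```python
import argparse


def build_parser():
    """Sketch of folding a standalone argparse script into a parent CLI
    as a subcommand. Names here are illustrative, not triage's real API."""
    parser = argparse.ArgumentParser(prog="triage")
    subparsers = parser.add_subparsers(dest="command")

    # Arguments that previously lived in the script's own __main__
    # move onto the subcommand's parser.
    predict = subparsers.add_parser(
        "savepredictions",
        help="write predictions for selected model groups",
    )
    predict.add_argument(
        "--config", required=True, help="path to the utility's config file"
    )
    return parser
```

The parent parser then handles dispatch, so the script no longer needs its own entry point.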

thcrock commented 3 years ago

And actually, I think you can cut down a lot of the code in `run` by making use of what's inherited from the root CLI command. You can just pass the `dburl` from the root CLI command into `run` and remove all of the dbfile-parsing code (you could also pass in a `db_engine`, much like the Crosstabs command does). This has the added advantage of not forcing the user to have a DB credentials file; the CLI lets them use environment variables instead, which are safer. This kind of standardization is one of the reasons for a single triage CLI versus fragmented smaller ones.

shaycrk commented 3 years ago

Yeah, @thcrock's points seem reasonable to me, and I agree it's useful in its current form as a starting point that can hopefully merge further into the existing experiment (as well as the new predict-forward) infrastructure.

I think in using it we've also run into a few other small things that might be worth adjusting before merging:

kasunamare commented 3 years ago

@shaycrk @thcrock -- Made the following changes to the code.

Sorry this took so long. If you can take a look, that would be great.

Question -- I am currently testing this with the ACLU database and its experiments. I wasn't sure how to add a test for this in `test_cli.py` with an example config in `example/config`. I'm guessing that putting ACLU experiment hashes and model group IDs in an example wouldn't be ideal? What would be the best way to add a test?

shaycrk commented 3 years ago

Thanks @KasunAmare -- I took a pass through and added a few notes inline!

On the question about tests, I think most of the tests in test_cli.py simply check that the right functionality is invoked without running it, so you can probably follow the patterns there for the same thing.

To test the generate_predictions function itself here, you'll probably want to create a new file in postmodeling_tests and then use a test experiment/DB to make sure predictions get added. See conftest.py for an example of generating such an experiment and then postmodeling_tests/test_without_predictions.py for how this can be used.
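The "check that the right functionality is invoked without running it" pattern mentioned above can be sketched with a mock. The `run_cli` dispatch function and its argument names below are stand-ins invented for the example, not triage's actual code:

```python
from unittest import mock


def run_cli(argv, generate_predictions):
    """Stand-in for a CLI dispatcher (illustrative only): if the
    subcommand matches, invoke the prediction writer with the
    config path from the arguments."""
    if argv and argv[0] == "savepredictions":
        generate_predictions(config_path=argv[1])


def test_cli_invokes_generate_predictions():
    # Replace the real function with a mock, run the CLI path,
    # and assert only that the call happened with the right args.
    fake = mock.Mock()
    run_cli(["savepredictions", "cfg.yaml"], fake)
    fake.assert_called_once_with(config_path="cfg.yaml")
```

This keeps the CLI test fast and database-free; the heavier test of `generate_predictions` itself then lives alongside the other postmodeling tests against the shared test experiment.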

kasunamare commented 3 years ago

Thanks, @shaycrk. I'm happy to wait until we tag the new release.