dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Colab Triage Tutorial #878

Closed shaycrk closed 2 years ago

shaycrk commented 2 years ago

Created a quick, lightweight walkthrough of triage using Google Colab as an easier way to introduce people to triage without any setup before they dive deeper via the dirty duck tutorial. If you want to run through the notebook, you can do so here (note that you'll need to log in with a Google account):

https://colab.research.google.com/github/dssg/triage/blob/kit_colab_triage/example/colab/colab_triage.ipynb

Let me know what you think here, particularly around the level of depth. I wanted to cover all the major aspects without getting bogged down in too many details, while providing links out to other resources with more information.

ecsalomon commented 2 years ago

I've gone through the first third to half of this tutorial so far. I like that it starts with a well-known, familiar data set and proceeds from just pip installing triage. I wonder if we want to get started even faster, though. One possibility would be to start with something like:

This notebook provides a quick, interactive tutorial for triage, a Python machine learning pipeline for social good problems, using a sample of the data provided by DonorsChoose to the 2014 KDD Cup. DonorsChoose allows teachers to crowdsource funding for projects, <example project; also, why do we live in a world where teachers are begging for money for their classrooms on the internet? 😭 >. Projects on DonorsChoose expire after 4 months, and if the target funding level isn't reached, the project receives no funding.

DonorsChoose has hired a digital content expert who will review projects and help teachers improve their postings and increase their chances of reaching full funding. The digital content expert has time to review X <I think an absolute threshold makes sense for this rather than a percentage, but 🤷🏻 > proposals every day.

You are a data scientist working with DonorsChoose, and your task is, every morning, to identify the projects posted in the last day that are least likely to be fully funded in the next four months and pass them off to the digital content expert for review.

Then I would probably walk through what a label will look like and where the features will come from (temporally), and explain that the daily cohort is the projects posted in the last day. Your daily result is a list of the top X projects, and your success criterion is the precision on that list.
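To make that framing concrete, here is a minimal Python sketch of the label, the daily cohort, and precision on the top X list. It is not triage code: the `Project` record, its field names, and the 120-day approximation of "four months" are all assumptions made for illustration. Note that a label can only be computed once the funding window has passed, which is why the temporal setup matters later.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional

# Assumption: "four months" is approximated as 120 days.
FUNDING_WINDOW = timedelta(days=120)


@dataclass
class Project:
    # Hypothetical record; the field names are not from the DonorsChoose schema.
    project_id: str
    posted: date
    fully_funded: Optional[date]  # None if the project never reached full funding


def label(project: Project) -> int:
    """Return 1 if the project was NOT fully funded within the window, else 0."""
    if project.fully_funded is None:
        return 1
    return int(project.fully_funded - project.posted > FUNDING_WINDOW)


def daily_cohort(projects: List[Project], today: date) -> List[Project]:
    """Projects posted in the last day, i.e. yesterday's postings."""
    yesterday = today - timedelta(days=1)
    return [p for p in projects if p.posted == yesterday]


def top_x(scores: Dict[str, float], x: int) -> List[str]:
    """IDs of the x projects with the highest predicted risk of not being funded."""
    return sorted(scores, key=scores.get, reverse=True)[:x]


def precision_at_top_x(scores: Dict[str, float], labels: Dict[str, int], x: int) -> float:
    """Share of the top-x list that truly went unfunded (label == 1)."""
    chosen = top_x(scores, x)
    if not chosen:
        return 0.0
    return sum(labels[pid] for pid in chosen) / len(chosen)
```

With an absolute threshold, X is simply the number of proposals the digital content expert can review each day, and the success criterion is `precision_at_top_x` evaluated at that X.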

Then get into: now that we know the shape of the problem, let's set up our machine, figure out how to represent the problem in data and config, and add the additional complexity of including multiple days in the train set. Right now, especially in the visual, I think it's hard to understand how that works.
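One way to picture the multiple-days-in-the-train-set part is the simplified sketch below. It is not triage's actual temporal logic; the function names, parameters, and the rule that a training row is usable only if its label window closes by the split date are assumptions chosen to illustrate the idea.

```python
from datetime import date, timedelta
from typing import Dict, List


def training_as_of_dates(
    split_date: date,
    train_label_timespan: timedelta,
    train_frequency: timedelta,
    max_training_history: timedelta,
) -> List[date]:
    """As-of dates whose label windows are fully observed by split_date."""
    if train_frequency <= timedelta(0):
        raise ValueError("train_frequency must be positive")
    # The latest usable training date leaves room for a full label window.
    latest = split_date - train_label_timespan
    # Assumption: history is measured back from the latest training date.
    earliest = latest - max_training_history
    dates = []
    current = latest
    while current >= earliest:
        dates.append(current)
        current -= train_frequency
    return sorted(dates)


def build_split(
    split_date: date,
    train_label_timespan: timedelta,
    test_label_timespan: timedelta,
    train_frequency: timedelta,
    max_training_history: timedelta,
) -> Dict[str, object]:
    """Describe one train/test split as as-of dates plus their label windows."""
    train_dates = training_as_of_dates(
        split_date, train_label_timespan, train_frequency, max_training_history
    )
    return {
        "train_as_of_dates": train_dates,
        # Each training label looks forward from its own as-of date.
        "train_label_windows": [(d, d + train_label_timespan) for d in train_dates],
        # The test cohort is scored on the split date and labeled afterwards.
        "test_as_of_date": split_date,
        "test_label_window": (split_date, split_date + test_label_timespan),
    }
```

In this sketch, every training day contributes its own cohort and its own forward-looking label window, and no training label window extends past the date on which the test cohort is scored.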

nanounanue commented 2 years ago

I totally agree with @ecsalomon's comment!

Actually I am using that problem description in today's class :)

I hope to have time this week to finish the triage Colab, but so far it seems very good :+1:

shaycrk commented 2 years ago

Thanks @ecsalomon and @nanounanue -- did you happen to have a chance to finish running through it?

Yeah, I definitely want to keep it on the shorter side as much as possible, but I figured most people wouldn't be familiar with the DonorsChoose data, so I thought it would be important to give a little overview of the data (and of how to interact with the database in Colab). But it sounds like you both think it would be better to skip straight to the triage/modeling setup?

ecsalomon commented 2 years ago

I think for me it's more: dive fully into the problem framing and what a label and feature would look like conceptually before doing that data exploration, and then use that framing to drive the data exploration in a goal-oriented way ("I know I need to calculate x as a label; what would I need to know to calculate it; where is that in the data?").

nanounanue commented 2 years ago

Hi @shaycrk, I think there is a problem with the TCV image: the train_label_timespan and test_label_timespan are swapped.

nanounanue commented 2 years ago

And I totally agree with @ecsalomon: dive a little more into how the problem is modeled. For example, predicting only on the projects posted that day seems weird, because we are ignoring all the potential dynamics of the donations... Or am I missing something?
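A rough sketch of the two cohort choices being contrasted here, under stated assumptions: the `Posting` record and its fields are hypothetical, and the four-month expiry is approximated as 120 days. The second function shows one possible alternative cohort, every project still live on the prediction date, which would let the model see projects again as donations accumulate.

```python
from datetime import date, timedelta
from typing import List, NamedTuple, Optional

# Assumption: the four-month expiry is approximated as 120 days.
EXPIRY = timedelta(days=120)


class Posting(NamedTuple):
    # Hypothetical record; not the DonorsChoose schema.
    project_id: str
    posted: date
    fully_funded: Optional[date]  # None if not (yet) fully funded


def posted_yesterday(postings: List[Posting], as_of: date) -> List[str]:
    """Cohort from the proposed framing: only projects posted in the last day."""
    return [p.project_id for p in postings if p.posted == as_of - timedelta(days=1)]


def still_active(postings: List[Posting], as_of: date) -> List[str]:
    """Alternative cohort: posted before as_of, not expired, not yet fully funded."""
    return [
        p.project_id
        for p in postings
        if p.posted < as_of <= p.posted + EXPIRY
        # Only funding that happened before as_of is known on that date.
        and (p.fully_funded is None or p.fully_funded >= as_of)
    ]
```

The daily cohort scores each project exactly once, on the morning after it is posted; the active cohort rescores open projects every day, at the cost of a larger list to rank.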

nanounanue commented 2 years ago

Final comment (I've never used Colab before): is it not possible to prepare an image (I am assuming a Docker image, VM image, or something similar) so that you don't start with the installation process?

shaycrk commented 2 years ago

Ok -- updated the introduction to give better context and orientation to the modeling problem along the lines of what @ecsalomon and @nanounanue suggested. Let me know what you think!

@nanounanue -- to your two other questions:

Anyway, take another quick look at the changes here when you get a chance and see what you think. I'm hoping it's in ok shape to merge (at least as a first pass, even if we want to continue improving it in the future). Thanks!

shaycrk commented 2 years ago

Merged master and updated the docs to reflect the new tutorial (unfortunately, the Colab links won't work until this gets merged to master, as they reference the repo...).

A couple questions (I think for @nanounanue):

shaycrk commented 2 years ago

@ecsalomon and @nanounanue -- Just checking back if you have time to take another quick look here. Thanks!

shaycrk commented 2 years ago

Going ahead and merging since I think it's in good shape and has been lingering for a while. We can certainly continue to update and refine the Colab tutorial further in the future if anyone has more feedback on it.