Add distributional information (data, scores, coefs) to DB

dmarx commented 7 years ago

Some reasons this might be valuable:

Gelman's "secret weapon" for tracking trends in coefficients over time: http://andrewgelman.com/2005/03/07/the_secret_weap/
Tracking distribution of features/target can help identify issues in feature engineering. E.g., if proportion of target tends towards zero over time: is this evidence that the model is being utilized to really strong effect? Is there an issue with how we are constructing the target variable?

Should probably rename database. It's called "modeling_results" right now. Maybe just call it "project_db" or something like that.

dmarx commented 7 years ago

New tables:

DATASETS

dataset_id
fpath
name
description

FIELDS

field_id
dataset_id
field_name
field_type

FIELD_STATS

stat_id
field_id
stat_name
stat_value
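
A rough sketch of what these three tables might look like, assuming the database is SQLite (the thread doesn't say). Column types, constraints, and the example row values are all guesses for illustration, not the existing schema:

```python
import sqlite3

# Hypothetical DDL for the three tables listed above; column types and
# constraints are assumptions, not taken from the existing schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS datasets (
    dataset_id  INTEGER PRIMARY KEY,
    fpath       TEXT NOT NULL,
    name        TEXT,
    description TEXT
);
CREATE TABLE IF NOT EXISTS fields (
    field_id   INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL REFERENCES datasets(dataset_id),
    field_name TEXT NOT NULL,
    field_type TEXT
);
CREATE TABLE IF NOT EXISTS field_stats (
    stat_id    INTEGER PRIMARY KEY,
    field_id   INTEGER NOT NULL REFERENCES fields(field_id),
    stat_name  TEXT NOT NULL,
    stat_value REAL
);
"""


def init_db(path=":memory:"):
    """Open a SQLite connection and create the tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


# Example rows; the path, names, and value are made up.
conn = init_db()
cur = conn.execute(
    "INSERT INTO datasets (fpath, name, description) VALUES (?, ?, ?)",
    ("data/abt.csv", "abt", "example analytic base table"),
)
dataset_id = cur.lastrowid
cur = conn.execute(
    "INSERT INTO fields (dataset_id, field_name, field_type) VALUES (?, ?, ?)",
    (dataset_id, "target", "numeric"),
)
field_id = cur.lastrowid
conn.execute(
    "INSERT INTO field_stats (field_id, stat_name, stat_value) VALUES (?, ?, ?)",
    (field_id, "mean", 0.12),
)
conn.commit()
```

Keeping FIELD_STATS in long format (one row per stat) means new stats can be added without a schema change. One catch: a REAL stat_value won't hold the mode of a categorical field cleanly, so that column might need to be TEXT or split in two.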

Stats we want for features/targets:

count
uniques
nulls
mean
median
mode
sd
q25
q50
q75
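
A minimal sketch of computing the stats listed above for one numeric column, using only the standard library. `profile_field` and `to_rows` are hypothetical helper names, nulls are assumed to arrive as `None`, and `count` is assumed to mean total rows including nulls:

```python
import statistics


def profile_field(values):
    """Return {stat_name: stat_value} for one column; None marks a null."""
    present = [v for v in values if v is not None]
    stats = {
        "count": len(values),            # assumption: count includes nulls
        "uniques": len(set(present)),    # distinct non-null values
        "nulls": len(values) - len(present),
    }
    if present:
        stats["mean"] = statistics.mean(present)
        stats["median"] = statistics.median(present)
        stats["mode"] = statistics.mode(present)
    if len(present) >= 2:
        # stdev and quantiles both need at least two data points.
        stats["sd"] = statistics.stdev(present)
        q25, q50, q75 = statistics.quantiles(present, n=4, method="inclusive")
        stats.update(q25=q25, q50=q50, q75=q75)
    return stats


def to_rows(field_id, stats):
    """Flatten the dict into (field_id, stat_name, stat_value) tuples."""
    return [(field_id, name, value) for name, value in stats.items()]


stats = profile_field([1, 2, 2, 3, None, 4])
rows = to_rows(7, stats)  # 7 is an arbitrary example field_id
```

With the inclusive quantile method, q50 and median come out as the same number, so one of the two is probably redundant in the list.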

Stats we want for modeling coefficients:

exp_id
var_name
coef
coef_err_low
coef_err_hi
coef_err_type
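
As a sketch, the coefficient fields above could live in their own table, with a query that pulls one variable's estimate across experiments (the "secret weapon" view). This assumes SQLite and assumes exp_id increases over time; the table name `coef_stats`, the helper names, the `"ci95"` label, and the numbers are all made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring the field list above; types are assumptions.
conn.execute("""
CREATE TABLE coef_stats (
    exp_id        INTEGER NOT NULL,
    var_name      TEXT NOT NULL,
    coef          REAL,
    coef_err_low  REAL,
    coef_err_hi   REAL,
    coef_err_type TEXT,
    PRIMARY KEY (exp_id, var_name)
)""")


def log_coefs(conn, exp_id, coefs, err_type="ci95"):
    """Insert one row per variable; coefs maps var_name -> (est, low, high)."""
    conn.executemany(
        "INSERT INTO coef_stats VALUES (?, ?, ?, ?, ?, ?)",
        [(exp_id, name, est, lo, hi, err_type)
         for name, (est, lo, hi) in coefs.items()],
    )
    conn.commit()


def coef_trend(conn, var_name):
    """Return one variable's estimate and interval, ordered by exp_id."""
    return conn.execute(
        "SELECT exp_id, coef, coef_err_low, coef_err_hi FROM coef_stats "
        "WHERE var_name = ? ORDER BY exp_id",
        (var_name,),
    ).fetchall()


# Made-up numbers purely for illustration.
log_coefs(conn, 1, {"age": (0.50, 0.40, 0.60)})
log_coefs(conn, 2, {"age": (0.45, 0.36, 0.54)})
trend = coef_trend(conn, "age")
```

Storing coef_err_type alongside the bounds keeps standard errors, confidence intervals, and credible intervals distinguishable when they end up plotted together.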

The framework above can probably be generalized sufficiently to supplant the RESULTS framework I've already got in the schema.

Maybe I don't need to be so insanely generalized. We can have a couple of results tables, and we can also have separate tables for tracking stats on the data and stats on coefs. I feel like model coefs lend themselves well to combining with the data stats. Modeling results should be fine, but the text thing... I dunno. Also, I need to figure out how best to add tasks to this schema, since we have tasks for base tables, models, scores, and evals, but not upstream tables.

dmarx commented 7 years ago

Let's break this up into a few separate tasks.

Log distributional information for raw data
Log distributional information for features/targets
Log distributional information for ABTs
Log distributional information for model coefficients

ABTs are fairly well formed already, so I think that's a good place to start. Logging model coefficients is probably non-trivial and might not be something we want to automate (e.g. we wouldn't want to log coefficients in a deep NN).

Yeah, logging data on ABTs I think makes sense. More generally, automating data profiling on ABTs probably isn't a terrible idea either.

dmarx / make_for_datascience

Add distributional information (data, scores, coefs) to DB #23