dmarx / make_for_datascience

Demonstration of how to use GNU Make effectively in a robust data analytics pipeline
MIT License

Add distributional information (data, scores, coefs) to DB #23

Open dmarx opened 7 years ago

dmarx commented 7 years ago

Some reasons this might be valuable:

Should probably rename the database. It's called "modeling_results" right now. Maybe just call it "project_db" or something like that.

dmarx commented 7 years ago

New tables:

DATASETS

FIELDS

FIELD_STATS

Stats we want for features/targets:

Stats we want for modeling coefficients:

The framework above can probably be generalized sufficiently to supplant the RESULTS framework I've already got in the schema.
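For concreteness, here is a rough sketch of how the three tables named above might be laid out. It assumes the project DB is SQLite, and every column name is hypothetical, since only the table names are given here. FIELD_STATS is kept in a long (one row per stat) layout so that new stats, or coefficient summaries, could be added without changing the schema.

```python
import sqlite3

# Hypothetical columns throughout: only the table names come from this issue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS datasets (
    dataset_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE        -- e.g. the name of a base table
);
CREATE TABLE IF NOT EXISTS fields (
    field_id   INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL REFERENCES datasets(dataset_id),
    name       TEXT NOT NULL,              -- column name within the dataset
    UNIQUE (dataset_id, name)
);
CREATE TABLE IF NOT EXISTS field_stats (
    field_id   INTEGER NOT NULL REFERENCES fields(field_id),
    stat_name  TEXT NOT NULL,              -- e.g. 'mean', 'n_missing'
    stat_value REAL,
    PRIMARY KEY (field_id, stat_name)      -- one row per (field, stat)
);
"""


def create_schema(conn):
    """Create the three tables if they do not exist yet."""
    conn.executescript(SCHEMA)


# Demo against a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Whether coefficients really belong in the same long table as data stats is still an open question; the sketch only shows that the layout would allow it.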

Maybe I don't need to be so insanely generalized. We can have a couple of results tables, and we can also have separate tables for tracking stats on the data and stats on coefs. I feel like model coefs lend themselves well to combining with the data stats. Modeling results should be fine, but the text thing... I dunno. Also I need to figure out how best to add tasks to this schema, since we have tasks for base tables, models, scores, and evals, but not upstream tables.

dmarx commented 7 years ago

Let's break this up into a few separate tasks.

ABTs are fairly well formed already, so I think that's a good place to start. Logging model coefficients is probably non-trivial and might not be something we want to automate (e.g. we wouldn't want to log coefficients in a deep NN).

Yeah, logging data on ABTs I think makes sense. More generally, automating data profiling on ABTs probably isn't a terrible idea either.
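A minimal sketch of what automated profiling of an ABT could look like, assuming the ABT lives in a SQLite table. The function names, the `field_stats` layout, and the choice of stats (count, missing count, mean, min, max) are all assumptions for illustration, not something already in the repo, and the demo rows are toy data.

```python
import sqlite3
import statistics


def profile_table(conn, table):
    """Return {column: {stat_name: value}} for every column in `table`."""
    # `table` is interpolated into the SQL, so only pass trusted names.
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    profile = {}
    for i, col in enumerate(columns):
        values = [row[i] for row in rows]
        numeric = [v for v in values if isinstance(v, (int, float))]
        stats = {
            "n": len(values),
            "n_missing": sum(v is None for v in values),
        }
        if numeric:  # numeric summaries only where they make sense
            stats["mean"] = statistics.fmean(numeric)
            stats["min"] = min(numeric)
            stats["max"] = max(numeric)
        profile[col] = stats
    return profile


def log_profile(conn, table, profile):
    """Write the profile to a long-format stats table (hypothetical layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS field_stats ("
        "dataset TEXT, field TEXT, stat_name TEXT, stat_value REAL, "
        "PRIMARY KEY (dataset, field, stat_name))"
    )
    rows = [
        (table, field, stat_name, value)
        for field, stats in profile.items()
        for stat_name, value in stats.items()
    ]
    # Re-running the profiling step overwrites the previous stats.
    conn.executemany("INSERT OR REPLACE INTO field_stats VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# Toy data standing in for an ABT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abt (x REAL, label TEXT)")
conn.executemany(
    "INSERT INTO abt VALUES (?, ?)", [(1.0, "a"), (3.0, "b"), (None, "c")]
)
profile = profile_table(conn, "abt")
log_profile(conn, "abt", profile)
```

Something like this could run as its own task right after each ABT is built, so the stats get refreshed whenever the table is rebuilt.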