dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Issue 726 #799

Closed nanounanue closed 4 years ago

nanounanue commented 4 years ago

Oh boy, this PR tries to solve many issues, and I made several decisions in order to fix some of them.

Closes #791. There was a bug causing evaluations to be calculated several times; the same happened with predictions, protected groups, and rankings. Now they are calculated only when needed.
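A minimal sketch of the compute-once-when-needed pattern (class and attribute names here are hypothetical, not triage's actual API):

```python
from functools import cached_property

class EvaluationResults:
    """Hypothetical illustration: evaluations are computed lazily,
    on first access, instead of being recalculated on every use."""

    def __init__(self, predictions):
        self.predictions = predictions

    @cached_property
    def evaluations(self):
        # Runs once; subsequent accesses reuse the cached result.
        return self._compute_evaluations()

    def _compute_evaluations(self):
        # Placeholder for the real metric calculations.
        return {"precision@100_abs": 0.42}
```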

Closes #768. For some reason, even with information available about how to break ties, the ranking was calculated after the evaluation and prediction insertion, via a PostgreSQL UPDATE. Oddly, the predictions were queried back from PostgreSQL, all the ranking calculations were executed in Python, and the results were then written back with an UPDATE. This obviously gets more expensive over time. Now the ranking happens during the calculation of the predictions and is inserted alongside them.
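A sketch of the new flow, with ranks computed in memory and written in the same insert as the predictions (column names are illustrative, not triage's exact schema):

```python
import pandas as pd

def predictions_with_ranks(preds: pd.DataFrame) -> pd.DataFrame:
    """Attach rank columns before insertion, so no later
    SELECT-then-UPDATE round trip against PostgreSQL is needed."""
    ordered = preds.sort_values("score", ascending=False).reset_index(drop=True)
    ordered["rank_abs"] = ordered.index + 1
    ordered["rank_pct"] = ordered["rank_abs"] / len(ordered)
    return ordered

# The enriched frame then goes to the database in a single write, e.g.:
# predictions_with_ranks(preds).to_sql("predictions", engine, if_exists="append")
```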

Closes #741. The ranking was using numpy's lexsort, which has no option for handling None; by default, all the None labels were pushed to the top, regardless of the tiebreaker. Now we use pandas methods (which use numpy under the hood, so there is no significant extra computational cost) to handle this. Hence, for best we sort by score and then by label, both in descending order, with nulls last; for worst we sort score in descending order and labels in ascending order, with nulls first.
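For illustration, the best/worst orderings can be expressed with pandas `sort_values` and its `na_position` argument (the DataFrame and column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "entity_id": [1, 2, 3, 4],
    "score": [0.9, 0.9, 0.9, 0.7],
    "label_value": [0, None, 1, 1],
})

# best: score descending, then label descending, missing labels last
best = df.sort_values(
    ["score", "label_value"], ascending=[False, False], na_position="last"
)

# worst: score descending, then label ascending, missing labels first
worst = df.sort_values(
    ["score", "label_value"], ascending=[False, True], na_position="first"
)
```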

Closes #726 Closes #732. We were getting weird behaviour when random_seed was not set in the experiment configuration file. This behaviour was introduced when random_seed was incorporated: the seed was used in the creation of the experiment_hash, so every time the experiment was executed without specifying a seed, everything was recalculated (except matrices). This was nonsensical. The experiment is not determined by the seed, but by its description: which models? which label? which cohorts? which features?

Given that, how can you explore how your results depend on the seed you choose? Every time an experiment is executed, the random seed used is stored in the experiment_run, and every model now has a link to the run that created it (therefore you can trace which random seed influenced the training and evaluation of the model). If the random seed is specified in the configuration file and you run the experiment several times, everything is executed just once. If you remove the random seed from the configuration file, you keep the same experiment, but each run will give you different models, predictions, and evaluations.

Note that you may end up with potentially many models for the same as_of_date in a specific model group. This is intended: if you see a lot of variance in your metric or your list, your model is fragile, and maybe you don't want that. Changes to Audition still need to be done, though (see issue #800).
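To illustrate the hashing change, a simplified sketch (not triage's actual hashing code) of deriving the experiment hash from the definition while excluding the seed:

```python
import hashlib
import json

def experiment_hash(config: dict) -> str:
    """Hash the experiment's definition -- cohort, labels, features,
    model grid -- while excluding random_seed, so runs that differ
    only in seed map to the same experiment."""
    definition = {k: v for k, v in config.items() if k != "random_seed"}
    return hashlib.md5(
        json.dumps(definition, sort_keys=True).encode()
    ).hexdigest()

# Two runs of the same config with different seeds yield the same
# experiment_hash; the seed itself is recorded on each experiment_run row.
```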

codecov-commenter commented 4 years ago

Codecov Report

Merging #799 into master will decrease coverage by 0.61%. The diff coverage is 66.15%.


@@            Coverage Diff             @@
##           master     #799      +/-   ##
==========================================
- Coverage   82.47%   81.86%   -0.62%     
==========================================
  Files          98      102       +4     
  Lines        6917     6980      +63     
==========================================
+ Hits         5705     5714       +9     
- Misses       1212     1266      +54     
Impacted Files Coverage Δ
...e/component/catwalk/protected_groups_generators.py 93.65% <0.00%> (-1.52%) :arrow_down:
...component/postmodeling/contrast/utils/aux_funcs.py 0.00% <0.00%> (ø)
...t/results_schema/alembic/versions/4ae804cc0977_.py 0.00% <0.00%> (ø)
...t/results_schema/alembic/versions/9bbfdcf8bab0_.py 0.00% <0.00%> (ø)
.../versions/a98acf92fd48_add_nuke_triage_function.py 0.00% <0.00%> (ø)
...t/results_schema/alembic/versions/fa1760d35710_.py 0.00% <0.00%> (ø)
src/triage/tracking.py 84.46% <ø> (ø)
src/triage/cli.py 80.43% <66.66%> (-1.24%) :arrow_down:
src/triage/component/architect/utils.py 42.50% <66.66%> (+1.47%) :arrow_up:
src/triage/component/catwalk/model_trainers.py 92.05% <66.66%> (ø)
... and 13 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 21d99a0...a8f095c.

shaycrk commented 4 years ago

Looks pretty good overall, but see my one inline comment in predictors.py (possibly why Travis is sad as well?) and we should do a bit of testing before merging, of course.

nanounanue commented 4 years ago

The errors were related to the df.sample problem that you mentioned and a new incompatibility between fakeredis and rq (already fixed; all tests pass now).

shaycrk commented 4 years ago

@nanounanue -- take another look at the df.sample logic I commented on, but looks good to me otherwise.