dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems
Other
182 stars 61 forks source link

Error on cohort or label duplicates #889

Closed shaycrk closed 2 years ago

shaycrk commented 2 years ago

Closes #872

Adds a check for duplicates in the labels or cohort table in order to error earlier and with a more helpful message when this occurs (currently, this tends to cause a duplicate primary key error far downstream when saving predictions, which can be difficult to debug).

Note that I actually haven't added a unit test for catching duplicates in the cohort because two routes (via query or labels) for generating these already do a distinct or group by so these shouldn't actually occur. We could consider removing this logic and putting the burden on user (which might enforce better understanding of what triage is doing/expecting), but I'm not sure that's too worthwhile.

@thcrock -- is the fact that there are two (essentially identical) database_reflection.py files (one in src/triage/ and the other in src/triage/component/architect/) an artifact of the merge to a single repo? I'd like to consolidate -- thoughts/preferences on which is a better place for this it to live?

thcrock commented 2 years ago

@shaycrk When it comes to the database reflection util I prefer to get rid of the arachitect version. In the now-outdated cleanup PR https://github.com/dssg/triage/pull/692/files I moved shared things to the top triage level.

shaycrk commented 2 years ago

Thanks @thcrock! I went ahead and consolidated into the top-level file here and didn't appear to break anything in the process 😄 so I'll go ahead and switch this PR out of draft mode since I think it's ready for you to take a look whenever you have a chance.

shaycrk commented 2 years ago

Thanks! I'll go ahead and merge in that case