UCBoulder / oit-ds-tools-prefect

Common tasks and tools for use in Prefect Flows

Add comparer and associated unit tests #18

Closed jashbycu closed 1 year ago

jashbycu commented 1 year ago

Unit tests focus on the low-level functions, since the high-level functions are pretty thin wrappers around them. That said, I did test the high-level functions as well.

The intent here is that, because the comparer returns its results as dicts, they can be easily adapted to other formats (loaded into a DataFrame, pretty-printed, or whatever). So while you can use this module in your own manual testing and validation, we could also easily integrate it with the next-gen caching system we have planned: https://cu-oda.atlassian.net/browse/UCBDE-63?search_id=23f1d8af-8177-4ccd-89a8-6433d7a915ce
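To illustrate why dicts are convenient here, a rough sketch follows. The shape of `diff_report` and its keys (`rows_changed`, `max_abs_diff`) are made up for the example and are not necessarily what the module returns; the point is only that a plain dict converts to a DataFrame or to pretty-printed text in one call each.

```python
import json

import pandas as pd

# Hypothetical report shape: the real keys returned by the comparer
# are not shown in this PR description.
diff_report = {
    "gpa": {"rows_changed": 1, "max_abs_diff": 1.0},
    "name": {"rows_changed": 1, "max_abs_diff": None},
}

# Tabular view: one row per column that changed.
as_frame = pd.DataFrame.from_dict(diff_report, orient="index")

# Pretty-printed view, e.g. for a log message or a ticket comment.
as_text = json.dumps(diff_report, indent=2)
```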

This is why I added this to the ucb_prefect_tools library, even though it doesn't directly relate to it right now. This paves the way for us to automatically generate diff reports whenever we cache a dataframe.

My approach to comparing dataframes was grounded in the scenarios I believe we face most often in higher ed: we have fairly similar before/after datasets, but we don't always know what the primary keys should be, and we don't really need much floating-point precision. So I wrote my code to highlight impactful differences while ignoring what is likely to be just noise (differences in data types, small numerical differences, etc.).
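As a rough illustration of that idea (not the code in this PR), here is a minimal sketch of a noise-tolerant comparison. The function name `compare_columns`, its parameters, and the report keys are all assumptions for the example. It also assumes rows can be matched by position, which sidesteps the unknown-primary-key problem rather than solving it.

```python
import numpy as np
import pandas as pd


def _as_numeric(series: pd.Series):
    """Return the series as numbers, or None if any real value is non-numeric."""
    converted = pd.to_numeric(series, errors="coerce")
    # Coercion turns unparseable values into NaN; if that created new NaNs,
    # the column is not really numeric.
    if converted.isna().sum() == series.isna().sum():
        return converted
    return None


def compare_columns(before, after, rel_tol=1e-6, abs_tol=1e-9):
    """Hypothetical helper: report impactful differences as a plain dict."""
    report = {
        "only_in_before": [c for c in before.columns if c not in after.columns],
        "only_in_after": [c for c in after.columns if c not in before.columns],
        "row_counts": {"before": len(before), "after": len(after)},
        "changed": {},
    }
    n = min(len(before), len(after))  # compare only rows present in both
    for col in (c for c in before.columns if c in after.columns):
        left = before[col].iloc[:n].reset_index(drop=True)
        right = after[col].iloc[:n].reset_index(drop=True)
        left_num, right_num = _as_numeric(left), _as_numeric(right)
        if left_num is not None and right_num is not None:
            # Numeric on both sides: ignore dtype and tiny differences.
            same = np.isclose(
                left_num.to_numpy(dtype=float, na_value=np.nan),
                right_num.to_numpy(dtype=float, na_value=np.nan),
                rtol=rel_tol,
                atol=abs_tol,
                equal_nan=True,
            )
        else:
            # Otherwise compare the text form of each value.
            same = (left.astype(str) == right.astype(str)).to_numpy()
        changed_rows = np.flatnonzero(~same).tolist()
        if changed_rows:
            report["changed"][col] = changed_rows
    return report


before = pd.DataFrame({"id": [1, 2, 3], "gpa": [3.5, 2.0, 4.0], "name": ["a", "b", "c"]})
after = pd.DataFrame({"id": ["1", "2", "3"], "gpa": [3.5000000001, 2.0, 3.0], "name": ["a", "b", "x"]})
report = compare_columns(before, after)
```

In this example the `id` column changes type from int to string and one `gpa` value drifts by 1e-10, and neither is reported. Only row 2 of `gpa` and row 2 of `name` end up in `report["changed"]`.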

Let me know what you think!