capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets
https://capitalone.github.io/DataProfiler
Apache License 2.0
1.42k stars 158 forks source link

Make _assimilate_histogram() not use self #1071

Closed junholee6a closed 8 months ago

junholee6a commented 9 months ago

NOTE: This is an alternative of PR https://github.com/capitalone/DataProfiler/pull/1073. If this is merged, then close PR https://github.com/capitalone/DataProfiler/pull/1073

Issue: https://github.com/capitalone/DataProfiler/issues/820

This is a necessary step to resolving issue https://github.com/capitalone/DataProfiler/issues/820. Previously, _assimilate_histogram() called self to decide whether the given histogram contained integers or floats, and rounded the bins for histograms that only contained integers.

_assimilate_histogram() is called in exactly two places: _regenerate_histogram() and _add_helper_merge_profile_histograms(). To remove the dependency on self, we can move the code for rounding bin edges to _regenerate_histograms() and add an argument indicating whether the given histogram contains integers or floats.

While _regenerate_histogram() should behave exactly the same as before, the histogram loss calculation in _add_helper_merge_profile_histograms() may change slightly since we no longer round its histogram bins.

taylorfturner commented 8 months ago

Taking a gander at these today -- thanks for the patience, @junholee6a!

taylorfturner commented 8 months ago

@junholee6a just one comment on this from @ksneab7. We opted for this PR over the alternative (closed now). Let us know if you have any questions.

We are trying to release 0.10.8 soon and I'd like this to be part of it -- so the sooner the better on this PR. Thanks!

taylorfturner commented 8 months ago

Thanks @junholee6a -- merged