capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets
https://capitalone.github.io/DataProfiler
Apache License 2.0
1.42k stars 158 forks

Make _assimilate_histogram() not use self (alternative) #1073

Closed junholee6a closed 8 months ago

junholee6a commented 9 months ago

NOTE: This is an alternative to PR https://github.com/capitalone/DataProfiler/pull/1071. If this one is merged, close PR https://github.com/capitalone/DataProfiler/pull/1071.

Issue: https://github.com/capitalone/DataProfiler/issues/820

This is a necessary step toward resolving issue https://github.com/capitalone/DataProfiler/issues/820. Previously, _assimilate_histogram() relied on self to decide whether the given histogram contained integers or floats, and rounded the bins for histograms that contained only integers.
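For readers unfamiliar with the idea, here is a minimal sketch of a histogram assimilation step that depends only on its arguments and not on instance state. It is not DataProfiler's implementation: the function name, signature, and the proportional-overlap rule are assumptions chosen for illustration, and zero-width source bins are simply skipped.

```python
def assimilate_histogram_sketch(from_counts, from_edges, dest_edges):
    """Redistribute counts from one histogram onto new bin edges.

    Hypothetical stand-in, not the library's code. Each source bin's count
    is split across destination bins in proportion to interval overlap.
    Nothing here reads from `self`, and no bin edges are rounded.
    """
    dest_counts = [0.0] * (len(dest_edges) - 1)
    for i, count in enumerate(from_counts):
        lo, hi = from_edges[i], from_edges[i + 1]
        width = hi - lo
        if width <= 0:
            continue  # assumption: ignore degenerate (zero-width) bins
        for j in range(len(dest_counts)):
            # Length of the overlap between [lo, hi] and destination bin j
            overlap = min(hi, dest_edges[j + 1]) - max(lo, dest_edges[j])
            if overlap > 0:
                dest_counts[j] += count * overlap / width
    return dest_counts


# Two source bins, [0, 2] and [2, 4], spread over four unit-width bins
print(assimilate_histogram_sketch([4, 2], [0, 2, 4], [0, 1, 2, 3, 4]))
# prints [2.0, 2.0, 1.0, 1.0]
```

Because the function takes everything it needs as parameters, it could be written as a static method or module-level helper, which is the property this PR is after.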

However, that rounding seems unnecessary. This PR removes the rounding code entirely and updates the one test that then fails, TestTextColumnProfiler.test_profile(). To confirm the test is still valid, here are its values:

The data in the profile of that test is:

data

The old expected histogram is:

old_histogram

And the new expected histogram is:

new_histogram

taylorfturner commented 8 months ago

Taking a gander at these today -- thanks for the patience, @junholee6a!

taylorfturner commented 8 months ago

Sticking with #1071 for this change. Closing #1073.