Distributions: unequal bin width

biolab / orange3

🍊 :bar_chart: :bulb: Orange: Interactive data analysis

https://orangedatamining.com

Distributions: unequal bin width #5131

Closed ajdapretnar closed 3 years ago

ajdapretnar commented 3 years ago

[ ] What's wrong?
In some cases, the Distributions widget shows bins of unequal width, which makes the histogram confusing. I would expect all bins (bars) to be of equal width.

[ ] How can we reproduce the problem?

File (housing) - Distributions. Select the RAD column with bin width at minimum. It seems to happen when there are a lot of integer-like floats and only some decimal data, e.g. [20.0, 20.0, 20.5, 21.0, 21.0, 21.0, 21.6, 24.0, 24.0, 24.0].
[ ] What's your environment?
Operating system: OSX High Sierra
Orange version: 3.28.dev
How you installed Orange: conda/pip

janezd commented 3 years ago

It is not entirely clear what this visualization should look like.

The variable has values 1, 2, 3, 4, 5, 6, 7, 8, 24. For variables with so few distinct values, the widget can also assign one bin to each value. But what is the bin width in this case? (Note that the x axis is not categorical.)

janezd commented 3 years ago

Or, in general, consider a variable whose distinct values are [1.5, 1.8, 2, 2.34, 10]. With one value per bin, what is the expected bin width?

I would tend to say this works as expected, but with unexpected results. We can add an information icon, explaining that each bin represents one unique value.
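To make the question concrete, here is a minimal sketch of the gaps between neighbouring distinct values for the example above. It assumes that a one-value-per-bin layout stretches each bar to its neighbour, which the thread suggests but does not show in code; `gaps_between_values` is a hypothetical helper, not part of Orange.

```python
def gaps_between_values(values):
    """Return the distances between consecutive distinct values."""
    distinct = sorted(set(values))
    return [b - a for a, b in zip(distinct, distinct[1:])]

gaps = gaps_between_values([1.5, 1.8, 2, 2.34, 10])
# Roughly 0.3, 0.2, 0.34 and 7.66: no single "natural" bin width exists.
print([round(g, 2) for g in gaps])
```

The last gap is more than twenty times larger than the others, which is why bars drawn this way look so uneven.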

ajdapretnar commented 3 years ago

This doesn't happen only for single value per bin. I have a dataset with 56108 instances. The default visualization for a certain variable creates the following bins:

(, 48.67) (5383 instances)
[48.67, 49) (143 instances)
[49, 50) (7426 instances)
[50, 50.33) (246 instances)
[50.33, 50.5) (75 instances)
[50.5, 50.67) (2723 instances)
[50.67, 51) (70 instances)
[51, 52) (18411 instances)
[52,) (3979 instances)

I would expect the following default bins: (, 49), [49, 50), [50, 51), [51, 52), [52,)
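As an illustration, the counts reported above can be merged into these integer-edged bins. This is a minimal sketch that assigns each reported bin by its lower edge, assuming every reported bin lies entirely inside one of the expected bins; `merge_into_integer_bins` is a hypothetical helper, not an Orange function.

```python
from bisect import bisect_right

# (lower edge, count) for each reported bin; None marks the open lower end.
reported = [
    (None, 5383), (48.67, 143), (49, 7426), (50, 246), (50.33, 75),
    (50.5, 2723), (50.67, 70), (51, 18411), (52, 3979),
]

def merge_into_integer_bins(bins, edges):
    """Sum counts of fine bins into the coarser bins delimited by `edges`."""
    merged = [0] * (len(edges) + 1)
    for lower, count in bins:
        # An open-ended first bin belongs to the leftmost coarse bin.
        index = 0 if lower is None else bisect_right(edges, lower)
        merged[index] += count
    return merged

merged = merge_into_integer_bins(reported, [49, 50, 51, 52])
print(merged)  # [5526, 7426, 3114, 18411, 3979]
```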

ajdapretnar commented 3 years ago

Alternatively, I would expect the bin not to stretch more than the other bins. If a bin represents a single value, then its width should be the same as other bins. For RAD, the first seven bins should be a single large bin, or the final two bins should be two narrow bins with empty space in between. No?

janezd commented 3 years ago

> This doesn't happen only for single value per bin.

If bins' boundaries are not round decimal numbers, I guess they must represent single values. Don't you by chance have just 9 distinct values in your data? (We're not talking about single instances but about single values, right?)

> If a bin represents a single value, then its width should be the same as other bins.

If a bin represents a single value, then all bins represent single values and thus have various widths. In this particular case, all widths except the last were 1. But here are 17 instances from heart disease with 9 distinct values. Bar widths are 1, 2, 3 or 4.

All we can do is make the widths of all bins equal to the smallest distance between two values (which is currently shown as the narrowest bin).

I've done so in #5139. Please report how this looks on your data.
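The idea can be sketched as follows. This is not the code from the PR; centring each bar on its value is an assumption made for illustration, and `uniform_bars` is a hypothetical helper.

```python
def uniform_bars(values):
    """Give every distinct value a bar as wide as the smallest gap."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    # The smallest distance between neighbouring distinct values.
    width = min(b - a for a, b in zip(distinct, distinct[1:]))
    # Assumption: each bar is centred on the value it represents.
    return [(v - width / 2, v + width / 2) for v in distinct]

# Distinct RAD values mentioned earlier in this thread.
rad_bars = uniform_bars([1, 2, 3, 4, 5, 6, 7, 8, 24])
print(rad_bars[0], rad_bars[-1])  # (0.5, 1.5) (23.5, 24.5)
```

With this layout the bar for 24 is as narrow as the others, and the empty space between 8 and 24 stays visible.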

ajdapretnar commented 3 years ago

> We're not talking about single instances but about single values, right?

You're right, I confused the two.

I'll check the PR.