biolab / orange3

🍊 📊 💡 Orange: Interactive data analysis
https://orangedatamining.com

Distributions: unequal bin width #5131

Closed ajdapretnar closed 3 years ago

ajdapretnar commented 3 years ago

[Screenshot: Screen Shot 2020-12-16 at 10 02 49]

janezd commented 3 years ago

It is not entirely clear what this visualization should look like.

The variable has values 1, 2, 3, 4, 5, 6, 7, 8, 24. For variables with so few distinct values, the widget can also have one bin for each value. But what is the bin width in this case? (Note that the x axis is not categorical.)

janezd commented 3 years ago

Or, in general, consider a variable whose distinct values are [1.5, 1.8, 2, 2.34, 10]. With one value per bin, what is the expected bin width?
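
To make the question concrete, here is a small sketch (illustration only, not code from the widget) that computes the gaps between the distinct values above. The gaps are roughly 0.3, 0.2, 0.34 and 7.66, so no single width follows naturally from the data.

```python
import numpy as np

# Distinct values from the example above.
values = np.array([1.5, 1.8, 2, 2.34, 10])

# Distances between neighbouring distinct values (np.unique also sorts).
gaps = np.diff(np.unique(values))

print("gaps:", gaps)
print("smallest gap:", gaps.min(), "largest gap:", gaps.max())
```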

I would tend to say this works as expected, but with unexpected results. We can add an information icon, explaining that each bin represents one unique value.

ajdapretnar commented 3 years ago

This doesn't happen only for single value per bin. I have a dataset with 56108 instances. The default visualization for a certain variable creates the following bins:

I would expect the following default bins: (, 49), [49, 50), [50, 51), [51, 52), [52,)
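
As a rough sketch of what those bins would mean, the snippet below counts values in the intervals (, 49), [49, 50), [50, 51), [51, 52), [52,) with NumPy. The sample data is made up for illustration and the approach is an assumption, not how the Distributions widget is implemented.

```python
import numpy as np

# Hypothetical sample data, invented for this example.
values = np.array([47.2, 48.9, 49.0, 49.5, 50.1, 51.0, 51.9, 52.0, 60.3])

# Inner boundaries of the bins; the first and last bins are open-ended.
edges = np.array([49, 50, 51, 52])

# np.digitize returns i such that edges[i-1] <= x < edges[i]:
# 0 for x < 49, 4 for x >= 52.
bin_index = np.digitize(values, edges)

# Number of values per bin, including empty bins.
counts = np.bincount(bin_index, minlength=len(edges) + 1)

labels = ["(, 49)", "[49, 50)", "[50, 51)", "[51, 52)", "[52,)"]
for label, count in zip(labels, counts):
    print(label, count)
```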

ajdapretnar commented 3 years ago

Alternatively, I would expect the bin not to stretch more than the other bins. If a bin represents a single value, then its width should be the same as other bins. For RAD, the first seven bins should be a single large bin, or the final two bins should be two narrow bins with empty space in between. No?

janezd commented 3 years ago

> This doesn't happen only for single value per bin.

If the bins' boundaries are not round decimal numbers, I guess they must represent single values. Don't you by chance have just 9 distinct values in your data? (We're not talking about single instances but about single values, right?)

> If a bin represents a single value, then its width should be the same as other bins.

If a bin represents a single value, then all bins represent single values and thus have various widths. In this particular case, all widths except the last were 1. But here are 17 instances from heart disease with 9 distinct values. Bar widths are 1, 2, 3 or 4.

[Screenshot: Screenshot 2020-12-18 at 17 13 05]

All we can do is let the widths of all bins equal the smallest distance between two values (what is currently shown as the narrowest bin).

I've done so in #5139. Please report how this looks on your data.
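
For readers following along, the idea of giving every bar the smallest gap as its width can be sketched as below. This is a minimal illustration, not the code from the PR; the function name `uniform_bars` and the fallback width of 1.0 for a single distinct value are assumptions.

```python
import numpy as np

def uniform_bars(values):
    """Return (centers, width): one bar per distinct value,
    all bars as wide as the smallest gap between two distinct values."""
    centers = np.unique(values)  # sorted distinct values
    if len(centers) < 2:
        # Only one distinct value: no gap exists, so pick an arbitrary width
        # (assumption made for this sketch).
        return centers, 1.0
    width = float(np.min(np.diff(centers)))
    return centers, width

# Values from the first example in this thread (repeats are fine).
centers, width = uniform_bars([1, 2, 3, 4, 5, 6, 7, 8, 24, 24, 1])
print(centers, width)

# Each bar could then be drawn from center - width / 2 to center + width / 2.
left_edges = centers - width / 2
right_edges = centers + width / 2
```

With these values every bar has width 1, so the bar at 24 is as narrow as the others and is separated from the bar at 8 by empty space.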

ajdapretnar commented 3 years ago

> We're not talking about single instances but about single values, right?

You're right, I confused the two.

I'll check the PR.