biolab / orange3

🍊 📊 💡 Orange: Interactive data analysis
https://orangedatamining.com

OWCorrelations: fixes and enhancements #3504

Closed ajdapretnar closed 5 years ago

ajdapretnar commented 5 years ago

Fixes:

Enhancements:

janezd commented 5 years ago

Split attribute pairs into two columns (in view) and elide as in Data Table.

In Vizranks, we went in the opposite direction - merging multiple columns into one. Two columns don't function well because these are not ordered pairs. Particularly not in correlations (in scatter plot it was at least x and y). You don't get any additional functionality by having two columns (sorting by column does not help you find occurrences of some attribute, for instance), and it's just worse visually.
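To make the "not ordered pairs" point concrete, here is a minimal sketch using only the standard library. The attribute names and both helper functions are made up for illustration and are not part of Orange. Each unordered pair is generated once, so a given attribute ends up in the first position for some pairs and in the second for others, which is why sorting on a single column cannot gather all of its occurrences.

```python
from itertools import combinations


def unordered_pairs(names):
    """Return every unordered pair of names exactly once."""
    return list(combinations(names, 2))


def column_counts(pairs, name):
    """Count how often `name` lands in the first and in the second position."""
    first = sum(1 for a, _ in pairs if a == name)
    second = sum(1 for _, b in pairs if b == name)
    return first, second


# Hypothetical attribute names, chosen only for the example.
names = ["a", "b", "c", "d"]
pairs = unordered_pairs(names)

if __name__ == "__main__":
    for name in names:
        print(name, column_counts(pairs, name))
```

Which position an attribute takes depends only on the order in which the attributes were enumerated, not on anything meaningful in the data.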

Add p-values to Correlations table output.

P-values in this context are useful as an absolute, "normalized" correlation. But I'm rather hesitant about this feature because I fear that people would treat these p-values as significance (in terms of null-hypothesis testing), ignoring that they are formulating hypotheses from data. I would be however OK with computing p-values but presenting them as "normalized correlations" or something similar ("normalized" can be ambiguous). The normalization procedure would be described in documentation and tooltip.

ajdapretnar commented 5 years ago

You don't get any additional functionality by having two columns

We have discussed this with @lanzagar, who also uses this widget, and while I too thought at first that it was worse visually, I am now a believer. The idea is that you can easily read two names of different length. Try using a data set with names of different length and you will see how annoying this is. Also, it would be nice to be able to filter by variable, get that variable in column 1, and see the paired attributes in column 2.
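A rough sketch of the filtering idea, assuming the widget holds its results as `(attribute, attribute, correlation)` tuples. That layout, the function name `pairs_with`, and the sample values are assumptions for illustration, not the widget's actual code or real results.

```python
def pairs_with(variable, scored_pairs):
    """Keep only pairs containing `variable`, with it always in column 1.

    `scored_pairs` is an iterable of (name_a, name_b, correlation) tuples.
    Rows are ordered by absolute correlation, strongest first.
    """
    rows = []
    for a, b, r in scored_pairs:
        if variable == a:
            rows.append((a, b, r))
        elif variable == b:
            # Swap so the requested variable is always in the first column.
            rows.append((b, a, r))
    rows.sort(key=lambda row: abs(row[2]), reverse=True)
    return rows


if __name__ == "__main__":
    # Made-up names and values, purely illustrative.
    scored = [("a", "b", 0.2), ("c", "a", -0.9), ("b", "c", 0.5)]
    for row in pairs_with("a", scored):
        print(row)
```

With the chosen variable pinned to column 1, the second column becomes a plain list of its partners, which is the part that is hard to scan when both names share one cell.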

I would be however OK with computing p-values

For now, we would have p-values only in the Correlations table output, not visually in the widget. I got a request by a user as to how on Earth do we not have p-value reported for correlations, hence this enhancement. We can easily add this to the docs, but I am not sure about the tooltip if we keep the p-values in the output only (where would this tooltip be?).

janezd commented 5 years ago

Try using a data set with names of different length and you will see how annoying this is.

I'm not sold on this one. If names are of very different lengths, two columns won't help much. But the change is trivial, so we can try and see.

Filtering is already possible by just typing the name.

I got a request by a user as to how on Earth do we not have p-value reported for correlations

Have you asked him why on Earth would he like to see them? :) I suspect he wants to have an automated way to look for "significant correlations", which is exactly why we should not have p-values.

I could write an essay about why we shouldn't compute p-values here, but in summary: correlations are already on the scale from -1 to 1, so no scaling/normalization is needed. P-values are just a very crappy way to take the number of observations into account, which is wrong in many different ways. The proper way to do this is to use Bayesian estimation of correlations, but it uses Monte Carlo and may be too slow for a large number of pairs.
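To illustrate how a p-value folds the number of observations into the correlation, here is a small sketch. It uses the Fisher z approximation for testing a zero correlation, picked here only because it needs nothing beyond the standard library. It is an assumption for illustration, not what the widget computes, and an exact test based on the t distribution would give somewhat different numbers. The function name `approx_p_value` is hypothetical.

```python
import math


def approx_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: correlation = 0.

    Uses the Fisher transformation: atanh(r) * sqrt(n - 3) is treated as
    a standard normal variate under the null hypothesis.
    """
    if n < 4:
        raise ValueError("need at least 4 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal beyond |z|.
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Same correlation, different sample sizes.
    for n in (10, 100, 1000):
        print(n, approx_p_value(0.3, n))
```

The correlation stays the same in every row of the loop and only the sample size changes, so the resulting p-value is a function of both numbers rather than a rescaled correlation.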

lanzagar commented 5 years ago

The two-column idea was actually @BlazZupan's. We were looking at some data and he mentioned it would be better to have two columns, and I think it is at least worth exploring (it is hard to make a final judgment before seeing it). I definitely agree that the current state has shortcomings: try looking at correlations in the wine dataset and, for example, filter by OD280 (a long, awkward name). It is very hard to scan the list and read or find the name of the other variable.

I realize the implications: either this widget becomes different from all other vizrank GUIs, or we change them as well. Maybe there is an argument that correlations differ from most vizrank lists. I think nobody usually searches the vizrank results for a specific pair (people just click on the top few and check the results in the main widget), while specific variable correlations are often of interest, and you search for them even if they are low in the list ("are these two variables correlated?" "no, their correlation is quite low").

lanzagar commented 5 years ago

Regarding the p-values: I am also skeptical about promoting them too much. I think they are not interesting or informative in a lot of cases, so it is hard to find a good place for them. I managed to persuade @ajdapretnar not to include them in the widget, but just to add an extra column with them in the table output. I think that is a good compromise: most people will not see them, but they can be found and checked if someone really wants to.