biolab / orange3

🍊 📊 💡 Orange: Interactive data analysis
https://orangedatamining.com

Regression Line in Scatter Plot #5542

Closed bluewin4 closed 3 years ago

bluewin4 commented 3 years ago

What's your use case?

The regression line is cursed. As a user, I read it as implying something like a Pearson regression and the associated coefficient. When I realized the given value was just the slope of the line it was weird, which was compounded by the fact that when it goes negative the whole thing flips over.

image

What's your proposed solution?

If enabling "Show regression line" also gave the associated equation and the information from the fit, it would help greatly. Heck, if you could switch between different forms of regression directly in the scatter plot GUI, that would be really comfy.

Are there any alternative solutions?

Unsure, y'all could probably figure something elegant out. Just knew it was bugging me.

markotoplak commented 3 years ago

@bluewin4, flipping over. :D

Why did you conclude it is a slope? For me, it was always between -1 and 1, so it is definitely not a simple slope.

For me, it would also make sense if r was some regression coefficient. How did you conclude it is not?

markotoplak commented 3 years ago

By the way, do axes always look like this for you? Did you perhaps switch monitors while Orange was running? Or are you using multiple monitors?

markotoplak commented 3 years ago

I checked the code. The r comes from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html. This should be, in my understanding, equal to the Pearson correlation coefficient.

bluewin4 commented 3 years ago

I guessed it was a normalized slope coefficient because it was labeled r, and the line seems to change its slope with the regression value instead of with whatever slope the linear fit would have. Did that make sense? I'd expect a fitting coefficient to be independent of the slope of the line being used for the fit, but here it just draws a line that has the same slope as the regression coefficient.

If it's a regression coefficient then I need to know what sort because, as it stands, when I use other programs to calculate a standard Pearson regression it is at best 0.83, closer to 0.819 when I adjust. But here it gives me a value of -0.91, which is way off. Without information about the RMSE and other aspects of the regression, such as the equation used, what's the point beyond cluttering my interface?

Yeah, I'm using a second monitor. There are already a couple of issues open on it though, so I didn't want to bother people about it.

Thanks so much for the help, I really appreciate it :)

markotoplak commented 3 years ago

Look, with (I tried reading the numbers from the graph)

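import scipy.stats

# Values read off the scatter plot by eye, so they are approximate
x = [34.1, 36.7, 39, 40.3, 41.3, 42.5, 43.8, 45.6, 46.2, 47, 48.7, 50.1, 55.8]
y = [10.7, 8.2, 5.3, 3.4, -1.5, -6, -10, -11.5, -16, -15.5, -14, -14.5, -16]

# Least-squares linear fit; result.rvalue is the correlation coefficient
result = scipy.stats.linregress(x, y)
print(result)

# Pearson correlation coefficient computed directly, for comparison
print(scipy.stats.pearsonr(x, y))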

I get

LinregressResult(slope=-1.5401906522489612, intercept=61.70791396149089, rvalue=-0.9091123848442599, pvalue=1.6685129923709016e-05, stderr=0.2127789531697855, intercept_stderr=9.424433768419346)
(-0.9091123848442602, 1.6685129923708795e-05)

Which means that the number perfectly matches https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
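To make the difference between the slope and r concrete, here is a minimal sketch using the same approximate values. It relies on the standard identity for simple linear regression, slope = r * std(y) / std(x). The variable names (`fit`, `slope_from_r`) are only illustrative and are not taken from Orange's code.

```python
import numpy as np
import scipy.stats

# Same approximate values as above (read off the plot by eye)
x = np.array([34.1, 36.7, 39, 40.3, 41.3, 42.5, 43.8, 45.6, 46.2, 47, 48.7, 50.1, 55.8])
y = np.array([10.7, 8.2, 5.3, 3.4, -1.5, -6, -10, -11.5, -16, -15.5, -14, -14.5, -16])

fit = scipy.stats.linregress(x, y)

# For a simple linear fit, the slope is r rescaled by the spread of y over the spread of x
slope_from_r = fit.rvalue * np.std(y, ddof=1) / np.std(x, ddof=1)

print(f"slope of the fitted line: {fit.slope:.4f}")
print(f"r (correlation):          {fit.rvalue:.4f}")
print(f"r * std(y) / std(x):      {slope_from_r:.4f}")
```

So r always has the same sign as the slope and stays within [-1, 1], but the two only coincide numerically when x and y have the same standard deviation.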

bluewin4 commented 3 years ago

Strange, when I run this in Matlab I get a very different value, though that's probably cross-platform weirdness. Still, it's nice to know it's a Pearson coefficient now; thanks for going through all this work on my account.

I guess when I see something that says regression line, I assume the line shown is the linear fit instead of a representation of the Pearson coefficient; I think that's what had me all confuzzled. Or, wait, is that just a weird math artifact? Would it be particularly difficult to pull up the rest of the fitting information? I think it would be useful as heck, and at the least it would give the scatter plot tool a lot of added utility and ease of understanding. Of course, if I'm wishing for things, the ability to change the fitting algorithm so it's not always linear would be aces.

Again, thanks for your kind attention 😄

markotoplak commented 3 years ago

Could you please show what you get in Matlab with the same values as I posted?

About different fits: check https://github.com/biolab/orange3/pull/5481 .

bluewin4 commented 3 years ago

image

Here's the resulting fit for this data using linear regression of mx+b form. Is the difference that this is using an R-squared value? Why do you not use one for Orange?

Oh, that's cool! Hopefully, I can do my thermodynamic fits in there and really make a whole workflow for thermodynamic characterization using FTIR in Orange.

markotoplak commented 3 years ago

Ah, I see that your fit reports R-squared; Orange instead reports R.

So, 0.8189**0.5 = 0.905. The same in magnitude (the sign follows the negative slope). :)
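The relation between the two numbers can be checked with a short sketch. It reuses the approximate values posted earlier in this thread, so the R-squared it prints will not exactly equal the 0.8189 from the Matlab fit on the real data; the variable names are illustrative only.

```python
import scipy.stats

# Approximate values read off the scatter plot earlier in the thread
x = [34.1, 36.7, 39, 40.3, 41.3, 42.5, 43.8, 45.6, 46.2, 47, 48.7, 50.1, 55.8]
y = [10.7, 8.2, 5.3, 3.4, -1.5, -6, -10, -11.5, -16, -15.5, -14, -14.5, -16]

fit = scipy.stats.linregress(x, y)

r = fit.rvalue        # Pearson correlation coefficient, signed
r_squared = r ** 2    # coefficient of determination of the linear fit, never negative

print(f"r   = {r:.4f}")
print(f"R^2 = {r_squared:.4f}")

# Going back from R^2 to r loses the sign; it has to be taken from the slope
sign = -1.0 if fit.slope < 0 else 1.0
print(f"sign(slope) * sqrt(R^2) = {sign * r_squared ** 0.5:.4f}")

# The R-squared value reported by the Matlab fit in this thread
print(f"sqrt(0.8189) = {0.8189 ** 0.5:.3f}")
```

Squaring discards the sign, which is why a tool that reports R-squared never shows a negative value while r is negative for a downward-sloping fit.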

janezd commented 3 years ago

We need to check and improve documentation.