JuliaPlots / StatsPlots.jl

Statistical plotting recipes for Plots.jl

qqnorm not showing the highest datapoint & qqplot not showing 2nd-to-highest datapoint #461

Open HiramTheHero opened 2 years ago

HiramTheHero commented 2 years ago

Version information

julia version 1.6.2
StatsPlots v0.14.26
Plots v1.19.3

Hey everyone. I was messing around with qqnorm and noticed that it was neglecting to plot the highest point in the dataset I sent it. The dataset I'm using is located here.

  1. I loaded in the data.

     using CSV, DataFrames, StatsPlots
     data = CSV.File(filePath) |> DataFrame

  2. Filtered the data into two datasets by gender.

     HHSgirls = subset(data, :Gender => ByRow(==("Female")), skipmissing=true)
     HHSboys = subset(data, :Gender => ByRow(==("Male")), skipmissing=true)

  3. Ensured that the Reaction_time column contained no missing values.

     HHSgirlsClean = dropmissing(HHSgirls, :Reaction_time);
     HHSboysClean = dropmissing(HHSboys, :Reaction_time);

  4. Then I put the data into qqnorm.

     qqnorm(HHSgirlsClean[!,:Reaction_time], yaxis="Female Reaction Time", qqline = :fit)

     I get the following plot.

     [image: file]

The issue is that the plot omits the highest value of the column I passed to qqnorm(), which is 46.

maximum(HHSgirlsClean[!,:Reaction_time])
46.0

If I extend the y-axis (and x-axis to be safe) limits to include where the point should be, the point is still missing.

qqnorm(HHSgirlsClean[!,:Reaction_time], yaxis="Female Reaction Time", ylims=(-5,50), xlims=(0,15), qqline = :fit)

Plot from the code directly above.

[image: file2]

The same thing happens with the boy dataset.

qqnorm(HHSboysClean[!,:Reaction_time], label="Male Reaction Time", qqline = :fit)

[image: file3]

maximum(HHSboysClean[!,:Reaction_time])
1000.0

Graph with the extended axes.

[image: file4]

Just a note about the above: ignore the titles on the graphs. I forgot to remove them.

However, interestingly enough, if I try the same process with the qqplot function, the second-highest point is missing from the plot.

sort(HHSgirlsClean[!,:Reaction_time])

Output in Julia REPL

0.0489
0.139
0.142
0.148
0.23
0.25
0.261
0.27
⋮
3.0
3.0
4.2
5.0
7.129
10.0
30.0
46.0

Just to clarify, 30 is the second-highest point.

Setting up a sample from a standard normal distribution to pass to qqplot.

normDist = rand(Normal(), 100)

Plotting

qqplot(normDist, HHSgirlsClean[!,:Reaction_time], qqline = :fit)

Result

[image: file]

Note that 30 is not included in the graph.

Same with the boy dataset...

sort(HHSboysClean[!,:Reaction_time])

Output in Julia REPL

0.0417
0.06
0.084
0.1
0.1999
0.202
0.212
0.223
⋮
1.2
3.0
5.0
6.0
6.0
6.7
404.0
1000.0

To clarify, 404 is the second-highest point.

Plotting

qqplot(normDist, HHSboysClean[!,:Reaction_time], qqline = :fit)

Result

[image: file2]

Notice that point 404 is missing from the graph.

I'm a bit worried that I may be doing something wrong, so please let me know if this is a user error on my end. Also, I am using Visual Studio Code. I'm not sure whether that could cause issues.

sethaxen commented 2 years ago

Thanks for the issue. A minimal working example that shows this behavior is

qqnorm([1,2,3])

[image: tmp]

However, note that this example is equivalent to calling

using Distributions
qqpair = Distributions.qqbuild(Normal(), [1,2,3])
plot(qqpair)

Note that qqpair just wraps the x and y values of the desired points, which we plot. So I would advise opening an issue on Distributions.jl.
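One way to confirm that the points go missing before any plotting happens is to look at the wrapped values directly. The sketch below assumes that the object returned by `Distributions.qqbuild` exposes its coordinates as the fields `qx` and `qy`. The exact values and the number of points returned depend on the installed Distributions version, so no expected output is shown.

```julia
using Distributions

data = [1.0, 2.0, 3.0]

# Build the same pair of quantile vectors that the plotting recipe receives.
qqpair = Distributions.qqbuild(Normal(), data)

# qx holds the theoretical quantiles, qy the sample quantiles (assumed field names).
println("qx: ", qqpair.qx)
println("qy: ", qqpair.qy)

# If fewer points come back than there are observations, the missing point
# was dropped here rather than by the plotting code.
println("observations: ", length(data), ", points returned: ", length(qqpair.qy))
```

Comparing the number of points returned with `length(data)` makes it easy to tell whether the plotting layer or the quantile construction is responsible.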

scheidan commented 1 year ago

This is now solved with Distributions v0.25.89. I'm not sure whether it is worth raising the lower bound for Distributions.
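To check whether a local environment already has the fixed release, the installed version can be compared against v0.25.89. This is only a sketch: `has_qq_fix` is a hypothetical helper name introduced here, and the threshold is taken from the comment above rather than verified independently.

```julia
using Pkg

# Hypothetical helper: true when a version is at least the release reported as fixed.
has_qq_fix(v::VersionNumber) = v >= v"0.25.89"

# Look up Distributions in the active environment, if it is installed there.
for (_, info) in Pkg.dependencies()
    if info.name == "Distributions" && info.version !== nothing
        println("Distributions ", info.version, " includes the fix: ", has_qq_fix(info.version))
    end
end
```

If the check prints `false`, updating Distributions in the active environment should pick up the corrected behavior.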