IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Improve the R code in parallel coordinates plot and correct the principal components dialog. #8985

Closed rdstern closed 2 weeks ago

rdstern commented 1 month ago

@Vitalis95 I would like to improve the R code that we generate and use in the Describe > Graphs > Parallel Coordinates Plot.

There is also a serious bug in the R code generated by the principal components dialog, which I hope can be corrected. The commands seem to be duplicated and don't work.

The bug is quite urgent: we can't release a new version until it is fixed. At the same time, the multivariate dialogs (which use the FactoMineR package) have a similar feature to the parallel coordinates plot function. Hence my suggestion below may also help the R code for those dialogs.

I use the decathlon data from FactoMineR, to which I have added a special factor column. Here are the data:

decathlon3.zip

The problem with the R code we currently generate is that the GGally::ggparcoord function prefers to use the numbers of the columns in the data frame, rather than their names. (I think @Vitalis95 already found this to be a problem in the multivariate dialogs, and he may have a better solution than my suggestion below.)

@lilyclements could you please check my suggestion and improve the R code towards the bottom if you can? Here is the current R code for an example I produced here:

image

# Dialog: Parallel Coordinate Plot

decathlon <- data_book$get_data_frame(data_name="decathlon")

last_graph <- GGally::ggparcoord(data=decathlon, columns=c("X100m","Long.jump","Shot.put","High.jump","X400m","X110m.hurdle","Discus","Pole.vault","Javeline","X1500m"),
 groupColumn="rank3", scale="centerObs", missing="exclude", order="skewness") + theme_grey()

 data_book$add_object(data_name="decathlon", object_name="last_graph", object_type_label="graph", object_format="image", object=check_graph(graph_object=last_graph))
data_book$get_object_data(data_name="decathlon", object_name="last_graph", as_file=TRUE)
rm(list=c("last_graph", "decathlon"))

The problem with the code is the order="skewness" argument. This presents the columns (on the x-axis) in order of their skewness, whereas I would like them in the order I gave in the dialog. I think what I want is the default when the columns argument is numeric, here c(1:10). So replacing the main lines by:

last_graph <- GGally::ggparcoord(data=decathlon, columns=c(1:10),
 groupColumn="rank3", scale="centerObs", centerObsID = 1, missing="exclude", order=c(1:10)) + theme_grey() 

gives the graph I would like.

Here is "my" solution, of course using Stack Overflow:

decathlon <- data_book$get_data_frame(data_name="decathlon")
column_numbers <- match(c("X100m","Long.jump","Shot.put","High.jump","X400m","X110m.hurdle","Discus","Pole.vault","Javeline","X1500m"),names(decathlon))

last_graph <- GGally::ggparcoord(data=decathlon, columns=column_numbers,
 groupColumn="rank3", scale="Std", centerObsID = 1, missing="exclude", order=column_numbers) + theme_grey()

 data_book$add_object(data_name="decathlon", object_name="last_graph", object_type_label="graph", object_format="image", object=check_graph(graph_object=last_graph))
data_book$get_object_data(data_name="decathlon", object_name="last_graph", as_file=TRUE)
rm(list=c("last_graph", "decathlon"))

@Vitalis95 If Lily is OK with the code, could you please make the changes? The main change is the line with the match command, which gives the variable holding the required column numbers. Note also:

1) I have added centerObsID = 1 to the command. Could you include this too? It is ignored except for one scale option, but does no harm. (I don't want to complicate the dialog, and this will be an easy tweak when needed.)
2) I presume we should include the column_numbers variable in what is removed in the last line?
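To make the match step concrete, here is a minimal base R sketch. The toy data frame and its values are made up for illustration and only stand in for decathlon. The NA check is an extra safeguard I am assuming would be useful; it is not part of the code proposed above.

```r
# Toy data frame standing in for decathlon (hypothetical values).
toy <- data.frame(
  X100m     = c(11.0, 10.8),
  Long.jump = c(7.6, 7.3),
  Shot.put  = c(14.8, 14.3),
  rank3     = factor(c("top", "other"))
)

# Column names in the order chosen in the dialog,
# which need not be the order in the data frame.
selected <- c("Shot.put", "X100m")

# match() returns the position of each name in names(toy),
# keeping the order of `selected`.
column_numbers <- match(selected, names(toy))

# match() returns NA for a name it cannot find, so stop early in that case.
if (anyNA(column_numbers)) {
  stop("Columns not found: ",
       paste(selected[is.na(column_numbers)], collapse = ", "))
}

print(column_numbers)
```

Here column_numbers is c(3, 1), so passing it as both columns and order would keep the dialog order rather than the data frame order.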

rdstern commented 1 month ago

@lilyclements here is a topic for you to advise on quickly when we next meet.

image

In our plot, how could I emphasise some levels of a factor, so that those lines stand out more? The examples in the guide seem to indicate one way, and it is a general change, because the initial output is just a ggplot. I hope it will take just a few minutes' play when we next meet.
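One possible approach, offered only as a sketch and not necessarily the one in the guide: since the output is a ggplot, a manual colour scale could give the chosen levels a strong colour and grey out the rest. The helper highlight_palette and the level names "low", "mid" and "top" are hypothetical. The sketch also assumes that ggparcoord maps groupColumn to the colour aesthetic.

```r
# Hypothetical helper: build a named colour vector in which the levels to
# emphasise get a strong colour and all other levels get a pale grey.
highlight_palette <- function(levels, highlight,
                              colours = c("red", "blue", "darkgreen"),
                              other = "grey80") {
  pal <- rep(other, length(levels))
  names(pal) <- levels
  hit <- levels %in% highlight
  # Recycle the strong colours over the highlighted levels.
  pal[hit] <- rep_len(colours, sum(hit))
  pal
}

# Made-up factor levels; in practice these would be levels(decathlon$rank3).
pal <- highlight_palette(c("low", "mid", "top"), highlight = "top")
print(pal)

# Assumed usage with the graph from the dialog (needs ggplot2):
# last_graph <- last_graph + ggplot2::scale_colour_manual(values = pal)
```

This keeps the dialog itself unchanged, because the emphasis is added as one extra layer on the finished graph.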

Vitalis95 commented 1 month ago

@lilyclements, the line of code for getting the column indexes suggested by Roger is simple and straightforward. I can implement it that way.