Closed rdstern closed 3 years ago
To respond to the end of this issue - I will look at this and comment by the end of the week.
b) There are two sort_... functions. sort_type applies to vis_dat and is default TRUE. Is this the one you want in, and for the default to be FALSE?
Answer: Yes, I would like the option to set it to TRUE. But the default should be not to sort, so FALSE. I found it confusing when I did nothing and it changed the order of the variables.
There is also sort_miss in the vis_miss function, which sorts the columns by missingness (missing to non-missing). This is default FALSE. Do you want this one in as well?
Answer: Yes, might as well. Here I am happy with the default of FALSE, but nice extra to have.
d) The large_data_size argument corresponds to the number of cells in a data frame.
For example, the airquality data has 153 rows and 6 variables (918 cells):
# airquality has 918 cells. If we set the large_data_size to be equal to or greater than this, then the function runs
vis_miss(airquality, large_data_size = 918)
# however, if we set the large_data_size to be less than the number of cells, there is an error.
vis_miss(airquality, large_data_size = 917)
Error in vis_miss(airquality, large_data_size = 917) :
Data exceeds recommended size for visualisation, please consider
downsampling your data, or set argument 'warn_large_data' to FALSE.
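As a rough check of the cell arithmetic above, the following sketch counts the cells in airquality and compares the count with a limit. The helper name exceeds_large_data_size is hypothetical, and the comparison only mirrors the behaviour shown in the example (918 runs, 917 errors); it is not taken from the visdat source.

```r
# Cells = rows x columns; airquality ships with base R (datasets package).
n_cells <- nrow(airquality) * ncol(airquality)

# Hypothetical helper: TRUE when the data has more cells than the limit.
# The default of 900000 is the 0.9 million default mentioned in this thread.
exceeds_large_data_size <- function(data, large_data_size = 900000) {
  nrow(data) * ncol(data) > large_data_size
}

exceeds_large_data_size(airquality, large_data_size = 918)  # limit equals cell count
exceeds_large_data_size(airquality, large_data_size = 917)  # limit below cell count
```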
The authors of the package suggest taking a sample of the data set if it is particularly large. Personally I haven't had issues with a blank plot; however, it could be worth offering a checkbox "Take Sample of Data". This can enable an input or nud where the user selects the number of rows to sample (the maximum should be the total number of rows in the data or the value that large_data_size is set at, whichever is smaller; the minimum could be 1; the default could perhaps be the maximum value).
Then in the R code, instead of passing the data frame to the x argument, the following should be run: x = sample_n(tbl = <<data>>, size = <<number of rows to sample>>)
e.g.
visdat::vis_miss(x = dplyr::sample_n(tbl = airquality, size = 50))
This can work for all three functions in this dialog (however, for vis_guess the maximum value you can input should just be the number of rows in the data set).
Do you think this is worth offering?
Answer: Good to know that the size corresponds to cells of data. So the maximum size control can perhaps say Maximum size 9 00,000 data points. And you can change the 9, presumably up.
If more is needed, then either the number of variables can be reduced, or a filter can be applied, or there could be a sample - as you say. I think I'd prefer a Sampling Fraction up down, with a default of 1. Maybe an up-down with a minimum of 0.01 and steps of 0.01 up to 1. I assume this could work easily as a command - the number of rows would just be the integer value of the fraction times the length?
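A minimal sketch of that idea, assuming the Sampling Fraction is turned into a row count before sampling. The helper name rows_for_fraction is hypothetical, and the floor at one row is an added assumption so that very small fractions still return some data.

```r
# Hypothetical helper: convert a sampling fraction into a whole number of rows.
rows_for_fraction <- function(data, fraction) {
  stopifnot(is.data.frame(data), fraction > 0, fraction <= 1)
  # as.integer() truncates towards zero; keep at least one row (assumption).
  max(1L, as.integer(fraction * nrow(data)))
}

n_rows <- rows_for_fraction(airquality, 0.5)  # airquality has 153 rows

# Only draw the plot when both packages are installed.
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("visdat", quietly = TRUE)) {
  p <- visdat::vis_miss(x = dplyr::sample_n(tbl = airquality, size = n_rows))
}
```

With the default fraction of 1 the helper returns the full row count, so no rows are dropped.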
Ok @Wycklife I think, following Lily's comment above - with my answers - we are just about ready to go on this dialogue.
There will be 4 new controls on the dialogue. Not all of them apply to each function, i.e. to each of the radio buttons at the top.
a) The dialogue will have to be longer.
b) A checkbox on the left with label Sort Variables for vis_dat and vis_miss. The default is unchecked for both. For vis_dat this will be a small change in the code, because the default now is sort_type = TRUE.
c) A drop down on the left with label Palette and the 3 options for the palette - this seems to be just for vis_dat and vis_guess, i.e. perhaps not for vis_miss.
d) An up-down on the left with the label Maximum Size before the control and another label, Million Data Points, after it. In the up-down, the default is 0.9 (which is the default for the functions). It is from a minimum of 0.1 in steps of 0.1. This is for all 3 functions.
e) Perhaps on the right of the dialogue, i.e. under the receiver: a 'Sampling Fraction' up-down, with default 1 (the upper limit), going down in steps of 0.01 to a minimum of 0.01. This is for all 3 options.
This last one will need @lilyclements or @dannyparsons to advise on the code. This should be available by the time you need it, but the option could be disabled if not.
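As a rough illustration of how controls b) and d) could map onto function arguments, the sketch below converts the control values into sort_type and large_data_size for vis_dat. The helper build_vis_dat_args and its parameter names are hypothetical; only sort_type and large_data_size come from the discussion above.

```r
# Hypothetical helper: translate dialogue control values into vis_dat arguments.
build_vis_dat_args <- function(sort_variables = FALSE, max_size_millions = 0.9) {
  stopifnot(is.logical(sort_variables), max_size_millions >= 0.1)
  list(
    sort_type = sort_variables,                  # Sort Variables checkbox
    large_data_size = max_size_millions * 1e6    # up-down is in millions of cells
  )
}

vis_args <- build_vis_dat_args()

# Only draw the plot when visdat is installed.
if (requireNamespace("visdat", quietly = TRUE)) {
  p <- do.call(visdat::vis_dat, c(list(x = airquality), vis_args))
}
```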
@rdstern to
d) The large_data_size option is only for vis_dat and vis_miss. If I am understanding correctly that this control relates to that parameter, then this option should not be present for vis_guess.
e) Is the sample fraction option a fraction of the number of rows to sample from? If so, then the x parameter in the vis_ function should equal the slice_sample function in the dplyr package. This takes two parameters: .data, which is the data set, and prop, which is the value in the up-down.
E.g. if you wish to sample 0.1 of the rows from the airquality data (for vis_miss) then the code would be:
vis_miss(x = dplyr::slice_sample(.data = airquality, prop = 0.1))
Presumably, if the nud for the Sampling Fraction is 1, then the x parameter in the vis_ functions should still be the data frame (or selected columns). This is because if we use vis_miss(x = dplyr::slice_sample(.data = <dataset>, prop = 1)) the rows are no longer in the order that they were in the data set (and it is running code that is not required).
If the nud is less than 1, the x parameter in the vis_ functions should equal the dplyr::slice_sample function.
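The rule in e) could be sketched as follows, assuming dplyr 1.0.0 or later (where slice_sample is available). The helper name sampled_input is hypothetical.

```r
# Hypothetical helper: choose what to pass as x, depending on the fraction.
sampled_input <- function(data, prop = 1) {
  stopifnot(is.data.frame(data), prop > 0, prop <= 1)
  if (prop < 1) {
    # Sample a fraction of the rows; the row order is not preserved.
    dplyr::slice_sample(.data = data, prop = prop)
  } else {
    # prop == 1: pass the data through untouched, keeping the row order.
    data
  }
}

# Only draw the plot when both packages are installed.
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("visdat", quietly = TRUE)) {
  p <- visdat::vis_miss(x = sampled_input(airquality, prop = 0.1))
}
```

Passing the data frame through unchanged when the fraction is 1 avoids both the reordering and the unnecessary call.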
This is working well now. Could we at least make a very minor change, and perhaps add some further options?
a) The very minor change is that currently on Visualising data the default is sort = TRUE, which I find confusing, and would strongly prefer sort = FALSE. (I use that in a proposed video and have to make the change in the script window.)
b) Better would be to add some of the options, now that we have a working dialogue. Could we have a checkbox called Sort Variables for this option? Default is unticked (which is sort = FALSE). I think this sort option may apply to the Missing option, but not to the Guess.
c) Perhaps, at the same time we could consider other options. In particular add Palette for the Data and Guess. It doesn't seem to be there for Missing. Perhaps it could be 3 (ordinary) radio buttons.
d) Should we also have Maximum Size? If so, then perhaps that could be an updown with 9 as the minimum and going up one at a time. It says Maximum Data Size 9 hundred-thousand bytes. I assume it is bytes?
I hope @lilyclements could look at this and comment quickly, and perhaps specify in more detail. Then (unless she wants to make the changes) I suggest @Wycklife could do this. If this will take some time, then please could @Wycklife make the change in item a.