IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Two new examples to illustrate loops in R-Instat #8683

Open rdstern opened 11 months ago

rdstern commented 11 months ago

@N-thony and @lilyclements I am still hoping that we will be able to do loops in the January upgrade and (indeed) hope to mention this to Bob Muenchen as an important new feature in the new wonderful R-Instat!

This is a long story of a very simple example of a loop that could be useful. I give the whole story, because that links to help and datasets, etc. The punchline is a very simple loop that I think will be useful. It is similar to a loop that @Patowhiz said was also very easy. I start with Patrick's loop and then this one, and interpret them both as loops over the levels of a factor, though the factor is a luxury, but useful for context.

a) We occasionally need to read multiple Excel files, with one sheet from each. We can read multiple sheets from a single Excel file already. Sometimes each climatic station, or perhaps each year, is in a separate file. Patrick said it would be easy to take the code for reading one year (or one station) and then generalise it to read them all in a loop.

If this works we could simply show a variety of ways we can do this loop over the names of the Excel files.

Then we could have a data frame with a variable giving these names. This could be a factor, and we could have a filter that includes some (or all) of these names. Then could we loop over the factor levels selected in the filter?
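A minimal sketch of the loop over file names described in a), not R-Instat's own code. The helper name `read_many`, the `source_file` column and the folder name in the usage comment are all hypothetical. The reader defaults to `readxl::read_excel`, which assumes the readxl package is installed, and each file is assumed to have the same columns.

```r
# Hypothetical helper: read one sheet from each file and stack the results.
# `reader` defaults to readxl::read_excel (assumes readxl is installed).
read_many <- function(paths, reader = readxl::read_excel, ...) {
  frames <- vector("list", length(paths))
  for (i in seq_along(paths)) {
    one <- as.data.frame(reader(paths[i], ...))
    one$source_file <- basename(paths[i])  # remember which file each row came from
    frames[[i]] <- one
  }
  out <- do.call(rbind, frames)
  # Store the file name as a factor, so its levels can later drive a filter
  out$source_file <- factor(out$source_file)
  out
}

# Usage (assumes a folder "stations" holding one .xlsx file per station):
# files <- list.files("stations", pattern = "\\.xlsx$", full.names = TRUE)
# all_stations <- read_many(files, sheet = 1)
```

Keeping the file name as a factor column is one way to get the "variable giving these names" mentioned above.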

b) Could we have a really simple example. I am liking the one below, but it may bring unnecessary complications.

c) I can now get to the same problem with the interesting data starting from the Hadley Wickham nycflights13. There is now a companion package, in CRAN, called anyflights. This is NOT to be added to R-Instat, because I keep looking for a good package we don't want to add, and this is perfect. It is a package to allow you to add further years of data on flights, and also weather for years other than 2013.

So it is only useful in instances where people are interested in that sort of data. And so, the anyflights package is a great example to illustrate our Tools > Install R Package dialog.

Once installed it seems really easy to add extra years of data, for multiple sites. But only one year at a time. Our new function in Insert for the get_weather function gives the following code:

### Name: get_weather
### Title: Query nycflights13-Like Weather Data
### Aliases: get_weather

### ** Examples

# query weather at Portland International in June 2018
## No test: 
## Not run: get_weather("PDX", 2018, 6)
## End(No test)

# ...or the original nycflights13 weather dataset
## No test: 
## Not run: get_weather(c("JFK", "LGA", "EWR"), 2013)
## End(No test)

# use the dir argument to indicate the folder to 
# save the data in as "weather.rda"
## No test: 
## Not run: get_weather("PDX", 2018, 6, dir = tempdir())
## End(No test)

I tried the line here: get_weather(c("JFK", "LGA", "EWR"), 2014)

I have not checked on the earliest year possible. I don't think it allows multiple years, so could we do a loop for that?

Initially we could just list the years in the code, e.g. 2000,2001, 2002, etc
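A sketch of what listing the years in the code could look like, assuming `get_weather()` only handles one year per call (not verified here) and that the anyflights package is installed. The wrapper name `get_weather_years` and its `fetch` argument are hypothetical; `fetch` exists only so a different one-year function can be swapped in.

```r
# Hypothetical wrapper: call a one-year-at-a-time function for each year
# and stack the yearly results into one data frame.
# The default fetch needs the anyflights package and internet access.
get_weather_years <- function(stations, years, fetch = anyflights::get_weather) {
  yearly <- vector("list", length(years))
  for (i in seq_along(years)) {
    yearly[[i]] <- fetch(stations, years[i])
  }
  do.call(rbind, yearly)
}

# Usage (not run here, because it downloads data):
# weather <- get_weather_years(c("JFK", "LGA", "EWR"), c(2013, 2014))
```

Whether every year in the list is actually available would still need checking.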

Assuming it works, then what are the different years and places? This could be "for real": for example, could we get the data from 1991 to 2020 for climate normals? But if the code has problems then we could choose a much simpler example.

The next step would be to put the years into a data frame, make the column a factor, and then choose the years we want with a filter. Could we then include this filter in the code? That's what I am trying to get to!
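The idea of looping over the factor levels left by a filter can be sketched in plain R. This does not use R-Instat's filter system; the logical vector `keep` stands in for a filter, and the names `years_df` and `selected_years` are made up for the illustration.

```r
# The years held as a factor in a data frame
years_df <- data.frame(year = factor(1991:2020))

# Stand-in for a filter: choose the years we want
keep <- years_df$year %in% c("2013", "2014")

# Take the levels that survive the filter. Converting the level labels,
# rather than the factor itself, avoids getting the internal integer codes.
selected_years <- as.integer(levels(droplevels(years_df$year[keep])))

# Loop over the selected levels; the body is a placeholder for the real work
for (yr in selected_years) {
  message("Processing year ", yr)
}
```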

The other way of doing loops will be to use the names of variables, and then be able to use a select instead!

N-thony commented 11 months ago

@rdstern do we have a specific example to include?
@N-thony I have given an example now above.

rdstern commented 9 months ago

I have now found a very simple example for a loop. It is one that is pretty trivial if using commands and I can only see how to do it one variable at a time in R-Instat currently. It uses the state.x77 data from the datasets package.

image

Then page 196 of the book Using R for Introductory Statistics by xxx has the following:

x <- sapply(as.data.frame(state.x77), rank)
rownames(x) <- rownames(state.x77)
heatmap(x, Rowv=NA, Colv=NA,
scale="column", # scale columns
margins=c(8, 6), # leave room for labels
col=rev(gray.colors(50))) # darker -> larger

I am just interested in the first line. In R-Instat we can already get the row ranks of multiple columns, but that's not what is wanted here. With the calculator I can get the column ranks, one column at a time. How could I get the set of column ranks and add them to the dataframe - or make a new dataframe if that is easier?
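For comparison, this is how the set of column ranks could be added to the data frame in plain R. It is only a sketch of the calculation, not of how an R-Instat dialog would do it; the `_rank` suffix and the object names are arbitrary choices.

```r
# state.x77 is a matrix in the datasets package; work with a data frame copy
states <- as.data.frame(state.x77)

# Rank every column, one column at a time
ranks <- as.data.frame(lapply(states, rank))
names(ranks) <- paste0(names(states), "_rank")

# Add the rank columns alongside the original ones
states_with_ranks <- cbind(states, ranks)
```

Dropping the `cbind()` step and keeping `ranks` on its own gives the "new data frame" alternative.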

Oooh, @lilyclements here is Prepare > Column: Numeric > Transform dialog.

image

What if we add a checkbox (default unchecked) with label Multiple. It is above the word Column (top right in the dialog) and is for all the top buttons. If checked, then the Data Selector just shows the current Selects, rather than the numeric columns. And the receiver now writes a new set of variables?

Might that be possible, even relatively easy to code?

(With the Prepare > Column: Numeric > Row Summaries dialog, Multiple option, Row ranks, we already have the code to write multiple columns!)

I still hope for looping (made easy) in the script files that are coming on nicely. But what do you think of this as a start on powerful looping within the menu system? Even the R-shy could use this!

And building on this idea I would like to start with the above. This uses the Transform dialog. It adds limited looping, partly because the Transform dialog is limited. (It is for those who find the calculator too scary.) Once we can do this, I am keen to see what has changed in the code. We should then be able to make the same changes for at least some calculations (using the calculator), because the calculator is the same as Transform, but more general. Can we then add loops for the calculator by:
a) Doing the calculation for a single variable.
b) Using To script.
c) Making the same changes in the script as are made "behind the scenes" in the Transform dialog.
d) Then adding a tab into the Insert dialog, to be able to add the needed lines as simply as possible.

rdstern commented 8 months ago

@N-thony I will be discussing the R code for this with @lilyclements on Friday. I wonder if someone could prepare the dialog in time for this. I think there would just be 3 additions.
a) Add a checkbox, with label Multiple, default unchecked, above the label Column, at the top right.
b) If checked, then the data selector will just have the Selects for each data frame.
c) And the New Column Name control will change and become the same as here. Ideally it would have a default, if a select is chosen, which would be the name of the select with an underscore. Ideally here it would be a suffix (not a prefix), so after the name, not before.

image

decathlon.zip I attach a file (from FactoMineR) where I have added 2 selects, one for the track events and the other for the field events. A simple loop would be to multiply the metres by 3.28 to put them into feet. (I wonder if we could have an option to replace the data in the existing columns?) The track events could be divided by 60, to put them into minutes.
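The two conversions can be sketched in plain R as follows. The data frame below holds made-up values, not the attached decathlon data, and the column names, the helper `convert_columns` and the `_feet` suffix are all illustrative. The character vectors play the role of the two selects.

```r
# Made-up values standing in for the decathlon data
decathlon_toy <- data.frame(
  Long.jump = c(7.0, 7.5),    # metres
  Shot.put  = c(14.0, 15.0),  # metres
  X400m     = c(48.0, 51.0),  # seconds
  X1500m    = c(270.0, 300.0) # seconds
)
field_events <- c("Long.jump", "Shot.put")  # stands in for the "field" select
track_events <- c("X400m", "X1500m")        # stands in for the "track" select

# Apply f to each chosen column. With a suffix, new columns are added;
# without one, the existing columns are replaced.
convert_columns <- function(data, cols, f, suffix = NULL) {
  target <- if (is.null(suffix)) cols else paste0(cols, suffix)
  data[target] <- lapply(data[cols], f)
  data
}

with_feet  <- convert_columns(decathlon_toy, field_events,
                              function(x) x * 3.28, suffix = "_feet")
in_minutes <- convert_columns(decathlon_toy, track_events,
                              function(x) x / 60)
```

The `suffix = NULL` branch corresponds to the "replace the data in the existing columns" option raised above.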

N-thony commented 8 months ago

@derekagorhom can you prepare the dialogue?

rdstern commented 8 months ago

@derekagorhom I have now had a discussion with @lilyclements, who bravely said that her part in changing the R code for this option should be quite easy. So, could you give priority to your changes in this dialog over your other tasks, and perhaps bring in @N-thony if needed, so it can be done quickly?

lilyclements commented 8 months ago

If I take your example

x <- sapply(as.data.frame(state.x77), rank)

then we can do this with the purrr package in R:

state.x77 <- as.data.frame(state.x77)

x <- purrr::map_df(.x = state.x77,
                   .f = ~ rank(.x))

We just modify the rank function to add parameters, like usual:

x <- purrr::map_df(.x = state.x77,
                   .f = ~ rank(.x, ties.method = "first"))

(Note that the rest of your example does not suit this sort of data since it wants a matrix, and the heatmap code is not ggplot2 code. I can give this full example as ggplot2 code, but I assume that's not the aim of this exercise).

This works for other functions too.

# to set everything to be a character
purrr::map_df(.x = state.x77,
                   .f = ~ as.character(.x))

If we want to do something to just some variables, I suggest we use across and mutate:

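library(dplyr)

selected_vars <- c("vs", "am")

# all_of() tells across() that the column names are held in a character vector
mtcars %>%
  mutate(across(all_of(selected_vars), ~rank(.x)))

rdstern commented 8 months ago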

@Vitalis95 I hope this message from @lilyclements is sufficient for you to take over @derekagorhom's #8862 branch. Derrick has quite a lot to do just now, particularly on the 2/3 way graphics dialog.

I note that the example given by Lily uses all variables in a data frame. That is the "trivial" select, called .everything, so should fit in the "system". It looks logical that the first line, which gives access to the data, takes the variables in the chosen select and makes them into a data frame. Then purrr will work on them.

lilyclements commented 8 months ago

@Vitalis95 @derekagorhom here's the code from our meeting

# from the "selected variables" receiver: 
survey <- data_book$get_data_frame("survey", column_selection_name = "selection")

# the code that is run (if the round function is selected):
survey <- survey %>%
  dplyr::mutate(dplyr::across(dplyr::everything(), ~round(.x)))

# rename our new columns with a suffix of "_transformed"
colnames(survey) <- paste0(colnames(survey), "_transformed")

# as normal, add columns to data
data_book$add_columns_to_data(data_name="survey",
                              col_data=survey,
                              before=FALSE)
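As a possible alternative to renaming the columns afterwards, `across()` has a `.names` argument that can add the suffix in the same step. The sketch below assumes dplyr 1.0.0 or later and uses a made-up data frame, `survey_selected`, in place of the columns returned by the receiver; the `data_book` calls are left out because they are specific to R-Instat.

```r
library(dplyr)

# Made-up stand-in for the columns returned by the "selected variables" receiver
survey_selected <- data.frame(yield = c(12.34, 56.78), size = c(1.49, 2.51))

# Round every column and name the results <column>_transformed.
# .keep = "none" returns only the newly created columns.
transformed <- survey_selected %>%
  mutate(across(everything(), ~ round(.x), .names = "{.col}_transformed"),
         .keep = "none")

names(transformed)
```

The result could then be passed on as `col_data` in the same way as the renamed data frame above.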
Vitalis95 commented 8 months ago

@lilyclements, if we have a function that is an operator, like the power function or scale multiple shown below, how can we apply the above code? If we put the ~ before the operator there is an error.

yield <- data_book$get_columns_from_data(data_name="survey", col_names="yield", use_current_filter=FALSE)
yield1 <- yield^-5
yield1 <- (yield - 0) * 24

lilyclements commented 8 months ago

@Vitalis95 I'm a little confused why you're doing "get_columns_from_data". I thought we said to use "get_data_frame" with column_selection_name = "<selection name>". Was there a problem with this code?

E.g., for round, we said to do:

survey <- data_book$get_data_frame("survey", column_selection_name = "selection")
survey <- survey %>% dplyr::mutate(dplyr::across(dplyr::everything(), ~round(.x)))
data_book$add_columns_to_data(data_name="survey", col_data=survey, before=FALSE)

Using this format, for your example with yield^-5, we just replace ~round(.x) with your new code, where .x is yield:

survey <- survey %>% dplyr::mutate(dplyr::across(dplyr::everything(), ~.x^-5))

For your example with (yield - 0) * 24, keep the brackets so that the subtraction happens before the multiplication:

survey <- survey %>% dplyr::mutate(dplyr::across(dplyr::everything(), ~(.x - 0) * 24))
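A small self-contained check of how operators behave inside `across()`, assuming dplyr is installed and using a made-up data frame. The `~` needs the `.x` placeholder on the side of the operator where the column belongs, and brackets matter: without them, `.x - 0 * 24` is evaluated as `.x - (0 * 24)`.

```r
library(dplyr)

# Made-up stand-in for the selected columns
survey_selected <- data.frame(yield = c(1, 2), size = c(2, 4))

# Power: .x marks where each column goes
powered <- survey_selected %>%
  mutate(across(everything(), ~ .x^-5))

# Scale: an ordinary function works as well as the ~ shorthand
scaled <- survey_selected %>%
  mutate(across(everything(), function(x) (x - 0) * 24))

# Without brackets the multiplication binds first, so nothing changes
unbracketed <- survey_selected %>%
  mutate(across(everything(), ~ .x - 0 * 24))
```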
rdstern commented 1 month ago

@lilyclements I now suggest a simple example to show how loops can be added to a calculation.
In the ggplot2 data the second example is called economics. We use it in a video on line plots. The next example is called economics_long and is the same data after a calculation that appears magically and has to be repeated 5 times.
I do the calculation 5 times in the video.

It is an example that I think could be done once and then looped through in a script. That would illustrate how repeated calculations can be done easily.

Here it is in the dialog:

image

This is then repeated for the other 4 variables.

I am sure there are alternative ways of doing this in a script? Can we give some examples. We now have the beginning of a script library.
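One possible script version, as a sketch only. The calculation in the screenshot is not reproduced here, so this assumes it is the 0 to 1 rescaling that gives the value01 column of economics_long. The function names `rescale01` and `add_rescaled` and the "01" suffix are hypothetical.

```r
# Rescale a numeric vector so that it runs from 0 to 1
rescale01 <- function(x) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  (x - lo) / (hi - lo)
}

# Do the calculation once in a function, then loop over the column names
add_rescaled <- function(data, cols, suffix = "01") {
  for (col in cols) {
    data[[paste0(col, suffix)]] <- rescale01(data[[col]])
  }
  data
}

# Usage (assumes ggplot2 is installed, for the economics data):
# econ <- add_rescaled(ggplot2::economics,
#                      c("pce", "pop", "psavert", "uempmed", "unemploy"))
```

The same loop could be written with `lapply()` or `dplyr::across()`; the explicit `for` is used here because it shows the repetition most directly.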

Another example that Patrick says is easy is multiple Excel files with one sheet in each file. We can do multiple sheets from one Excel file and (I think) multiple files, if csv, etc. But not multiple Excel files, and I am not keen to encourage that.