IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Describe > Two/Three Variables > Summarise #4952

Open rdstern opened 6 years ago

rdstern commented 6 years ago

There is a lot to improve here.

I suggest a few steps, so (at least) the dialogues are consistent. But then I think discussion is needed with @dannyparsons and @volloholic to check on the strategy. Once that's agreed, and I would like to write the discussion points in this issue, then I suggest it could be an interesting and important task largely for @Muthenya , with support from the others?

So an initial suggestion. The 2 dialogues remain inconsistent and should not be. Summarise has a single receiver first and then a multiple receiver. Graph has the same, but the other way round.

I suggest that most people will be considering 2 specific variables when they visit this dialogue. So could we have the same idea as in the specific graphs, namely a) The first receiver is always a single receiver. b) The second one is also a single receiver (by default), but with the same button as on the initial receiver for the Describe > Specific > Boxplot, etc. So it says Single, and can be changed to Multiple, in which case it becomes a Multiple receiver.

I don't necessarily expect @dannyparsons and @volloholic to agree, even with this, but propose that the extra button will allow the idea of the dialogue to be explained clearly.

rdstern commented 6 years ago

Next proposal for these 2 dialogues: Currently the main dialogue hides anything to do with the type for the variables. There are currently 4 options, namely: 1) Numeric by Numeric 2) Numeric by Factor 3) Factor by Numeric 4) Factor by Factor

On the Describe > Two Variable > Summarise there is currently a sub-dialogue - I guess there will be! On the Describe > Two Variable > Graphics there is:

image

So, suddenly you do have to understand this idea! I suggest, instead, that we have the new-style radio buttons at the top of each (main) dialogue. This will make the selection of the variables easier. Then the sub-dialogues could also be tabbed and the Options button takes you to the correct tab.

Muthenya commented 6 years ago

@rdstern is this still awaiting further discussion?

Muthenya commented 5 years ago

@rdstern?

dannyparsons commented 5 years ago

I would like to make some progress on improving this. Here are some suggestions:

For the summarise dialog the summaries will be:

This is sort of how it works now, so I think this was discussed before.

Questions

rdstern commented 5 years ago

I wonder, with the two variable situation, whether we (at least) have two radio buttons called By and 2 Variables? The By is simply the one variable dialogue with results By a second variable. This would be like the grouped data frame idea in dplyr.
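To make the By idea concrete, here is a minimal sketch of a one-variable summary produced By a second variable. It uses base R `aggregate()` so that it runs without extra packages; with dplyr the same result would come from `group_by()` followed by `summarise()`. The data frame and its column names are made up for illustration.

```r
# Hypothetical example data; the column names are illustrative only
survey <- data.frame(
  variety = factor(c("new", "new", "old", "old", "old")),
  yield   = c(4.0, 5.0, 2.0, 3.0, 4.0)
)

# One-variable summary (mean of yield), reported By the second variable
by_means <- aggregate(yield ~ variety, data = survey, FUN = mean)
print(by_means)
```

The result has one row per level of the By variable, which is the shape a grouped data frame summary gives.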

I suggest we consider the three variable dialogues (and possibly add a 4 level item to the menu at the same time). I suggest this will be useful, and continues David's initial idea.

So the three-level could (at least initially) be simply 2 By, and By. For consistency we might include a 3rd button which is 3 Variables, but this would be disabled for now.

Here the 2 By is the same as the one Variable multiple receiver, split by 2 factor variables. The By is all the 2-variable options split by one factor.

If it looks useful, then we could add the 4 Variables situation, which might just be the 2-variables summaries split by 2 factors.

Of course there are other options for 4 variables, but many analyses seem to stop at 2-way tables, etc. (and there is a fair bit to teach here) and we do want to encourage users to move to the more general situation. So, at least for now, I suggest we don't worry about too many 3-variable tables etc.

This split will mean that the one-variable By will allow the multiple receiver to permit any, or all, variables, while the two-variable summaries can restrict the multiple receiver to be of a single type.

Allowing the By up to 2 factors also fits well with the graphics, where the default can be for a by to be a facet.

dannyparsons commented 5 years ago

How should date columns be treated? We had thought like a numeric column, but you can't do correlations with them like a numeric column, and they also can't be used as the response variable in an ANOVA table. We can either convert them to numeric and use the underlying numbers or date types could be excluded from the selector if we don't want to use them in these cases.

rdstern commented 5 years ago

Interesting. Is this a logic question or an R question? I can think of examples where correlations or regression could be useful. 1) I use ODK for a survey. For each respondent it includes the interviewer, the date/time when the survey process started and the duration of the survey in minutes. I would like to know the relationship between the date/time and the duration. 2) I record the date of the start of the rains and also the latitude of the farm. What is the relationship between the start date and the latitude?

Is this more a question of a suitable origin being needed, or perhaps there is usually a logical start so it is a difference in dates that is being used. In many studies there is a natural origin (often zero), while dates have an arbitrary origin. I get this problem when looking at trends in temperature, with year as the x - but could be daily with a date as the x. Then (with year) the origin is year zero, which is a long time ago! In a practical sense this can mess up the regression modelling, so better to have a more sensible origin. Is the paper by Cox useful here? Perhaps the issue of it being a date is less relevant than the fact there are instances where just making it numeric is not sensible, because the variability of the different date/times may be very low compared to the size of the observations?

dannyparsons commented 5 years ago

The question was sort of both. In R these give an error, but you can convert to numeric to get days since 1970/1/1. As you say, I think the origin is arbitrary, since it's the differences and not the actual values that are usually of interest. So I think it's sensible to treat dates as numeric for this dialog.
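A small base R sketch of the point about dates: `cor()` refuses a Date column, converting it gives days since 1970-01-01, and moving the origin leaves the correlation unchanged. The dates and durations are invented for illustration.

```r
# Made-up survey start dates and durations (minutes)
dates    <- as.Date(c("2020-01-01", "2020-01-11", "2020-01-21", "2020-01-31"))
duration <- c(30, 28, 25, 21)

# cor() rejects a Date column; capture the message instead of stopping
date_error <- tryCatch(cor(dates, duration),
                       error = function(e) conditionMessage(e))

# Converting to numeric gives days since 1970-01-01
days_since_epoch <- as.numeric(dates)

# A more natural origin: days since the first survey date
days_since_start <- as.numeric(dates - min(dates))

# The correlation does not depend on which origin is used
r_epoch <- cor(days_since_epoch, duration)
r_start <- cor(days_since_start, duration)
```

The origin still matters for the intercept of a regression, which is the practical concern raised earlier about year zero, so a dialog might convert relative to the earliest date rather than 1970.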

rdstern commented 4 years ago

I assume this is the place to comment on the 2-variable summarise. I am using @dannyparsons new version.
a) A detail - the dialogue title for the Graph is Two Variable Graph - which seems sensible. Here the title is Describe Two Variables. I suggest Two Variable Summary. b) The graph dialogue now has some useful options. Could we please have the same for the summary? In particular they would be great for the situation with categorical by categorical.

Currently you only get the Counts. You don't get the margins, I think. Could we have a box with the 4 options as check-boxes, namely Counts, Row%, Column% (or Col%), Cell%? Ideally you could have them all. If that is a mess (maybe, because there are also multiple sets of tables) then perhaps you initially choose one of them. Or, if there is just one variable in the first receiver, then you allow all, but if more than 1, then you choose a single option. (There would be some sense to that, in that you would then, to some extent, be comparing the different variables, and hence you choose on which summary to compare them.)
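For reference, the four proposed options correspond to the following in base R. This is only a sketch with invented data, not the code the dialog would generate.

```r
# Invented categorical data for a 2-way table
village <- factor(c("A", "A", "A", "B", "B", "B", "B", "B"))
variety <- factor(c("new", "old", "old", "new", "new", "new", "old", "old"))

counts   <- table(village, variety)               # Counts
row_pct  <- 100 * prop.table(counts, margin = 1)  # Row%: each row sums to 100
col_pct  <- 100 * prop.table(counts, margin = 2)  # Column%: each column sums to 100
cell_pct <- 100 * prop.table(counts)              # Cell%: whole table sums to 100

print(counts)
print(round(row_pct, 1))
```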

The margins are also interesting. Sometimes you want one, but not both. Initially I would be happy with a single checkbox so you either get the margins or not. Ideally there would eventually also be another checkbox, perhaps only visible when you ask for one of the percents. It could be labelled as "Counts for 100%". (This is what Genstat does as default.) The 100% is useful for teaching, as it reminds you what is 100%. But once you know that, then the table is much more useful if the 100% is replaced (perhaps an option to add?) by the Counts - that answers the question of "percent of what".
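A base R sketch of the two margin displays described above, using invented data: the teaching version keeps a margin showing 100%, and the "Counts for 100%" version replaces that margin with the counts the percentages are based on. The column label `Count` is an assumption for illustration.

```r
village <- factor(c("A", "A", "A", "B", "B", "B", "B", "B"))
variety <- factor(c("new", "old", "old", "new", "new", "new", "old", "old"))

counts  <- table(village, variety)
row_pct <- round(100 * prop.table(counts, margin = 1), 1)

# Teaching version: the Sum column reminds the reader what adds up to 100%
with_100 <- addmargins(row_pct, margin = 2)

# "Counts for 100%" version: the margin holds the row counts instead
with_counts <- cbind(unclass(row_pct), Count = rowSums(counts))

print(with_100)
print(with_counts)
```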

rdstern commented 3 years ago

@Ivanluv the suggestions just above are for the situation with categorical by categorical. There is also the trivial one of changing the name on the top of the dialogue to Two Variables Summarise. Then all options may change, but the urgent one is Categorical by Categorical. Here is the current dialogue and results for the standard rice survey data:

image

There are no options and the display is pretty awful!
I suggest this dialogue could become one of our "work-horses" for many users doing simple analyses. They often do look at the results from 2 variables. So the initial set of suggestions is to have a good display of 2-way frequency tables, and this makes a set of tables that are special cases of our new Describe > Specific > Frequency Tables dialogue. The layout shown above could still be included, perhaps with percentages included, and these could then be saved into a new data frame. It would be slightly different, because count would be the name of the variable (and there could be others, i.e. percents) and there would be a first variable called Table. And there should be an option "Include zeros". This, with these changes, would also be one option for the display in the output window.
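A rough base R sketch of the long layout described above: a data frame with a `count` variable, an added percent column, and an "Include zeros" switch. The data and the `include_zeros` flag are assumptions for illustration, and the proposed Table column for multiple tables is left out.

```r
village <- factor(c("A", "A", "B", "B", "B"))
variety <- factor(c("new", "old", "new", "new", "new"))

# Long layout: one row per cell, with the frequency stored in "count"
freq_long <- as.data.frame(table(village, variety), responseName = "count")
freq_long$percent <- 100 * freq_long$count / sum(freq_long$count)

# "Include zeros" switch (hypothetical name): drop empty cells when FALSE
include_zeros <- FALSE
freq_shown <- if (include_zeros) freq_long else freq_long[freq_long$count > 0, ]

print(freq_shown)
```

Here the B by old cell is empty, so it appears only when `include_zeros` is TRUE.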

Ivanluv commented 3 years ago

@rdstern should I use the sjPlot::sjtab function as the one in Describe > Specific > Frequency Tables dialogue to implement the improvements you have suggested above?

rdstern commented 2 years ago

@Ivanluv now I see at least one example of the questions where you expected an answer - and I missed it, and you didn't remind me! Also perhaps that you would like more specific direction - though (as a programmer) you are asking more detail from me than (as a user) I have. I assume sjtab is a possibility. But I was hoping it would be a simple case for mmtable2 as the default. That is what we are using for the general tabulation now, so it would be a nice introduction to that package to use it here (and in the 3-way, once we get that dialogue). That may be obvious to you now, but otherwise, perhaps @lilyclements could confirm or deny?

lilyclements commented 2 years ago

Using mmtable2 here seems like a good solution, and implementation-wise (I assume) is very similar to the work already done in the summaries dialog.

Ivanluv commented 2 years ago

@lilyclements the object produced by data_book$frequency_tables(data_name="survey", x_col_names=c("village","fertgrp"), y_col_name="variety", store_results=FALSE, as_html=FALSE) is of type NULL. How can I have it passed to mmtable?

lilyclements commented 2 years ago

> @lilyclements the object produced by data_book$frequency_tables(data_name="survey", x_col_names=c("village","fertgrp"), y_col_name="variety", store_results=FALSE, as_html=FALSE) is of type NULL. How can I have it passed to mmtable?

It cannot be passed to mmtable2 in its current form. If you run this code in R, you can see the output is two tables

image

Out of interest, how does frequency_tables differ to using summary_tables and limiting to the frequency-type variables? (like we are in the frequency tables dialog) Is there a reason this is being used here?

rdstern commented 2 years ago

I wonder if this starts to raise the more general question of whether we have a separate Describe > Specific > Frequency and Summary tables dialogue? Should we consider having a Tables dialogue with a frequency and Summary button at the top?

The main difference in the frequency tables is just the need for the percentages.

dannyparsons commented 2 years ago

That sounds sensible.

rdstern commented 1 year ago

@derekagorhom I suggest this is a topic where you should aim for progress while @lilyclements is in Ghana with you. I mention a lot of components and we will almost certainly want to merge improvements while waiting for others.

Here is the current dialog:

image

The main suggestion below is to implement the 3-variable summarise. There are also improvements needed to the 2-variable cases and I consider them in the next comment. I hope you will find some code already for the 3 variable stuff.

The receivers will be similar to this:

image

They will be labelled Second Variable: and Third Variable:

On the left the screen will resemble this, but with the 3 variables. I assume there may be existing (commented) code for this?

image

This time there will be eight combinations and I would like you to start with 4 of them. Namely if the first (multiple) is numeric, then it gives ANOVA Tables for any of the other combinations of the second and third.

The code for the ANOVA option for 2-way should be your guide, i.e.

image

(Note that is currently the other way round for 2 variables, i.e. categorical by numeric, and I suggest changing that below.)

In the option there is a checkbox labelled Interaction, default unchecked. (If unchecked the model will have a plus, if checked it will have a star - @lilyclements can explain if needed) There is a second checkbox, default unchecked, labelled P values.
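To illustrate the plus and star, here is a small sketch with simulated data. The helper `build_formula()` and its `interaction` argument are hypothetical names, not existing R-Instat code. With a plus the ANOVA table has one row per variable, and with a star it gains an interaction row.

```r
set.seed(1)
survey <- data.frame(
  variety = factor(rep(c("new", "old"), each = 6)),
  fertgrp = factor(rep(c("low", "mid", "high"), times = 4)),
  yield   = rnorm(12, mean = 40, sd = 5)   # simulated response
)

# Hypothetical helper: the Interaction checkbox switches "+" to "*"
build_formula <- function(y, x, interaction = FALSE) {
  sep <- if (interaction) " * " else " + "
  as.formula(paste(y, "~", paste(x, collapse = sep)))
}

x_vars   <- c("variety", "fertgrp")
additive <- anova(lm(build_formula("yield", x_vars), data = survey))
full     <- anova(lm(build_formula("yield", x_vars, interaction = TRUE), data = survey))

rownames(additive)  # main effects and residuals only
rownames(full)      # also includes the variety:fertgrp interaction row
```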

rdstern commented 1 year ago

Three way tables when the multiple receiver is categorical.
a) Categorical by Categorical by Categorical: Summary is Frequency tables. That should be easy adapting the code from the 2-way frequency table code. (Currently just one table in multiple receiver used - ignore the others)
b) Categorical by Categorical by Numeric: Summary is 2-way Summary tables. Should be easy, because the code from the one-way summary table can be adapted. (Currently just one table in multiple receiver used.)
c) Categorical by Numeric by Numeric: Can have correlations for each level of the factor.
d) Categorical by Numeric by Categorical? Blank for now and Ok not implemented.

lilyclements commented 1 year ago

For a categorical multiple receiver

| Type | @rdstern's Comment | R Code |
| --- | --- | --- |
| Categorical by Categorical by Categorical | Summary is Frequency tables. That should be easy adapting the code from the 2-way frequency table code. (Currently just one table in multiple receiver used - ignore the others) | For this we would have all the variables in the three receivers read into the factors parameter |
| Categorical by Categorical by Numeric | Summary is 2-way Summary tables. Should be easy, because the code from the one-way summary table can be adapted. (Currently just one table in multiple receiver used.) | For this we would have the two categorical receivers read into the factors parameter, and the numerical variable read into the columns_to_summarise parameter |
| Categorical by Numeric by Categorical | Same as CxCxN | Same as CxCxN |
| Categorical by Numeric by Numeric | Can have correlations for each level of the factor. | `DATA %>% dplyr::group_by(<Categorical Receiver Variable>) %>% dplyr::summarise(cor(<1st Numeric Receiver Variable>, <2nd Numeric Receiver Variable>))` |

For C x N x N

An example with the diamonds data of what to expect -

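library(dplyr)  # attaches the %>% pipe
# Correlation of x and y within each level of color (diamonds is in ggplot2)
ggplot2::diamonds %>%
  dplyr::group_by(color) %>%
  dplyr::summarise(cor(x, y)) %>%
  gt::gt()

rdstern commented 1 year ago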

@lilyclements many thanks for that. With your nice neat layout above why not have another summary table from Categorical by Numeric by Categorical? So it is the same as Categorical by Categorical by Numeric?

lilyclements commented 1 year ago

@rdstern I've amended my table to reflect those changes :)

rdstern commented 1 year ago

@derekagorhom can you try this one? Perhaps even share with Raphael, if he is ready? It is a good one to build carefully on the 2-variable code and includes a lot of statistics too! It would be good for Sabi to test.

rdstern commented 12 months ago

@derekagorhom I can understand why you have been quiet on this one, given all you have been doing concerned with the AIMS course. Are you happy to work on this one, once that is over, or do you have too many other tasks just now?

derekagorhom commented 12 months ago

@rdstern sorry for the late reply. Yes, I will work on it next week, but if someone else would like to attempt it, that is fine with me.

rdstern commented 11 months ago

@derekagorhom this is an important dialog we need to get working. I am starting to be concerned that you may be spending too long helping on the new visualise dialog, which is fun but much less important. I had hoped that work on this one might have started while Lily was visiting. Now it could involve @fran2or for support. I'd be happy for him to be spending a bit more time on R-Instat, and this one is also now in the climatic menu as well as in describe.

rdstern commented 11 months ago

@derekagorhom you have been very quiet these last 2 weeks. Is everything ok?

derekagorhom commented 11 months ago

@rdstern sorry for being quiet on this issue. I was able to implement CxCxC and CxCxN for this option. I am having a problem adding the CxNxN function because of how the summaries were programmed:

image

For the three variable option the second variable only displays categorical even when it is a numeric value... I was hoping to get it fixed with Antoine by Monday.

rdstern commented 11 months ago

@derekagorhom that's great - many thanks. I was only concerned if the work hadn't started. Four of the 8 options are N by something, by something and we assume all can be ANOVA, so they should be quick, once you start on them.

rdstern commented 6 months ago

@Vitalis95 I have yet to check your recent pull request. But I had a good discussion with @volloholic and am now ready to list the way I suggest this dialog, with our summary methods, should work. I am now even more comfortable with the main change, starting with the 3-way, that when the multiple (first) receiver is numeric, then the summary should include ANOVA. Here I am therefore specifying what the 2-variable should do. The next entry will move to the 3-variable. But I suggest we merge once the 2-variable is presentable - and keep the 3-variable then hidden. We may even merge when the 2-variable is sort of ok, even if all the improvements are not yet implemented.

1) So here goes for 2 variable - initially:
a) Multiple Categoric, second Categoric, gives frequency tables, one table for each variable in the multiple receiver. That is as now, but I think currently it may put everything into one big frequency table. This is now a set of separate 2-way frequency tables, now we can do that.
b) Multiple Categoric, second Numeric. Summary tables, where the (maybe multiple) summaries are for each factor. So I think it should be multiple summaries as columns, by each of the categorical variables in turn. So again multiple tables - now we can do that!
c) Multiple numeric now gives ANOVA table, whether second is numeric or categorical. It gives a separate ANOVA table for each of the variables in the multiple receiver.
This is done (for one of the options) already, but it uses a function. I would prefer it to use just the code for the commands instead, so you see that it is using lm. This is now done already (by @lilyclements and @derekagorhom) in the (more complicated) 3-way case. I hope you can adapt that code.

That's stage 1 for 2 variables. Notice we have lost the correlations option. Don't delete that code, because I suggest we still need it, see below.
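The stage 1 rules can be sketched as a small dispatch on the types of the two variables. The helper names `var_kind()` and `choose_summary()` are hypothetical, and treating dates as numeric follows the earlier discussion in this issue rather than any existing code.

```r
# Hypothetical helper: classify a column as numeric or categorical.
# Dates count as numeric here, as suggested earlier in this issue.
var_kind <- function(x) {
  if (is.numeric(x) || inherits(x, "Date")) "numeric" else "categorical"
}

# Hypothetical helper: which summary stage 1 gives for first (y) by second (x)
choose_summary <- function(y, x) {
  if (var_kind(y) == "numeric") return("anova")   # c) numeric y, any x
  if (var_kind(x) == "numeric") "summary table"   # b) categorical by numeric
  else "frequency table"                          # a) categorical by categorical
}

choose_summary(c(1.2, 3.4), factor(c("a", "b")))      # numeric by categorical
choose_summary(factor(c("a", "b")), factor(c("u", "v")))  # categorical by categorical
```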

So two (new) improvements:

a) Numeric by Numeric we could also have the correlations. So add a checkbox Correlations. Default unchecked. If checked, then it gives the ANOVA anyway, plus the correlations. (Later we may add another checkbox perhaps saying Model where we give the formula for the regression line. Again default is unchecked.) b) Numeric by Categorical. Have a checkbox with label Means. If checked it gives the Means as well as the ANOVA table.

c) And another change - maybe later. (But I think it is a real "goodie" and the first steps can be done now!) Add a checkbox labelled Swap y and x. Default unchecked. For now make it disabled.

d) In the variables for this Summary (top radio button) Add (y) to the name, so it becomes First Variables (y): And also Second Variable (x):

I would like to merge initially at this stage. Then continue with the rest below:

Initially I am just interested in:
a) Numeric by Numeric: With Swap checked, it gives the ANOVA with the same y (the second variable) and each x in turn. The default is ANOVA for lots of alternative y variables and the same x.
b) Categorical by Numeric becomes effectively Numeric by Categorical, so it now gives an ANOVA with one factor (as the ordinary Numeric by Categorical would), but with the same y and lots of categorical x's. (I'll worry about the other combinations later! I hope we don't need to change anything there! So the Swap y and x checkbox is currently disabled for Categorical by Categorical and Numeric by Categorical.)
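A sketch of what the swap would mean for Numeric by Numeric, using invented data. The function `fit_each()` and its `swap` argument are hypothetical. By default each variable in the multiple receiver is a y fitted against the same x, and with swap the second variable is the single y fitted against each of them in turn.

```r
survey <- data.frame(
  yield = c(2.1, 3.4, 2.9, 4.8, 5.1, 6.3, 5.9, 7.2),
  size  = c(1.0, 1.5, 1.2, 2.2, 2.0, 2.9, 3.1, 3.3),
  fert  = c(10, 20, 20, 30, 40, 50, 50, 60)
)

# Hypothetical helper: one ANOVA table per variable in the multiple receiver
fit_each <- function(first_vars, second_var, data, swap = FALSE) {
  lapply(stats::setNames(first_vars, first_vars), function(v) {
    model_formula <- if (swap) {
      reformulate(v, response = second_var)   # second_var ~ v
    } else {
      reformulate(second_var, response = v)   # v ~ second_var
    }
    anova(lm(model_formula, data = data))
  })
}

default_fits <- fit_each(c("yield", "size"), "fert", survey)
swapped_fits <- fit_each(c("yield", "size"), "fert", survey, swap = TRUE)
```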

Vitalis95 commented 6 months ago

@rdstern, @lilyclements a clarification on the following: in the 2 var summaries, for now the Categorical by Numeric gives an ANOVA table. Should it be summary tables, so that when we swap it gives ANOVA tables? Also, the Numeric by Categorical gives summary tables. Should it be ANOVA tables, or both?

lilyclements commented 6 months ago

@Vitalis95

If the y is numerical, and the x is categorical, it should give an ANOVA table. Is this what you mean by categorical by numeric? (Apologies, I can get confused!)

If the x is numerical, and the y is categorical, we can get summaries. If the y is categorical, then we shouldn't have an ANOVA table. (an ANOVA table is fitted to a model where y is normally distributed)

rdstern commented 6 months ago

@Vitalis95 we can chat today. I think you are correct and that's what I posted last week. You may want to read that post again? Numeric (Multiple) by Categorical now gives ANOVA and so does Numeric (Multiple) by Numeric.
With Numeric (Multiple) by Numeric you now also add a Correlations checkbox, default unchecked. With Numeric (Multiple) by Categorical you now add a Means checkbox. Default unchecked. If checked it also gives a table of means.
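As a rough sketch of how the two checkboxes could add to the ANOVA output, assuming a hypothetical helper `describe_numeric_y()` and invented data: the ANOVA is always returned, the correlation is added only for a numeric x, and the means only for a categorical x.

```r
survey <- data.frame(
  yield   = c(40, 44, 39, 47, 50, 43),
  fert    = c(10, 20, 10, 30, 40, 20),
  variety = factor(c("new", "new", "old", "old", "new", "old"))
)

# Hypothetical helper: ANOVA always, extras only when the checkbox applies
describe_numeric_y <- function(y, x, data, correlations = FALSE, means = FALSE) {
  out <- list(anova = anova(lm(reformulate(x, response = y), data = data)))
  x_is_numeric <- is.numeric(data[[x]])
  if (correlations && x_is_numeric) out$correlation <- cor(data[[y]], data[[x]])
  if (means && !x_is_numeric) out$means <- tapply(data[[y]], data[[x]], mean)
  out
}

by_factor  <- describe_numeric_y("yield", "variety", survey, means = TRUE)
by_numeric <- describe_numeric_y("yield", "fert", survey, correlations = TRUE)
```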

rdstern commented 4 months ago

This still needs the 3 variable, so I'm re-opening

Vitalis95 commented 1 month ago

@lilyclements , for the 3 variables , when the means=TRUE,

y_col_names_list <- "yield"
purrr::walk(.x=y_col_names_list, .f= ~ data_book$anova_tables2(data="survey",  x_col_names=c("variety", "fertgrp"), y_col_name=.x, signif.stars=FALSE, sign_level=FALSE, means=TRUE, total=TRUE))
rm(y_col_names_list)

We get the following error;

image

Please can you also add the interaction term

Vitalis95 commented 3 weeks ago

@lilyclements , include_margins argument in summary_table produces an error, it used to work before

image

here is the code;

survey <- data_book$get_data_frame(data_name="survey")
last_table <- survey %>% pivot_wider(names_from={{ .x }}, values_from=value) %>% purrr::map(.f=~data_book$summary_table(data_name="survey", percentage_type="factors", perc_total_factors="variety", summaries=count_label, include_margins=TRUE, margin_name="All", treat_columns_as_factor=FALSE, columns_to_summarise=.x, factors=c("village",.x)) %>% pivot_wider(names_from={{ .x }}, values_from=value) %>% gt::gt(), .x="variety")
data_book$add_object(data_name="survey", object_name="last_table", object_type_label="table", object_format="html", object=last_table)
data_book$get_object_data(data_name="survey", object_name="last_table", as_file=TRUE)

lilyclements commented 3 weeks ago

@Vitalis95 thanks for this. To fix this, can you amend the anova_tables2 function in the data_object_R6.R file to be:

(Really simple - just changing the line

    if (class(mod$model[[x_col_names]]) %in% c("numeric", "integer")){

to

    if (class(mod$model[[x_col_names[[1]]]]) %in% c("numeric", "integer")){

)
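One plausible reading of why the original line fails with two x variables (the error screenshot is not reproduced here, so this is an assumption): a character vector of length two inside `[[ ]]` asks R for recursive indexing rather than for two columns. A minimal sketch with an invented model frame:

```r
# Invented stand-in for mod$model with two explanatory variables
model_frame <- data.frame(
  yield   = c(40, 42, 38, 45),
  variety = factor(c("new", "old", "new", "old")),
  fertgrp = factor(c("low", "low", "high", "high"))
)
x_col_names <- c("variety", "fertgrp")

# Old form: [[ ]] with two names tries model_frame[["variety"]][["fertgrp"]]
old_check <- tryCatch(class(model_frame[[x_col_names]]),
                      error = function(e) "error")

# Amended form: only the first x variable is inspected
new_check <- class(model_frame[[x_col_names[[1]]]])

print(old_check)
print(new_check)
```

Note that the amended check looks at the first x variable only, so a mix of numeric and categorical x variables would follow whichever type comes first.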

If it's easier: The entire function should now be:

DataSheet$set("public", "anova_tables2", function(x_col_names, y_col_name, total = FALSE, signif.stars = FALSE, sign_level = FALSE, means = FALSE) {
  if (missing(x_col_names) || missing(y_col_name)) stop("Both x_col_names and y_col_name are required")
  if (sign_level || signif.stars) message("This is no longer descriptive")
  if (sign_level) end_col = 5 else end_col = 4

  # Construct the formula
  if (length(x_col_names) == 1) {
    formula_str <- paste0(as.name(y_col_name), "~ ", as.name(x_col_names))
  } else if (length(x_col_names) > 1) {
    formula_str <- paste0(as.name(y_col_name), "~ ", as.name(paste(x_col_names, collapse = " + ")))
  }

  # Fit the model
  mod <- lm(formula = as.formula(formula_str), data = self$get_data_frame())
  anova_mod <- anova(mod)[1:end_col] %>% tibble::as_tibble(rownames = " ")

  # Add the total row if requested
  if (total) anova_mod <- anova_mod %>% tibble::add_row(` ` = "Total", dplyr::summarise(., across(where(is.numeric), sum)))
  anova_mod$`F value` <- round(anova_mod$`F value`, 4)
  if (sign_level) anova_mod$`Pr(>F)` <- format.pval(anova_mod$`Pr(>F)`, digits = 4, eps = 0.001)
  cat(paste0("ANOVA of ", formula_str, ":\n"))
  print(anova_mod)
  cat("\n")
  # Optionally print means
  if (means) {
    if (class(mod$model[[x_col_names[[1]]]]) %in% c("numeric", "integer")){
      cat("Model coefficients:\n")
      print(mod$coefficients)
      cat("\n")
    } else {
      cat(paste0("Means table of ", y_col_name, ":\n"))
      print(model.tables(aov(mod), type = "means"))
      cat("\n")
    }
  }
}