TiagoOlivoto / metan

Package for multi-environment trial analysis
https://tiagoolivoto.github.io/metan/
GNU General Public License v3.0
35 stars 17 forks source link

Inconsistency with cv_blup function #3

Closed billh2006 closed 4 years ago

billh2006 commented 4 years ago

Hi,

I am testing a sample dataset that has the same format as the oat dataset as demonstrated. This was a RCBD design with 3 replicates

str(dataset) 'data.frame': 791 obs. of 4 variables: $ ENV : Factor w/ 13 levels "2017_P","2017_S",..: 10 10 10 10 10 10 10 10 10 10 ... $ GEN : Factor w/ 30 levels "8xp","XR_A0",..: 23 24 30 2 10 26 11 3 12 27 ... $ BLOCK: int 1 1 1 1 1 1 1 1 1 1 ... $ GY : num 233 225 248 259 255 ...

When I use cv_blup, I get the following error message:

BLUP_model = cv_blup(dataset, resp = GY, gen = GEN, env = ENV, rep = BLOCK, nboot = 200, random="gen")

Error: Column error must be length 262 (the number of rows) or one, not 264 In addition: Warning message: In pred - testing$Y : longer object length is not a multiple of shorter object length

When I rerun it the command multiple times without changing anything else, the first line of the error message changes each time. For example: Error: Column error must be length 261 (the number of rows) or one, not 265 Error: Column error must be length 262 (the number of rows) or one, not 266 Error: Column error must be length 262 (the number of rows) or one, not 263 Error: Column error must be length 262 (the number of rows) or one, not 266 Error: Column error must be length 262 (the number of rows) or one, not 265 Error: Column error must be length 261 (the number of rows) or one, not 267 Error: Column error must be length 262 (the number of rows) or one, not 265

Do you know what is happening here?

Thanks!

TiagoOlivoto commented 4 years ago

Hi, thank you for your report. Could you please pass the column BLOCK to factor with dataset <- to_factor(dataset, BLOCK) ,and report if any warning is generated when you run inspect(data)? Is there any missing value in the data? Please, use has_na(data). Based on the levels of your experiment the expected number of rows would be 1170 (13 x 30 x 3), Am I right? Maybe have you unbalanced data? Try to make a two-way table with make_mat(dataset, ENV, GEN, GY) and see if any cell has NA.

You can use the package reprex to report the results here Regards!

TiagoOlivoto commented 4 years ago

@billh2006 I closed accidentally this issue. I added a commit that improves the check of data before computing the cross-validation. Please, install the development version and check with your data.

billh2006 commented 4 years ago

@TiagoOlivoto- Thanks for your quick response. Yes-my data set is unbalanced. Certain genotypes are not represented in all environments and 8 (thank you for the new error message) of the genotype-environment interactions are missing data for one or more of the reps. Is a complete, balanced dataset a necessary condition for cv_blup and downstream functions (e.g. waasb)?

TiagoOlivoto commented 4 years ago

@billh2006 - You can have unbalanced data but it is necessary that all genotype-environment combinations have the same number of replicates (complete dataset). If you assume to have three replicates, for each cross-validation procedure the function will select randomly two replicates for each genotype-environment combination to serve as training set. The other replicate will serve as validation set. If for some GEI combinations you have only two reps, the model cannot be computed because there will be no validation data for such a combination. Please, note a toy example bellow with the example data data_ge

library(metan)
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
#> []=====================================================[]
#> [] Multi-Environment Trial Analysis (metan) v1.4.0.9000[]
#> [] Author: Tiago Olivoto                               []
#> [] Type citation('metan') to know how to cite metan    []
#> [] Type vignette('metan_start') for a short tutorial   []
#> [] Visit http://bit.ly/2TIq6JE for a complete tutorial []
#> []=====================================================[]
df <- data_ge
df[1, 4] <- NA
df
#> # A tibble: 420 x 5
#>    ENV   GEN   REP      GY    HM
#>    <fct> <fct> <fct> <dbl> <dbl>
#>  1 E1    G1    1     NA     44.9
#>  2 E1    G1    2      2.50  46.9
#>  3 E1    G1    3      2.43  47.8
#>  4 E1    G2    1      3.21  45.2
#>  5 E1    G2    2      2.93  45.3
#>  6 E1    G2    3      2.56  45.5
#>  7 E1    G3    1      2.77  46.7
#>  8 E1    G3    2      3.62  43.2
#>  9 E1    G3    3      2.28  47.8
#> 10 E1    G4    1      2.36  47.9
#> # ... with 410 more rows
cv_blup(df, env = ENV, gen = GEN, rep = REP, resp = GY, nboot = 10)
#>   ENV GEN n
#> 1  E1  G1 2
#> Error: Combinations of genotype and environment with different number of replication than observed in the trial (3)

 # An unbalanced data
df2 <- data_ge
df2[1:3, 4] <- NA
df2
#> # A tibble: 420 x 5
#>    ENV   GEN   REP      GY    HM
#>    <fct> <fct> <fct> <dbl> <dbl>
#>  1 E1    G1    1     NA     44.9
#>  2 E1    G1    2     NA     46.9
#>  3 E1    G1    3     NA     47.8
#>  4 E1    G2    1      3.21  45.2
#>  5 E1    G2    2      2.93  45.3
#>  6 E1    G2    3      2.56  45.5
#>  7 E1    G3    1      2.77  46.7
#>  8 E1    G3    2      3.62  43.2
#>  9 E1    G3    3      2.28  47.8
#> 10 E1    G4    1      2.36  47.9
#> # ... with 410 more rows
cv_blup(df2, env = ENV, gen = GEN, rep = REP, resp = GY, nboot = 10)
#> $RMSPD
#>          MODEL     RMSPD
#> 1  BLUP_g_RCBD 0.3552628
#> 2  BLUP_g_RCBD 0.3763954
#> 3  BLUP_g_RCBD 0.4266109
#> 4  BLUP_g_RCBD 0.4532448
#> 5  BLUP_g_RCBD 0.4043090
#> 6  BLUP_g_RCBD 0.4221250
#> 7  BLUP_g_RCBD 0.4243280
#> 8  BLUP_g_RCBD 0.3874620
#> 9  BLUP_g_RCBD 0.3697422
#> 10 BLUP_g_RCBD 0.3845615
#> 
#> $RMSPDmean
#> # A tibble: 1 x 6
#>   MODEL        mean     sd      se  Q2.5 Q97.5
#>   <chr>       <dbl>  <dbl>   <dbl> <dbl> <dbl>
#> 1 BLUP_g_RCBD 0.400 0.0308 0.00972 0.359 0.447
#> 
#> attr(,"class")
#> [1] "cvalidation"

Created on 2020-04-07 by the reprex package (v0.3.0)

You can use both unbalanced and incomplete data with waasb() and gamem_met(). Since we don't need a validation set, the function will simply ignore the missing data

library(metan)
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
#> []=====================================================[]
#> [] Multi-Environment Trial Analysis (metan) v1.4.0.9000[]
#> [] Author: Tiago Olivoto                               []
#> [] Type citation('metan') to know how to cite metan    []
#> [] Type vignette('metan_start') for a short tutorial   []
#> [] Visit http://bit.ly/2TIq6JE for a complete tutorial []
#> []=====================================================[]
df <- data_ge
df[1, 4] <- NA
df
#> # A tibble: 420 x 5
#>    ENV   GEN   REP      GY    HM
#>    <fct> <fct> <fct> <dbl> <dbl>
#>  1 E1    G1    1     NA     44.9
#>  2 E1    G1    2      2.50  46.9
#>  3 E1    G1    3      2.43  47.8
#>  4 E1    G2    1      3.21  45.2
#>  5 E1    G2    2      2.93  45.3
#>  6 E1    G2    3      2.56  45.5
#>  7 E1    G3    1      2.77  46.7
#>  8 E1    G3    2      3.62  43.2
#>  9 E1    G3    3      2.28  47.8
#> 10 E1    G4    1      2.36  47.9
#> # ... with 410 more rows
mod <- waasb(df, ENV, GEN, REP, GY)
#> Warning: Row(s) 1 with NA values deleted.
#> Method: REML/BLUP
#> Random effects: GEN, GEN:ENV
#> Fixed effects: ENV, REP(ENV)
#> Denominador DF: Satterthwaite's method
#> ---------------------------------------------------------------------------
#> P-values for Likelihood Ratio Test of the analyzed traits
#> ---------------------------------------------------------------------------
#>     model       GY
#>  COMPLETE       NA
#>       GEN 1.19e-05
#>   GEN:ENV 2.02e-11
#> ---------------------------------------------------------------------------
#> All variables with significant (p < 0.05) genotype-vs-environment interaction
gmd(mod)
#> Class of the model: waasb
#> Variable extracted: genpar
#> # A tibble: 9 x 2
#>   Parameters                GY
#>   <chr>                  <dbl>
#> 1 Phenotypic variance    0.181
#> 2 Heritability           0.154
#> 3 GEIr2                  0.313
#> 4 Heribatility of means  0.814
#> 5 Accuracy               0.902
#> 6 rge                    0.370
#> 7 CVg                    6.24 
#> 8 CVr                   11.6  
#> 9 CV ratio               0.537

Created on 2020-04-07 by the reprex package (v0.3.0)

billh2006 commented 4 years ago

Thank you!