cafferychen777 / ggpicrust2

Make Picrust2 Output Analysis and Visualization Easier
https://cafferychen777.github.io/ggpicrust2/
Other
92 stars 11 forks source link

pathway_annotation(): Error in $<-.data.frame(*tmp*, "pathway_description" #22

Closed kaw97 closed 1 year ago

kaw97 commented 1 year ago

Hi- thanks a ton for developing ggpicrust2.

I encounter the following error both when I use the full pipeline command, ggpicrust2() and the manual step-by-step mode suggested in case of errors. It looks like the issue is that there is a line break in the KEGG description for Ko04910.

https://www.genome.jp/dbget-bin/www_bget?ko04910

Is there a straightforward fix I can make locally? I'm fairly proficient in python but have very little experience with R.

We are connecting to the KEGG database to get the latest results, please wait patiently. Error in $<-.data.frame(*tmp*, "pathway_description", value = c("Insulin binding to its receptor results in the tyrosine phosphorylation of insulin receptor substrates (IRS) by the insulin receptor tyrosine kinase (INSR). This allows association of IRSs with the regulatory subunit of phosphoinositide 3-kinase (PI3K). PI3K activates 3-phosphoinositide-dependent protein kinase 1 (PDK1), which activates Akt, a serine kinase. Akt in turn deactivates glycogen synthase kinase 3 (GSK-3), leading to activation of glycogen synthase (GYS) and thus glycogen synthesis. Activation of Akt also results in the translocation of GLUT4 vesicles from their intracellular pool to the plasma membrane, where they allow uptake of glucose into the cell. Akt also leads to mTOR-mediated activation of protein synthesis by eIF4 and p70S6K. The translocation of GLUT4 protein is also elicited through the CAP/Cbl/TC10 pathway, once Cbl is phosphorylated by INSR.", : replacement has 2 rows, data has 1

cafferychen777 commented 1 year ago

Hi,

Thank you for your email and for using ggpicrust2. I'm sorry to hear that you've encountered an error while using the pipeline command and the manual step-by-step mode.

To address the issue with the KEGG description import, you can update to the latest version of ggpicrust2 using the following command:

devtools::install_github("cafferychen777/ggpicrust2")

This should resolve the error message you've received about the replacement having 2 rows and the data having only 1.

Please let me know if you have any further questions or concerns.

Best regards,

kaw97 @.***>于2023年5月1日 周一06:07写道:

Hi- thanks a ton for developing ggpicrust2.

I encounter the following error both when I use the full pipeline command, ggpicrust2() and the manual step-by-step mode suggested in case of errors. Is there anything I can change locally to get it to import the KEGG description correctly?

We are connecting to the KEGG database to get the latest results, please wait patiently. Error in $<-.data.frame(tmp, "pathway_description", value = c("Insulin binding to its receptor results in the tyrosine phosphorylation of insulin receptor substrates (IRS) by the insulin receptor tyrosine kinase (INSR). This allows association of IRSs with the regulatory subunit of phosphoinositide 3-kinase (PI3K). PI3K activates 3-phosphoinositide-dependent protein kinase 1 (PDK1), which activates Akt, a serine kinase. Akt in turn deactivates glycogen synthase kinase 3 (GSK-3), leading to activation of glycogen synthase (GYS) and thus glycogen synthesis. Activation of Akt also results in the translocation of GLUT4 vesicles from their intracellular pool to the plasma membrane, where they allow uptake of glucose into the cell. Akt also leads to mTOR-mediated activation of protein synthesis by eIF4 and p70S6K. The translocation of GLUT4 protein is also elicited through the CAP/Cbl/TC10 pathway, once Cbl is phosphorylated by INSR.", : replacement has 2 rows, data has 1

— Reply to this email directly, view it on GitHub https://github.com/cafferychen777/ggpicrust2/issues/22, or unsubscribe https://github.com/notifications/unsubscribe-auth/ATZEQTSF27LAYRNNLBCY5PTXD3PALANCNFSM6AAAAAAXRF6NEU . You are receiving this because you are subscribed to this thread.Message ID: @.***>

kaw97 commented 1 year ago

Thank you very much! The annotations have been correctly added to daa_annotated_sub_method_results_df. However, I am still getting the following errors. It seems like the issue is most likely with xtfrm.

We are connecting to the KEGG database to get the latest results, please wait patiently. Error in xtfrm.data.frame(x) : cannot xtfrm data frames In addition: Warning message: In MicrobiomeStat::linda(abundance, LinDA_metadata_df, formula = "~Group_groupnonsense", : Some features have less than 3 nonzero values! They have virtually no statistical power. You may consider filtering them in the analysis!

View(daa_results_list) Error: Index out of bounds

cafferychen777 commented 1 year ago

Would you like to share your code? Cause if there is no code, it will be difficult for me to figure out the problem.

kaw97 commented 1 year ago

Absolutely. There was a lot of interactive stuff, so I made a new session and reran the commands.

library(readr) library(ggpicrust2) library(tibble) library(tidyverse) library(ggprism) library(patchwork)

FILE <- "C:/Users/kaw97/Documents/AEKNE/16s/analysis/picrust/pred_metagenome_KO_unstrat.tsv"

group = "Cultivation" metadata <-

  • read_delim(
  • "C:/Users/kaw97/Documents/AEKNE/16s/analysis/picrust/metadata2.tsv",
  • delim = "\t",
  • escape_double = FALSE,
  • trim_ws = TRUE)

That results in the following output:

Rows: 2500 Columns: 11
── Column specification ──────────────────────────────────────────────── Delimiter: "\t" chr (1): function dbl (10): 10_16s, 1_16s, 2_16s, 3_16s, 4_16s, 5_16s, 6_16s, 7_16s, 8...

ℹ Use spec() to retrieve the full column specification for this data. ℹ Specify the column types or set show_col_types = FALSE to quiet this message. Calculation may take a long time, please be patient. The kegg pathway with zero abundance in all the different samples has been removed. 0 features are filtered! The filtered data has 10 samples and 186 features will be tested! Pseudo-count approach is used. Fit linear models ... Completed. We are connecting to the KEGG database to get the latest results, please wait patiently. Registered S3 method overwritten by 'httr': method from
print.response rmutil Error in xtfrm.data.frame(x) : cannot xtfrm data frames In addition: Warning message: In MicrobiomeStat::linda(abundance, LinDA_metadata_df, formula = "~Group_groupnonsense", : Some features have less than 3 nonzero values! They have virtually no statistical power. You may consider filtering them in the analysis!

Here is my sessioninfo output:

R version 4.3.0 (2023-04-21 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.utf8 [2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 [4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/Chicago tzcode source: internal

attached base packages: [1] stats graphics grDevices utils datasets methods
[7] base

other attached packages: [1] patchwork_1.1.2 ggprism_1.0.4 lubridate_1.9.2 forcats_1.0.0
[5] stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1 tidyr_1.3.0
[9] ggplot2_3.4.2 tidyverse_2.0.0 tibble_3.2.1 ggpicrust2_1.6.2 [13] readr_2.1.4

loaded via a namespace (and not attached): [1] bitops_1.0-7 permute_0.9-7
[3] rlang_1.1.1 magrittr_2.0.3
[5] clue_0.3-64 ade4_1.7-22
[7] matrixStats_0.63.0 compiler_4.3.0
[9] mgcv_1.8-42 png_0.1-8
[11] vctrs_0.6.2 reshape2_1.4.4
[13] rmutil_1.1.10 pkgconfig_2.0.3
[15] crayon_1.5.2 XVector_0.40.0
[17] utf8_1.2.3 modeest_2.4.0
[19] tzdb_0.3.0 nloptr_2.0.3
[21] bit_4.0.5 zlibbioc_1.46.0
[23] GenomeInfoDb_1.36.0 jsonlite_1.8.4
[25] biomformat_1.28.0 rhdf5filters_1.12.0
[27] Rhdf5lib_1.22.0 parallel_4.3.0
[29] cluster_2.1.4 R6_2.5.1
[31] stringi_1.7.12 MicrobiomeStat_1.1
[33] boot_1.3-28.1 rpart_4.1.19
[35] numDeriv_2016.8-1.1 Rcpp_1.0.10
[37] iterators_1.0.14 IRanges_2.34.0
[39] Matrix_1.5-4 splines_4.3.0
[41] igraph_1.4.2 timechange_0.2.0
[43] tidyselect_1.2.0 rstudioapi_0.14
[45] vegan_2.6-4 timeDate_4022.108
[47] codetools_0.2-19 curl_5.0.0
[49] lattice_0.21-8 lmerTest_3.1-3
[51] plyr_1.8.8 KEGGREST_1.40.0
[53] Biobase_2.60.0 withr_2.5.0
[55] stable_1.1.6 survival_3.5-5
[57] Biostrings_2.68.0 pillar_1.9.0
[59] phyloseq_1.44.0 foreach_1.5.2
[61] stats4_4.3.0 generics_0.1.3
[63] vroom_1.6.3 RCurl_1.98-1.12
[65] S4Vectors_0.38.0 hms_1.1.3
[67] munsell_0.5.0 scales_1.2.1
[69] minqa_1.2.5 timeSeries_4021.105
[71] glue_1.6.2 statip_0.2.3
[73] tools_4.3.0 data.table_1.14.8
[75] lme4_1.1-33 spatial_7.3-16
[77] fBasics_4022.94 rhdf5_2.44.0
[79] grid_4.3.0 ape_5.7-1
[81] colorspace_2.1-0 nlme_3.1-162
[83] GenomeInfoDbData_1.2.10 cli_3.6.1
[85] fansi_1.0.4 gtable_0.3.3
[87] stabledist_0.7-1 digest_0.6.31
[89] BiocGenerics_0.46.0 ggrepel_0.9.3
[91] multtest_2.56.0 lifecycle_1.0.3
[93] httr_1.4.5 statmod_1.5.0
[95] bit64_4.0.5 MASS_7.3-58.4

kaw97 commented 1 year ago

Oops, I just realized I didn't include the ggpicrust2 command.

Here it is. This is what leads to the error message in my last post.

In short: To reproduce the error, I started a new session. Then I ran the import commands, set the identity of the file, and imported the metadata tsv. Then I ran the the following command:

> daa_results_list <-
+     ggpicrust2(
+         file = FILE,
+         metadata = metadata,
+         group = group,
+         pathway = "KO",
+         daa_method = "LinDA",
+         p_values_bar = TRUE,
+         p.adjust = "BH",
+         ko_to_kegg = TRUE,
+         order = "pathway_class",
+         select = NULL,
+         reference = NULL # If your metadata[,group] has more than two levels, please specify a reference.
+     )

Here is the full text of the rstudio console:

> library(readr)
> library(ggpicrust2)
> library(tibble)
> library(tidyverse)
── Attaching core tidyverse packages ──────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ purrr     1.0.1
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tidyr     1.3.0
✔ lubridate 1.9.2     
── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
> library(ggprism)
> library(patchwork)
> FILE <- "C:/Users/kaw97/Documents/AEKNE/16s/analysis/picrust/pred_metagenome_KO_unstrat.tsv"
> group = "Cultivation"
> metadata <-
+   read_delim(
+     "C:/Users/kaw97/Documents/AEKNE/16s/analysis/picrust/metadata2.tsv",
+     delim = "\t",
+     escape_double = FALSE,
+     trim_ws = TRUE)
Rows: 10 Columns: 4                                                                                        
── Column specification ────────────────────────────────────────────────
Delimiter: "\t"
chr (4): #SampleID, Cultivation, Strain, Site

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> daa_results_list <-
+     ggpicrust2(
+         file = FILE,
+         metadata = metadata,
+         group = group,
+         pathway = "KO",
+         daa_method = "LinDA",
+         p_values_bar = TRUE,
+         p.adjust = "BH",
+         ko_to_kegg = TRUE,
+         order = "pathway_class",
+         select = NULL,
+         reference = NULL # If your metadata[,group] has more than two levels, please specify a reference.
+     )
Rows: 2500 Columns: 11                                                                                     
── Column specification ────────────────────────────────────────────────
Delimiter: "\t"
chr  (1): function
dbl (10): 10_16s, 1_16s, 2_16s, 3_16s, 4_16s, 5_16s, 6_16s, 7_16s, 8...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Calculation may take a long time, please be patient.
The kegg pathway with zero abundance in all the different samples has been removed.
0  features are filtered!
The filtered data has  10  samples and  186  features will be tested!
Pseudo-count approach is used.
Fit linear models ...
Completed.
We are connecting to the KEGG database to get the latest results, please wait patiently.
Registered S3 method overwritten by 'httr':
  method         from  
  print.response rmutil
Error in xtfrm.data.frame(x) : cannot xtfrm data frames
In addition: Warning message:
In MicrobiomeStat::linda(abundance, LinDA_metadata_df, formula = "~Group_group_nonsense_",  :
  Some features have less than 3 nonzero values! 
                        They have virtually no statistical power. You may consider filtering them in the analysis!

> (.packages())
 [1] "patchwork"  "ggprism"    "lubridate"  "forcats"    "stringr"   
 [6] "dplyr"      "purrr"      "tidyr"      "ggplot2"    "tidyverse" 
[11] "tibble"     "ggpicrust2" "readr"      "stats"      "graphics"  
[16] "grDevices"  "utils"      "datasets"   "methods"    "base"      
> 
> packages <- (.packages())
> packages
 [1] "patchwork"  "ggprism"    "lubridate"  "forcats"    "stringr"   
 [6] "dplyr"      "purrr"      "tidyr"      "ggplot2"    "tidyverse" 
[11] "tibble"     "ggpicrust2" "readr"      "stats"      "graphics"  
[16] "grDevices"  "utils"      "datasets"   "methods"    "base"      
> packageVersion(packages)
Error in if (pkg == "base") file.path(.Library, "base") else if (isNamespaceLoaded(pkg)) getNamespaceInfo(pkg,  : 
  the condition has length > 1
> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] patchwork_1.1.2  ggprism_1.0.4    lubridate_1.9.2  forcats_1.0.0   
 [5] stringr_1.5.0    dplyr_1.1.2      purrr_1.0.1      tidyr_1.3.0     
 [9] ggplot2_3.4.2    tidyverse_2.0.0  tibble_3.2.1     ggpicrust2_1.6.2
[13] readr_2.1.4     

loaded via a namespace (and not attached):
 [1] bitops_1.0-7            permute_0.9-7          
 [3] rlang_1.1.1             magrittr_2.0.3         
 [5] clue_0.3-64             ade4_1.7-22            
 [7] matrixStats_0.63.0      compiler_4.3.0         
 [9] mgcv_1.8-42             png_0.1-8              
[11] vctrs_0.6.2             reshape2_1.4.4         
[13] rmutil_1.1.10           pkgconfig_2.0.3        
[15] crayon_1.5.2            XVector_0.40.0         
[17] utf8_1.2.3              modeest_2.4.0          
[19] tzdb_0.3.0              nloptr_2.0.3           
[21] bit_4.0.5               zlibbioc_1.46.0        
[23] GenomeInfoDb_1.36.0     jsonlite_1.8.4         
[25] biomformat_1.28.0       rhdf5filters_1.12.0    
[27] Rhdf5lib_1.22.0         parallel_4.3.0         
[29] cluster_2.1.4           R6_2.5.1               
[31] stringi_1.7.12          MicrobiomeStat_1.1     
[33] boot_1.3-28.1           rpart_4.1.19           
[35] numDeriv_2016.8-1.1     Rcpp_1.0.10            
[37] iterators_1.0.14        IRanges_2.34.0         
[39] Matrix_1.5-4            splines_4.3.0          
[41] igraph_1.4.2            timechange_0.2.0       
[43] tidyselect_1.2.0        rstudioapi_0.14        
[45] vegan_2.6-4             timeDate_4022.108      
[47] codetools_0.2-19        curl_5.0.0             
[49] lattice_0.21-8          lmerTest_3.1-3         
[51] plyr_1.8.8              KEGGREST_1.40.0        
[53] Biobase_2.60.0          withr_2.5.0            
[55] stable_1.1.6            survival_3.5-5         
[57] Biostrings_2.68.0       pillar_1.9.0           
[59] phyloseq_1.44.0         foreach_1.5.2          
[61] stats4_4.3.0            generics_0.1.3         
[63] vroom_1.6.3             RCurl_1.98-1.12        
[65] S4Vectors_0.38.0        hms_1.1.3              
[67] munsell_0.5.0           scales_1.2.1           
[69] minqa_1.2.5             timeSeries_4021.105    
[71] glue_1.6.2              statip_0.2.3           
[73] tools_4.3.0             data.table_1.14.8      
[75] lme4_1.1-33             spatial_7.3-16         
[77] fBasics_4022.94         rhdf5_2.44.0           
[79] grid_4.3.0              ape_5.7-1              
[81] colorspace_2.1-0        nlme_3.1-162           
[83] GenomeInfoDbData_1.2.10 cli_3.6.1              
[85] fansi_1.0.4             gtable_0.3.3           
[87] stabledist_0.7-1        digest_0.6.31          
[89] BiocGenerics_0.46.0     ggrepel_0.9.3          
[91] multtest_2.56.0         lifecycle_1.0.3        
[93] httr_1.4.5              statmod_1.5.0          
[95] bit64_4.0.5             MASS_7.3-58.4 
cafferychen777 commented 1 year ago

Hello @kaw97,

Would you like to try the another workflow in the tutorial? It's a better workflow for debug. And I can know the location in which function encounters errors.

Best regards,

cafferychen777 commented 1 year ago

And compared to sharing original data sources, sharing intermediate data is safer for researchers.

kaw97 commented 1 year ago

Oops, sorry - I misunderstood. Here is the full output of the longer workflow:

> FILE <- "C:/Users/kaw97/Documents/AEKNE/16s/analysis/picrust/pred_metagenome_KO_unstrat.tsv"
> group = "Cultivation"
> metadata <-
+   read_delim(
+     "C:/Users/kaw97/Documents/AEKNE/16s/analysis/picrust/metadata2.tsv",
+     delim = "\t",
+     escape_double = FALSE,
+     trim_ws = TRUE)
Rows: 10 Columns: 4                                                                                        
── Column specification ────────────────────────────────────────────────
Delimiter: "\t"
chr (4): #SampleID, Cultivation, Strain, Site

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> kegg_abundance <- ko2kegg_abundance(FILE)
Rows: 2500 Columns: 11                                                                                     
── Column specification ────────────────────────────────────────────────
Delimiter: "\t"
chr  (1): function
dbl (10): 10_16s, 1_16s, 2_16s, 3_16s, 4_16s, 5_16s, 6_16s, 7_16s, 8...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Calculation may take a long time, please be patient.
The kegg pathway with zero abundance in all the different samples has been removed.
> daa_results_df <-
+     pathway_daa(
+         abundance = kegg_abundance,
+         metadata = metadata,
+         group = group,
+         daa_method = "LinDA",
+         select = NULL,
+         reference = NULL
+     )
0  features are filtered!
The filtered data has  10  samples and  186  features will be tested!
Pseudo-count approach is used.
Fit linear models ...
Completed.
Warning message:
In MicrobiomeStat::linda(abundance, LinDA_metadata_df, formula = "~Group_group_nonsense_",  :
  Some features have less than 3 nonzero values! 
                        They have virtually no statistical power. You may consider filtering them in the analysis!
> daa_sub_method_results_df <-
+     daa_results_df[daa_results_df$method == "LinDA", ]
> daa_annotated_sub_method_results_df <-
+     pathway_annotation(pathway = "KO",
+                        daa_results_df = daa_sub_method_results_df,
+                        ko_to_kegg = TRUE)
We are connecting to the KEGG database to get the latest results, please wait patiently.
>daa_results_list <-
+     pathway_errorbar(
+         abundance = kegg_abundance,
+         daa_results_df = daa_annotated_sub_method_results_df,
+         Group = Group,
+         p_values_threshold = 0.05,
+         order = "pathway_class",
+         select = NULL,
+         ko_to_kegg = TRUE,
+         p_value_bar = TRUE,
+         colors = NULL,
+         x_lab = "pathway_name"
+     )
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
> View(daa_results_list)
Error: Index out of bounds

Here is a link to most of the intermediate files:

https://drive.google.com/drive/folders/1KotMUjblMSbjVzVEfYGY1ORUeK-ZaL8n?usp=sharing

cafferychen777 commented 1 year ago
截屏2023-05-07 20 56 45

Hi @kaw97 ,

Thank you for your message. I have identified the issue in your code. Specifically, in the pathway_errorbar function, the Group argument should be a vector rather than a character.

library(readr)
library(ggpicrust2)
library(tidyverse)
library(patchwork)
library(ggprism)

daa_results_df_annotated <-
  read.delim(
    "/Users/apple/Microbiome/ggpicrust2总/ggpicrust2测试/ggpicrust2_test/Kaw97/daa_annotated_sub_method_results.tsv"
  )

kegg_abundance <-
  read.delim(
    "/Users/apple/Microbiome/ggpicrust2总/ggpicrust2测试/ggpicrust2_test/Kaw97/kegg_abundance.tsv"
  )

daa_results <- read.delim("~/Microbiome/ggpicrust2总/ggpicrust2测试/ggpicrust2_test/Kaw97/daa_results.tsv")

rownames(kegg_abundance) <- daa_results$feature

metadata <-
  read_delim(
    "/Users/apple/Microbiome/ggpicrust2总/ggpicrust2测试/ggpicrust2_test/Kaw97/metadata2.tsv",
    delim = "\t",
    escape_double = FALSE,
    trim_ws = TRUE
  )

p <- pathway_errorbar(
  abundance = kegg_abundance,
  daa_results_df = daa_results_df_annotated,
  Group = metadata$Cultivation,
  p_values_threshold = 0.05,
  order = "pathway_class",
  select = NULL,
  ko_to_kegg = TRUE,
  p_value_bar = TRUE,
  colors = NULL,
  x_lab = "pathway_name"
)

I hope this information is helpful to you. Please let me know if you have any further questions or concerns.

Best regards.