BDI-pathogens / phyloscanner

Phylogenetics between and within hosts at once, all along the genome.
GNU General Public License v3.0
47 stars 14 forks source link

summary_statistics.R Error: All columns in a tibble must be 1d or 2d objects: * Column `tree.id` is NULL #44

Closed magosil86 closed 5 years ago

magosil86 commented 5 years ago

Hi Matthew,

I get an error with phyloscanner's summary_statistics.R script, it looks like the error comes from a call to the function: calc.all.stats.in.window, have I missed something here, any suggestions? Please note that I intend to switch to just phyloscannerR going forward, I just wanted to complete the phyloscanner.R.utilities work flow first (i.e. summary_statistics.R and TransmissionSummary.R scripts). Attached is some data to help reproduce the problem pty_19-04-28-19-03-59.zip

Rscript ~/phyloscanner-master/tools/summary_statistics.R --scriptDir ~/phyloscanner-master/deprecated pty_19-04-28-19-03-59/ptyr5_patients.txt pty_19-04-28-19-03-59/ProcessedTree_s_ptyr5_InWindow_ pty_19-04-28-19-03-59/subgraphs_s_ptyr5_InWindow_ pty_19-04-28-19-03-59/ptyr5_ --tipRegex '^(.*)-fq[0-9]+_read_([0-9]+)_count_([0-9]+)$' --blacklists pty_19-04-28-19-03-59/ptyr5_blacklistdwns_InWindow_ --verbose

Error: All columns in a tibble must be 1d or 2d objects:
* Column `tree.id` is NULL
Backtrace:
    █
 1. └─base::lapply(...)
 2.   └─global::FUN(X[[i]], ...)
 3.     └─phyloscannerR::calc.all.stats.in.window(...)
 4.       └─tibble::tibble(...)
 5.         └─tibble:::lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
 6.           └─tibble:::check_valid_cols(x)
Execution halted

Session info

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] kimisc_0.4          argparse_2.0.1      data.table_1.12.0  
 [4] dtplyr_0.0.3        dplyr_0.8.0.1       scales_1.0.0       
 [7] RColorBrewer_1.1-2  gridExtra_2.3       gtable_0.2.0       
[10] reshape_0.8.8       ggtree_1.14.6       ggplot2_3.1.0      
[13] phytools_0.6-60     maps_3.3.0          phyloscannerR_1.7.1
[16] ape_5.3            

loaded via a namespace (and not attached):
 [1] viridis_0.5.1           httr_1.4.0              tidyr_0.8.3            
 [4] jsonlite_1.6            viridisLite_0.3.0       network_1.14-377       
 [7] modelr_0.1.4            assertthat_0.2.1        expm_0.999-3           
[10] rvcheck_0.1.3           animation_2.6           cellranger_1.1.0       
[13] numDeriv_2016.8-1       pillar_1.3.1            backports_1.1.3        
[16] lattice_0.20-38         glue_1.3.1              quadprog_1.5-5         
[19] phangorn_2.4.0          digest_0.6.18           rvest_0.3.2            
[22] colorspace_1.4-1        Matrix_1.2-16           plyr_1.8.4             
[25] pkgconfig_2.0.2         broom_0.5.1             haven_2.1.0            
[28] purrr_0.3.2             tidytree_0.2.4          ff_2.2-14              
[31] tibble_2.1.1            combinat_0.0-8          generics_0.0.2         
[34] withr_2.1.2             tidyverse_1.2.1         lazyeval_0.2.2         
[37] mnormt_1.5-5            magrittr_1.5            crayon_1.3.4           
[40] readxl_1.3.1            memoise_1.1.0           GGally_1.4.0           
[43] nlme_3.1-137            MASS_7.3-51.1           forcats_0.4.0          
[46] xml2_1.2.0              tools_3.5.2             hms_0.4.2              
[49] extraDistr_1.8.10       stringr_1.4.0           findpython_1.0.5       
[52] munsell_0.5.0           plotrix_3.7-4           compiler_3.5.2         
[55] clusterGeneration_1.3.4 rlang_0.3.2             igraph_1.2.4           
[58] reshape2_1.4.3          R6_2.4.0                lubridate_1.7.4        
[61] bit_1.1-14              fastmatch_1.1-0         treeio_1.6.2           
[64] readr_1.3.1             stringi_1.4.3           parallel_3.5.2         
[67] Rcpp_1.0.1              scatterplot3d_0.3-41    tidyselect_0.2.5       
[70] coda_0.19-2            
mdhall272 commented 5 years ago

I don't think that command line can be right? --scriptDir as an argument hasn't worked since before phyloscannerR existed!

Nevertheless, I've fixed summary_statistics.R. Those scripts in /tools/ really are held together with sellotape at this point, however, and this experience has convinced me they should go ASAP.

magosil86 commented 5 years ago

Many thanks for the fix! Before you do away with the old scripts, quick question, the ptyr5_patStatsFull.csv produced by summary_statistics.R does not include file.suffix and id fields which are required by TransmissionSummary.R is there a possible work around or fix for this?

mdhall272 commented 5 years ago

Could you attach an example of that file? And do you definitely mean TransmissionSummary.R not transmission_summary.R?

magosil86 commented 5 years ago

Yeah, I wondered too why the Phyloscanner.R.utilities workflow specified TransmissionSummary.R rather than transmission_summary.R as the former is depracated, here is the error and I have attached the offending file pty_19-04-28-19-03-59.zip

Rscript ~/phyloscanner-master/deprecated/TransmissionSummary.R pty_19-04-28-19-03-59/ptyr5_patients.txt pty_19-04-28-19-03-59/ptyr5_classification_InWindow_ pty_19-04-28-19-03-59/ptyr5_patStatsFull.csv --minThreshold 1 --detailedOutput pty_19-04-28-19-03-59/ptyr5_trmStatsPerWindow.rda --allowMultiTrans --verbose
Getting window counts per patient from pty_19-04-28-19-03-59/ptyr5_patStatsFull.csv...
Error: The following required columns are missing from the summary csv file: file.suffix, id. Quitting.
Execution halted
mdhall272 commented 5 years ago

That's just because it's old. We're working on it!

If I remember rightly it should work if you change the column names; host.id to id and tree.id to file.suffix.

magosil86 commented 5 years ago

Ah, I see. Renaming the host.id and tree.id fields helps but it now reports:

Error in which(path.classification == "anc") : 
  object 'path.classification' not found
Calls: lapply ... FUN -> set -> [ -> [.data.table -> eval -> eval -> which
Execution halted
mdhall272 commented 5 years ago

Could you see what happens if you use transmission_summary.R instead? That at least should be current and working.

magosil86 commented 5 years ago

Sure, here is the output


Rscript ~/phyloscanner-master/tools/transmission_summary.R pty_19-04-28-19-03-59/ptyr5_classification_InWindow_ pty_19-04-28-19-03-59/ptyr5_trmStats.csv --minThreshold 1 --allowMultiTrans --verbose

Error: All columns in a tibble must be 1d or 2d objects:
* Column `tree.id` is NULL
Backtrace:
     █
  1. └─phyloscannerR::summarise.classifications(...)
  2.   └─phyloscannerR::merge.classifications(ptrees, allow.mt, verbose)
  3.     └─`%>%`(...)
  4.       ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
  5.       └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  6.         └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  7.           └─phyloscannerR:::`_fseq`(`_lhs`)
  8.             └─magrittr::freduce(value, `_function_list`)
  9.               ├─base::withVisible(function_list[[k]](value))
 10.               └─function_list[[k]](value)
 11.                 └─purrr::map(...)
 12.                   └─phyloscannerR:::.f(.x[[i]], ...)
 13.                     └─tt %>% add_column(
Execution halted
mdhall272 commented 5 years ago

Was that with the renamed columns? Un-rename them if so.

magosil86 commented 5 years ago

It was with the original names: host id and tree.id, that said, I just tried it the other way around with id and file.suffix and it gives the same error:

Error: All columns in a tibble must be 1d or 2d objects:
* Column `tree.id` is NULL
Backtrace:
     █
  1. └─phyloscannerR::summarise.classifications(...)
  2.   └─phyloscannerR::merge.classifications(ptrees, allow.mt, verbose)
  3.     └─`%>%`(...)
  4.       ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
  5.       └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  6.         └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  7.           └─phyloscannerR:::`_fseq`(`_lhs`)
  8.             └─magrittr::freduce(value, `_function_list`)
  9.               ├─base::withVisible(function_list[[k]](value))
 10.               └─function_list[[k]](value)
 11.                 └─purrr::map(...)
 12.                   └─phyloscannerR:::.f(.x[[i]], ...)
 13.                     └─tt %>% add_column(
Execution halted
mdhall272 commented 5 years ago

The classification file ptyr5_classification_InWindow_1525_to_1774.csv is empty. What are the earlier commands you used?

magosil86 commented 5 years ago

Here are the earlier commands as outlined in Phyloscanner.R.utilities:

Rscript ~/phyloscanner-master/tools/parsimony_based_blacklister.R 20 0 20 "pty_19-04-28-19-03-59/ptyr5_InWindow_" "pty_19-04-28-19-03-59/ptyr5_blacklistsank_InWindow" --dualsOutputFile "pty_19-04-28-19-03-59/ptyr5_duallistsank_InWindow" --outgroupName C.BW.00.00BW07621.AF443088 --tipRegex "^(.*)-fq[0-9]+_read_([0-9]+)_count_([0-9]+)$" --multifurcationThreshold 1e-05  --branchLengthNormalisation "pty_19-04-28-19-03-59/ptyr5_normconst.csv" --verbose
Rscript ~/phyloscanner-master/tools/downsample_reads.R 50 pty_19-04-28-19-03-59/ptyr5_ pty_19-04-28-19-03-59/ptyr5_blacklistdwns_ --blacklist pty_19-04-28-19-03-59/ptyr5_blacklistsank_ --tipRegex "^(.*)-fq[0-9]+_read_([0-9]+)_count_([0-9]+)$" --seed 42 --verbose
Rscript ~/phyloscanner-master/tools/split_hosts_to_subgraphs.R "pty_19-04-28-19-03-59/ptyr5_" "ptyr5" --blacklist "pty_19-04-28-19-03-59/ptyr5_blacklistdwns_" --outputdir "pty_19-04-28-19-03-59" --idFile "pty_19-04-28-19-03-59/ptyr5_patients.txt" --outgroupName C.BW.00.00BW07621.AF443088 --splitsRule s --kParam 20 --proximityThreshold 0 --readCountsMatterOnZeroBranches --tipRegex "^(.*)-fq[0-9]+_read_([0-9]+)_count_([0-9]+)$" --multifurcationThreshold 1e-05 --branchLengthNormalisation "pty_19-04-28-19-03-59/ptyr5_normconst.csv" --outputAsRDA --pdfwidth 30 --pdfrelheight 0.15 --verbose
Rscript ~/phyloscanner-master/tools/classify_relationships.R "pty_19-04-28-19-03-59/ProcessedTree_s_ptyr5_" "pty_19-04-28-19-03-59/subgraphs_s_ptyr5_" "pty_19-04-28-19-03-59/ptyr5" --branchLengthNormalisation "pty_19-04-28-19-03-59/ptyr5_normconst.csv" --verbose
Rscript ~/phyloscanner-master/tools/summary_statistics.R "pty_19-04-28-19-03-59/ptyr5_patients.txt" "pty_19-04-28-19-03-59/ProcessedTree_s_ptyr5_InWindow_" "pty_19-04-28-19-03-59/subgraphs_s_ptyr5_InWindow_" "pty_19-04-28-19-03-59/ptyr5_" --tipRegex "^(.*)-fq[0-9]+_read_([0-9]+)_count_([0-9]+)$" --blacklists "pty_19-04-28-19-03-59/ptyr5_blacklistdwns_InWindow_" --verbose
Rscript ~/phyloscanner-master/tools/transmission_summary.R "pty_19-04-28-19-03-59/ptyr5_classification_InWindow_" "pty_19-04-28-19-03-59/ptyr5_trmStats.csv" --minThreshold 1 --allowMultiTrans --verbose

Additionally, here is the output from classify_relationships.R:

+ Rscript ~/phyloscanner-master/tools/classify_relationships.R pty_19-04-28-19-03-59/ProcessedTree_s_ptyr5_ pty_19-04-28-19-03-59/subgraphs_s_ptyr5_ pty_19-04-28-19-03-59/ptyr5 --branchLengthNormalisation pty_19-04-28-19-03-59/ptyr5_normconst.csv --verbose
Reading tree file pty_19-04-28-19-03-59/ProcessedTree_s_ptyr5_InWindow_1525_to_1774.tree...
Reading annotations...
Reading splits file pty_19-04-28-19-03-59/subgraphs_s_ptyr5_InWindow_1525_to_1774.csv ...
Parsed with column specification:
cols(
  host = col_character(),
  subgraph = col_character(),
  tip = col_character()
)
Collecting tips for each host...
Collapsing subgraphs...
Identifying pairs of unblocked splits...
  |==================================================| 100%
Calculating pairwise distances between splits...
  |==================================================| 100%
Testing pairs...
  |==================================================| 100%
Writing to file pty_19-04-28-19-03-59/ptyr5_classification_InWindow_1525_to_1774.csv ...
mdhall272 commented 5 years ago

It's not especially clear from the console output (it would be using phyloscanner_analyse_trees.R) but the problem here is that there is practically no NGS data there. There is only one host in that tree (everything else is a reference), and since you're asking it for pairwise relationships between hosts, the scripts are a bit confused. What's the origin of the data?

magosil86 commented 5 years ago

The data is paired-end short reads from the IVA publication dataset that was used in the phyloscanner practical illustrating application to a real HIV dataset. I see what you mean with little data, but here are some examples from a different run with hosts (ERR732112 and ERR732075) that produces the same error Tree_ptyr2_InWindow_1550_to_1799.pdf Tree_ptyr2_InWindow_1525_to_1774.pdf

magosil86 commented 5 years ago

An update... solved. It appears TransmissionSummary.R expected the column name for ancestry values (e.g. anc, none, desc, multiAnc) in the classification files to be 'path.classification' rather than 'ancestry' hence an update to the column name resolved the issue. Many thanks!

mdhall272 commented 5 years ago

An update... solved. It appears TransmissionSummary.R expected the column name for ancestry values (e.g. anc, none, desc, multiAnc) in the classification files to be 'path.classification' rather than 'ancestry' hence an update to the column name resolved the issue. Many thanks!

Is this also relevant to transmission_summary.R?

magosil86 commented 5 years ago

Not quite, the error in transmission_summary.R seems to be triggered by a call to summarise.classifications which subsequently calls merge.classifications in collapsed_tree_methods.R. The line: tt <- tt %>% add_column(tree.id = ptree$id) in merge.classifications expects a field called 'id' in ptree but the structure of ptree (as obtained via: str(ptree)) shows the following fields:

List of 2
 $ classification.file.name: chr "trmstats_TransmissionSummary_script/ptyr2_trmStats/ptyr2_classification_InWindow_1000_to_1249.csv"
 $ suffix                  : chr "1000_to_1249"

hence updating the: tt <- tt %>% add_column(tree.id = ptree$id) in merge.classifications to: tt <- tt %>% add_column(tree.id = ptree$suffix) resolved the issue.

mdhall272 commented 5 years ago

The merge.classifications function is as I want it and it was the old scripts that hadn't been updated to use "tree.id" rather than "suffix" that were at fault. I've fixed this.