peak_detector_closest_hit_match() fails when all peaks discarded for a karyotype

caravagnalab / CNAqc

CNAqc - Copy Number Alteration (CNA) Quality Check package

GNU General Public License v3.0

17 stars 8 forks source link

peak_detector_closest_hit_match() fails when all peaks discarded for a karyotype #10

Closed bobamess closed 3 years ago

bobamess commented 3 years ago

I was happily using this very useful package, but then encountered a, possibly rare, event involving all peaks being discarded for a karyotype. Unfortunately, this event was not handled well causing the script to exit.

analyze_peaks() calls CNAqc:::peak_detector_closest_hit_match() which defines get_match()as

    get_match = function(p) {
        not_discarded = xy_peaks %>% filter(!discarded)
        distances = abs(not_discarded$x - expectation$peak[p])
        id_match = which.min(distances)
        if (length(id_match) == 0) 
            return(NA)
        else return(not_discarded[id_match, ])
    }

In the, possibly rare, event of all peaks being discarded for a karyotype this results in length(id_match) == 0 and get_match() returns NA.

Subsequently, with

matched_peaks = lapply(seq_along(expectation$peak), get_match)

we get

str(matched_peaks)

List of 1
 $ : logi NA

instead of the usual

 List of 1
 $ : tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ x             : num 0.15
  ..$ y             : num 6.97
  ..$ counts_per_bin: int 205
  ..$ discarded     : logi FALSE

And, after

matching = expectation %>% dplyr::bind_cols(Reduce(dplyr::bind_rows, matched_peaks))

we get

str(matching)

tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
 $ mutation_multiplicity: num 1
 $ karyotype            : chr "1:1"
 $ peak                 : num 0.473
 $ ...4                 : logi NA

instead of the more usual

str(matching)

tibble [1 × 7] (S3: tbl_df/tbl/data.frame)
 $ mutation_multiplicity: num 1
 $ karyotype            : chr "1:1"
 $ peak                 : num 0.489
 $ x                    : num 0.15
 $ y                    : num 6.97
 $ counts_per_bin       : int 205
 $ discarded            : logi FALSE

Thus, the following line

matching$offset = matching$peak - matching$x

then gives this error and warning and the script exits.

Error: Assigned data `matching$peak - matching$x` must be compatible with existing data.
✖ Existing data has 1 row.
✖ Assigned data has 0 rows.
ℹ Row updates require a list value. Do you need `list()` or `as.list()`?
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
Unknown or uninitialised column: `x`.

Possible solution:

Either get_match() should return the required structured object instead of plain NA, or peak_detector_closest_hit_match() catches this event, skips the following lines and returns an empty list with the correct structure.

caravagn commented 3 years ago

Hi @bobamess thanks for the issue, I want to make sure you will be keeping using this tool happily!

Jokes aside, that is a rare error I agree, but definitely easy to fix. I understand the logic of your suggestion, is there any chance you could share a toy dataset to replicate it? It would help me understand better what I am missing and how to return the expected structure, and also wether this event affects other tool's functions etc.

If not, how about you make a pull request and implement the fix?

bobamess commented 3 years ago

Hi @caravagn Sure. Here's a toy example

library(tibble)
library(magrittr)
library(dplyr)

# this first example works as there is one remaining undiscarded peak

xy_peaks <- data.frame(x = as.numeric(c(0.15, 0.45, 0.49, 0.57, 0.62, 0.65, 0.7)),
                            y = as.numeric(c(6.97, 0.17, 0.07, 0.03, 0.03, 0.03, 0.04)),
                            counts_per_bin = as.integer(c(205, 10, 3, 1, 2, 2, 2)),
                            discarded = as.logical(c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)))

xy_peaks <- as_tibble(xy_peaks)

Then from CNAqc:::peak_detector_closest_hit_match()

get_match = function(p) {
    not_discarded = xy_peaks %>% filter(!discarded)
    distances = abs(not_discarded$x - expectation$peak[p])
    id_match = which.min(distances)
    if (length(id_match) == 0) 
        return(NA)
    else return(not_discarded[id_match, ])
}

matched_peaks = lapply(seq_along(expectation$peak), get_match)
matching = expectation %>% dplyr::bind_cols(Reduce(dplyr::bind_rows, matched_peaks))

And we get

str(matching)

tibble [1 × 7] (S3: tbl_df/tbl/data.frame)
 $ mutation_multiplicity: num 1
 $ karyotype            : chr "1:1"
 $ peak                 : num 0.489
 $ x                    : num 0.15
 $ y                    : num 6.97
 $ counts_per_bin       : int 205
 $ discarded            : logi FALSE

matching$offset = matching$peak - matching$x

str(matching)

tibble [1 × 8] (S3: tbl_df/tbl/data.frame)
 $ mutation_multiplicity: num 1
 $ karyotype            : chr "1:1"
 $ peak                 : num 0.489
 $ x                    : num 0.15
 $ y                    : num 6.97
 $ counts_per_bin       : int 205
 $ discarded            : logi FALSE
 $ offset               : num 0.339

This next example fails as all are discarded as described earlier

xy_peaks_bak <- xy_peaks

xy_peaks <- data.frame(x = as.numeric(c(0.15, 0.45, 0.49, 0.57, 0.62, 0.65, 0.7)),
                            y = as.numeric(c(6.97, 0.17, 0.07, 0.03, 0.03, 0.03, 0.04)),
                            counts_per_bin = as.integer(c(205, 10, 3, 1, 2, 2, 2)),
                            discarded = as.logical(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)))

xy_peaks <- as_tibble(xy_peaks)

matched_peaks = lapply(seq_along(expectation$peak), get_match)

matching = expectation %>% dplyr::bind_cols(Reduce(dplyr::bind_rows, matched_peaks))

str(matching)

tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
 $ mutation_multiplicity: num 1
 $ karyotype            : chr "1:1"
 $ peak                 : num 0.489
 $ ...4                 : logi NA

matching$offset = matching$peak - matching$x

Error: Assigned data `matching$peak - matching$x` must be compatible with existing data.
✖ Existing data has 1 row.
✖ Assigned data has 0 rows.
ℹ Row updates require a list value. Do you need `list()` or `as.list()`?
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
Unknown or uninitialised column: `x`.

I can't think of exactly how to fix this and it may involve modifications to CNAqc:::analyze_peaks() after it calls CNAqc:::peak_detector_closest_hit_match()

Thanks in advance of a fix

caravagn commented 3 years ago

Thanks, I would need the values for expectation to run it. Can you put those here?

bobamess commented 3 years ago

Sorry, forgot to include that. This should work:

expectation <- data.frame(mutation_multiplicity = as.double(1), karyotype = as.character("1:1"), peak = as.double(0.489))
expectation <- as_tibble(expectation)

and give you

str(expectation)

Classes 'tbl_df', 'tbl' and 'data.frame':   1 obs. of  3 variables:
$ mutation_multiplicity: num 1
$ karyotype            : Factor w/ 1 level "1:1": 1
$ peak                 : num 0.489

caravagn commented 3 years ago

I think I might have fixed that, meaning that I intercept - I think - the fact the case with

 # Handle special case where everything is discarded by including the one
  # with highest value of counts_per_bin (just that).
  if(all(xy_peaks$discarded)) {
    xy_peaks = xy_peaks %>%
      mutate(
        discarded = ifelse(counts_per_bin == max(xy_peaks$counts_per_bin), TRUE, discarded)
      )
  }

  # Again, if the if-clause above did not suffice to find peaks, raise a stop error
  if(all(xy_peaks$discarded))
    stop("Cannot find peaks for this karyotype, raising an error (check the data distribution).")

The second if should catch it. The thing I cannot understand is why the first did not force discarded=TRUE on the largest of your peaks (I had put that for cases like this). Are you using the latest CNAqc version?

Also, now if the error propagates upstream I generate a general stop and break the computation. In order to something smarter - like working out QC with other copy states etc - I would need to see this run on real data (otherwise plugging in just the xy_peaks values gives limited room to understand what the real problem is - e.g., why the first if on top did not pick it up).

If you want me to look more into this, the ideal would be to have a CNAqc object to run.

bobamess commented 3 years ago

Yes, I had noticed that ifstatement and had wondered why it had not dealt with the situation.

I'm not keen on an if statement resulting in a stop as this then kills the pipeline, unless I design it to ignore failed processes and continue, which is possible in Nextflow. However, this would still exclude the sample pair from the results just because one of its karyotypes failed at this step and any data from its other karyotypes would also be excluded.

I would prefer if a warning could be issued and somehow the process enabled to continue with the sample pair included.

Within the analyze_peaks()function you have a for loop which begins

  for (k in karyotypes) {
        if (!(k %in% qc_karyotypes)) {
            detection = list(matching = NULL)
        }

If peak_detector_closest_hit_match() could return NA when this situation arises, instead of stopping, and this was checked, maybe it would then be possible to add to the detection object in a similar way for the offending karyotype?

Presumably, if peak_detector_closest_hit_match()returns NA then is.na(best_peaks) == TRUE and detection = list(matching = NULL) could be invoked and the for loop continued. Would that work?

Having said that, it seems to me that when if (!(k %in% qc_karyotypes)) returns TRUE this loses any data already collected for previous k in detection and so is only useful if this situation occurs for the first one, or first few, instances of k in karyotypes and not after useful data has been collected.

I've saved the CNAqc data object to the attached file (CNAqc_data.zip). After unzipping, as I'm sure you know, you should be able to load it with

data <- readRDS(file = "CNAqc_data.rds")

You should then be able to continue using this with

peaks <- analyze_peaks(data,
                        karyotypes = c('1:0', '1:1', '2:1', '2:0', '2:2'),
                        min_karyotype_size = 0.05,
                        min_absolute_karyotype_mutations = 50,
                        p_binsize_peaks = 0.005,
                        matching_epsilon = 0.025,
                        n_bootstrap = 1, 
                        kernel_adjust = 1, 
                        matching_strategy = "closest")

and get output with the error as below.

ℹ Requested karyotypes 1:0, 1:1, 2:1, 2:0, and 2:2. Matching strategy closest.
ℹ Found n = 3212 mutations in 1:1 (skipping those with n < 161 mutations).
New names:
* NA -> ...4
Error: Assigned data `matching$peak - matching$x` must be compatible with existing data.
✖ Existing data has 1 row.
✖ Assigned data has 0 rows.
ℹ Row updates require a list value. Do you need `list()` or `as.list()`?
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
Unknown or uninitialised column: `x`.

And also debug using code from analyze_peaks() and its dependencies.

Thanks again for trying to fix this. CNAqc_data.rds.zip

caravagn commented 3 years ago

Thanks. It does not crash to me like that.

> x <- analyze_peaks(x,
+                        karyotypes = c('1:0', '1:1', '2:1', '2:0', '2:2'),
+                        min_karyotype_size = 0.05,
+                        min_absolute_karyotype_mutations = 50,
+                        p_binsize_peaks = 0.005,
+                        matching_epsilon = 0.025,
+                        n_bootstrap = 1, 
+                        kernel_adjust = 1, 
+                        matching_strategy = "closest")
ℹ Requested karyotypes 1:0, 1:1, 2:1, 2:0, and 2:2. Matching strategy closest.
ℹ Found n = 3212 mutations in 1:1 (skipping those with n < 161 mutations).

── BMix fit ──────────────────────────────────────────────────────────────────────────────────────────────────────────

ℹ Binomials k_B = 1, 2, 3, and 4, Beta-Binomials k_BB = 0; 8 fits to run.
|============================================================================================|100% ~0 s remaining     
ℹ Bmix best fit completed in 0.05 mins

── [ BMix ] My BMix model n = 3212 with k = 1 component(s) (1 + 0) ───────────────────────────────────────────────────
• Clusters: π = 100% [Bin 1], with π > 0.
• Binomial Bin 1 with mean = 0.10070770739696.
ℹ Score (model selection): ICL = 15712.16.
# A tibble: 1 x 12
  mutation_multiplicity karyotype  peak     x     y counts_per_bin discarded  offset matched weight   score QC   
                  <dbl> <chr>     <dbl> <dbl> <dbl>          <int> <lgl>       <dbl> <lgl>    <dbl>   <dbl> <chr>
1                     1 1:1       0.473  0.47  0.02              1 FALSE     0.00300 TRUE         1 0.00300 PASS 
✓ Peak detection PASS with r = 0.003 and tolerance e = 0.05

And in fact

> x
── [ CNAqc ]  n = 3213 mutations in 52 segments (10 clonal + 42 subclonal). Genome reference: GRCh38. ────────────────

 1:1  [n = 3212, L =   NA Mb] ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
 1:0  [n =    1, L =   NA Mb] 

ℹ Sample Purity: 94.6% ~ Ploidy: 2.

──  PASS  Peaks QC closest: % 100 with q = 0.003 
ℹ 1:1 ~ n = 3212  (100%) →  PASS  0.003

Now if I notice your error message I see

New names:
* NA -> ...4

This is due - I am pretty sure - to some older version of one of the tidyverse packages, because a similar type of error I had to deal with another package of mine. It seems to me that something might not be updated there.

My sessionInfo() is

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] CNAqc_0.1.1        RColorBrewer_1.1-2 peakPick_0.11      ggpubr_0.4.0       crayon_1.4.1       pio_0.1.0         
 [7] forcats_0.5.1      stringr_1.4.0      dplyr_1.0.5        purrr_0.3.4        readr_1.4.0        tidyr_1.1.3       
[13] tibble_3.1.1       ggplot2_3.3.3      tidyverse_1.3.1   

loaded via a namespace (and not attached):
 [1] httr_1.4.2        viridisLite_0.4.0 VGAM_1.1-5        jsonlite_1.7.2    splines_4.0.2     foreach_1.5.1    
 [7] carData_3.0-4     modelr_0.1.8      BMix_0.1.0        vcfR_1.12.0       assertthat_0.2.1  stats4_4.0.2     
[13] easypar_0.2.0     cellranger_1.1.0  ggrepel_0.9.1     pillar_1.6.0      backports_1.2.1   lattice_0.20-41  
[19] glue_1.4.2        ggsignif_0.6.1    rvest_1.0.0       colorspace_2.0-0  Matrix_1.3-2      cowplot_1.1.1    
[25] clisymbols_1.2.0  pkgconfig_2.0.3   broom_0.7.6       haven_2.4.0       scales_1.1.1      openxlsx_4.2.3   
[31] rio_0.5.26        mgcv_1.8-34       generics_0.1.0    car_3.0-10        ellipsis_0.3.1    withr_2.4.2      
[37] cli_2.5.0         magrittr_2.0.1    readxl_1.3.1      fs_1.5.0          fansi_0.4.2       MASS_7.3-53.1    
[43] doParallel_1.0.16 nlme_3.1-152      rstatix_0.7.0     xml2_1.3.2        foreign_0.8-81    vegan_2.5-7      
[49] tools_4.0.2       data.table_1.14.0 hms_1.0.0         lifecycle_1.0.0   munsell_0.5.0     reprex_2.0.0     
[55] cluster_2.1.1     zip_2.1.1         compiler_4.0.2    rlang_0.4.10      grid_4.0.2        iterators_1.0.13 
[61] rstudioapi_0.13   gtable_0.3.0      codetools_0.2-18  abind_1.4-5       DBI_1.1.1         curl_4.3         
[67] R6_2.5.0          lubridate_1.7.10  pinfsc50_1.2.0    utf8_1.2.1        permute_0.9-5     ape_5.4-1        
[73] stringi_1.5.3     parallel_4.0.2    Rcpp_1.0.6        vctrs_0.3.7       dbplyr_2.1.1      tidyselect_1.1.0

Can you post me yours?

bobamess commented 3 years ago

Interesting! There is a minor difference in the tidyverse package as well as with your CNAqc package.
My sessionInfo() was

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/gridware/depots/1a8f5697/el7/pkg/apps/R/3.6.2/gcc-4.8.5+lapack-3.5.0+blas-3.6.0/lib64/R/lib/libRblas.so
LAPACK: /opt/gridware/depots/1a8f5697/el7/pkg/apps/R/3.6.2/gcc-4.8.5+lapack-3.5.0+blas-3.6.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] CNAqc_0.1.0        RColorBrewer_1.1-2 ggpubr_0.4.0       crayon_1.3.4      
 [5] forcats_0.5.0      stringr_1.4.0      dplyr_1.0.2        purrr_0.3.4       
 [9] readr_1.4.0        tidyr_1.1.2        tibble_3.0.4       tidyverse_1.3.0   
[13] pio_0.1.0          ggrepel_0.9.0      ggplot2_3.3.3      peakPick_0.11     
[17] CleanCNA_0.1.0    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        lubridate_1.7.9.2 ps_1.5.0          assertthat_0.2.1 
 [5] R6_2.5.0          cellranger_1.1.0  backports_1.2.1   reprex_0.3.0     
 [9] httr_1.4.2        pillar_1.4.7      rlang_0.4.10      curl_4.3         
[13] readxl_1.3.1      rstudioapi_0.13   data.table_1.13.6 car_3.0-10       
[17] foreign_0.8-72    munsell_0.5.0     broom_0.7.3       compiler_3.6.2   
[21] modelr_0.1.8      pkgconfig_2.0.3   tidyselect_1.1.0  rio_0.5.16       
[25] fansi_0.4.1       dbplyr_2.0.0      withr_2.3.0       grid_3.6.2       
[29] jsonlite_1.7.2    gtable_0.3.0      lifecycle_0.2.0   DBI_1.1.0        
[33] magrittr_2.0.1    scales_1.1.1      zip_2.1.1         cli_2.2.0        
[37] stringi_1.5.3     carData_3.0-4     ggsignif_0.6.0    fs_1.5.0         
[41] xml2_1.3.2        ellipsis_0.3.1    generics_0.1.0    vctrs_0.3.6      
[45] openxlsx_4.2.3    tools_3.6.2       glue_1.4.2        hms_0.5.3        
[49] abind_1.4-5       colorspace_2.0-0  rstatix_0.6.0     rvest_0.3.6      
[53] haven_2.3.1

After updating to CNAqc_0.1.1, and without updating tidyverse, it now works. The new sessionInfo() is now

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/gridware/depots/1a8f5697/el7/pkg/apps/R/3.6.2/gcc-4.8.5+lapack-3.5.0+blas-3.6.0/lib64/R/lib/libRblas.so
LAPACK: /opt/gridware/depots/1a8f5697/el7/pkg/apps/R/3.6.2/gcc-4.8.5+lapack-3.5.0+blas-3.6.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] CNAqc_0.1.1        RColorBrewer_1.1-2 ggpubr_0.4.0       crayon_1.3.4      
 [5] forcats_0.5.0      stringr_1.4.0      dplyr_1.0.5        purrr_0.3.4       
 [9] readr_1.4.0        tidyr_1.1.2        tibble_3.0.4       tidyverse_1.3.0   
[13] pio_0.1.0          ggrepel_0.9.0      ggplot2_3.3.3      peakPick_0.11     
[17] CleanCNA_0.1.0    

loaded via a namespace (and not attached):
 [1] nlme_3.1-148      fs_1.5.0          lubridate_1.7.9.2 doParallel_1.0.16
 [5] httr_1.4.2        tools_3.6.2       backports_1.2.1   R6_2.5.0         
 [9] vegan_2.5-7       mgcv_1.8-31       DBI_1.1.0         colorspace_2.0-0 
[13] permute_0.9-5     withr_2.3.0       tidyselect_1.1.0  curl_4.3         
[17] compiler_3.6.2    cli_2.5.0         rvest_0.3.6       xml2_1.3.2       
[21] scales_1.1.1      foreign_0.8-72    rio_0.5.16        pkgconfig_2.0.3  
[25] dbplyr_2.0.0      rlang_0.4.10      readxl_1.3.1      rstudioapi_0.13  
[29] VGAM_1.1-5        generics_0.1.0    jsonlite_1.7.2    zip_2.1.1        
[33] car_3.0-10        magrittr_2.0.1    easypar_0.2.0     Matrix_1.2-18    
[37] Rcpp_1.0.5        munsell_0.5.0     ape_5.5           abind_1.4-5      
[41] lifecycle_1.0.0   stringi_1.5.3     carData_3.0-4     MASS_7.3-51.6    
[45] pinfsc50_1.2.0    grid_3.6.2        parallel_3.6.2    lattice_0.20-41  
[49] haven_2.3.1       cowplot_1.1.1     splines_3.6.2     hms_0.5.3        
[53] ps_1.5.0          pillar_1.4.7      ggsignif_0.6.0    codetools_0.2-16 
[57] clisymbols_1.2.0  stats4_3.6.2      vcfR_1.12.0       reprex_0.3.0     
[61] glue_1.4.2        BMix_0.1.0        data.table_1.13.6 modelr_0.1.8     
[65] vctrs_0.3.6       foreach_1.5.1     cellranger_1.1.0  gtable_0.3.0     
[69] assertthat_0.2.1  openxlsx_4.2.3    broom_0.7.3       rstatix_0.6.0    
[73] viridisLite_0.3.0 iterators_1.0.13  cluster_2.1.0     ellipsis_0.3.1

I can see that other packages also got updated when updating CNAqc, including lifecycle and dplyr, but it's not clear to me why it now works?

caravagn commented 3 years ago

Mmmm I think it all stems from a major updated of one of the tidyverse packages, which introduced a different strategy to handle names of tibbles. I bumped into this in pretty much all of my packages.

Glad that it now works; I will close the issue.