VanLoo-lab / CAMDAC

GNU General Public License v3.0
11 stars 5 forks source link

Error when calling ascat.runAscat from run_ASCAT.m #2

Closed franrodalg closed 1 year ago

franrodalg commented 1 year ago

Hi @elarosecadieux ,

First of all, many thanks and congratulations for your work.

I've been exploring CAMDAC as an alternative to the ASCAT cellularity estimates included in the Metabric data (although I'm not sure how different these would be), but I am encountering some issues.

Firstly, I was getting some error when the MspI fragments plots was intended to be saved, but I circumvented it by commenting out the plotting code.

Most importantly, though, I receive the following error message after the "ASCAT copy number segmentation" step has been completed:

Error in `[.data.table`(SNPpos, names(bafsegmented), ) : 
  When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
Calls: run_ASCAT.m -> ascat.runAscat -> runASCAT -> [ -> [.data.table

I am aware this is calling a function on the ASCAT package, but I assume this working is essential for CAMDAC.

Do you think it might be an issue with package incompatibility? Do you happen to know a way around it?

I am rerunning the tumour part again printing the session info right before the function above is called, and I will update this with the output as soon as I get it.

Cheers, Fran

elarosecadieux commented 1 year ago

Hi @franrodalg,

Thank you for your kind words!

We use the data.table package in all CAMDAC functions as it overcomes memory issues encountered with R data.frame. ASCAT uses the latter, which sometime leads to compatibility issues between the two packages. I will have a look and get back to you as soon as possible!

P.S. In addition to the R session info, could you also share the commands and flags used?

Many thanks, Lili

franrodalg commented 1 year ago

Thanks for the quick reply :)

I essentially used the same call for the wrapper as in the manual, just changing the paths, the sex to XX and the assembly as hg38 (copying here the tumour call; control would be basically the same but replacing ${case} with ${control})

Rscript --vanilla ${camdac_path}R/CAMDAC_wrapper.R \
    --patient_id=${patient} \
    --sample_id=${case} \
    --sex='XX' \
    --normal_infiltrates_proxy_id=${control} \
    --normal_origin_proxy_id=${control} \
    --patient_matched_normal_id=${control} \
    --bam=${root_fd}/data/bam_mask/${case}/${case}.bam \
    --ref='hg38' \
    --path_to_CAMDAC=${camdac_path} \
    --wdir=${root_fd}/data/camdac/ \
    --nc=${NSLOTS}

and the sessionInfo() output is:

R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /share/apps/centos7/R/gcc/4.8.5/4.1.1/lib64/R/lib/libRblas.so
LAPACK: /share/apps/centos7/R/gcc/4.8.5/4.1.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  splines   stats4    stats     graphics  grDevices
 [8] utils     datasets  methods   base     

other attached packages:
 [1] fstcore_0.9.12       gtable_0.3.0         dplyr_1.0.9         
 [4] fst_0.9.8            readr_2.1.2          gridExtra_2.3       
 [7] ggplot2_3.3.6        scales_1.2.0         ASCAT_3.1.0         
[10] doParallel_1.0.17    iterators_1.0.14     foreach_1.5.2       
[13] data.table_1.14.2    RColorBrewer_1.1-3   Rsamtools_2.10.0    
[16] Biostrings_2.62.0    XVector_0.34.0       GenomicRanges_1.46.1
[19] GenomeInfoDb_1.30.1  IRanges_2.28.0       S4Vectors_0.32.4    
[22] BiocGenerics_0.40.0  stringr_1.4.0        optparse_1.7.1      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3           compiler_4.1.1         pillar_1.7.0          
 [4] bitops_1.0-7           tools_4.1.1            zlibbioc_1.40.0       
 [7] lifecycle_1.0.1        tibble_3.1.7           pkgconfig_2.0.3       
[10] rlang_1.0.3            DBI_1.1.3              cli_3.3.0             
[13] GenomeInfoDbData_1.2.7 withr_2.5.0            hms_1.1.1             
[16] generics_0.1.3         vctrs_0.4.1            tidyselect_1.1.2      
[19] getopt_1.20.3          glue_1.6.2             R6_2.5.1              
[22] fansi_1.0.3            BiocParallel_1.28.3    tzdb_0.3.0            
[25] purrr_0.3.4            magrittr_2.0.3         codetools_0.2-18      
[28] ellipsis_0.3.2         assertthat_0.2.1       colorspace_2.0-3      
[31] utf8_1.2.2             stringi_1.7.6          RCurl_1.98-1.7        
[34] munsell_0.5.0          crayon_1.5.1  

Thanks so much for checking into this

elarosecadieux commented 1 year ago

Hi Fran,

I thought I would post a summary of our email exchanges here for the benefit of future users.

The user's RRBS data coverage was too low for CAMDAC. We recommend a minimum coverage of 75X as per our pre-print.

This is because RRBS is enriched for C>T SNPs and, at those loci, only reads from one strand can differentiate the SNP allele from the bisulphite converted allele. On average, this translates to half the coverage being informative for B-Allele Frequency calculation and thus for copy number profiling. Likewise, at polymorphic CpGs, only half of the reads inform the methylation rate calculation.

Thanks for trying the tool and we hope it is useful to you in other higher coverage datasets!