chsmiss / ATAC-amp

Searching for co-amplified regions on the genome from ATAC-seq data
MIT License

Understanding ATAC-amp parameters and interpreting output #3

Open bkinnersley opened 4 months ago

bkinnersley commented 4 months ago

Hello,

Thanks for this very useful package!

I have a few questions after running it on single-cell ATAC-seq libraries generated with the 10x multiome kit. I ran it using the following command:

    python /path/to/software/ATAC-amp/AtacAmp.py \
      --bam \
      --name \
      --isize_value 1000 \
      --interval_size 1000 \
      --mapq 30 \
      --mode 0 \
      --type sc \
      --gtf /path/to/hg38.ncbiRefSeq.gtf \
      --threads 16

I've attached output files from this run in "TEST_output.zip"

I just have a few questions:

  1. Could you suggest how we should interpret the output files of ATAC-amp, and how they can help prioritise the identification of genuine ecDNA amplicons? Of those in the attached "TEST_output.zip", do you think there are any promising candidates?
  2. Could you please provide more details on some of the parameter settings:
     a. --mode 0, 1, 2: what are these different modes, and what is "--discbk"?
     b. --isize_value: what is this value, and is 1000 recommended for both bulk ATAC-seq and single-cell ATAC-seq data?
     c. --interval_size: as above, what does this parameter control, and is 1000 recommended for both bulk and single-cell ATAC-seq data?
  3. Is it possible to estimate the count/abundance of candidate ecDNA amplicons from the output of ATAC-amp?

Thanks very much

Best wishes

Ben

TEST_output.zip

chsmiss commented 4 months ago

Hi Ben,

For your first question ("how we interpret the output files of ATAC-amp, and how this can help prioritise identification of genuine ecDNA amplicons?"): for bulk ATAC-seq data in ‘bulk’ mode, ATACAmp produces only one main result, the ‘.result’ file, which lists the possible ecDNA/HSR-forming regions ordered by score from highest to lowest. For single-cell ATAC data in ‘sc’ mode, this file is slightly different: the last line of each possible ecDNA/HSR region gives the barcodes of the cells that support that region, for subsequent analyses at the cell-population level.

However, the results of the current ATACAmp analysis are very sensitive to data quality, so for your data I would suggest doing QC first and then running the analysis on high-quality reads only. As far as I understand, there are not many cases of fragments on chrY forming ecDNA, and you can prioritise regions that carry oncogenes and regions larger than 100 kb.
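To make the prioritisation advice concrete, here is a minimal Python sketch of one way to rank candidates. It does not parse the ‘.result’ file, whose layout is not described in this thread; the `Candidate` record, the `EXAMPLE_ONCOGENES` set, the function names and the example coordinates are all hypothetical and only illustrate the heuristic (oncogene-carrying first, then regions larger than 100 kb, chrY last, then score).

```python
from dataclasses import dataclass

# Illustrative gene list only; substitute a curated oncogene set of your choice.
EXAMPLE_ONCOGENES = frozenset({"MYC", "EGFR", "MDM2", "CDK4"})


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate region (NOT the .result schema)."""
    chrom: str
    start: int
    end: int
    score: float
    genes: tuple = ()


def priority_key(c, oncogenes=EXAMPLE_ONCOGENES, min_size=100_000):
    """Build a sort key: oncogene-carrying, larger than min_size, not chrY, score."""
    carries_oncogene = any(g in oncogenes for g in c.genes)
    is_large = (c.end - c.start) > min_size
    not_chr_y = c.chrom != "chrY"
    return (carries_oncogene, is_large, not_chr_y, c.score)


def rank_candidates(candidates, **kwargs):
    """Return candidates sorted from most to least promising under the heuristic."""
    return sorted(candidates, key=lambda c: priority_key(c, **kwargs), reverse=True)


# Made-up example regions, for illustration only.
candidates = [
    Candidate("chr8", 127_000_000, 128_500_000, 50.0, ("MYC",)),
    Candidate("chrY", 10_000_000, 10_050_000, 90.0),
    Candidate("chr7", 55_000_000, 55_400_000, 70.0, ("EGFR",)),
    Candidate("chr1", 1_000_000, 1_300_000, 80.0),
]
ranked = rank_candidates(candidates)
for c in ranked:
    print(c.chrom, c.start, c.end, c.score, c.genes)
```

The sketch ranks rather than filters, so a high-scoring chrY or small region stays visible at the bottom of the list instead of being silently dropped.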

About the parameters you mentioned:

  1. --mode 0, 1 and 2 select which kind of input file ATACAmp runs on: mode ‘0’ accepts the BAM file, mode ‘1’ accepts the split-read and discordant-read files, and mode ‘2’ accepts the interval file. This lets you bring in breakpoint information from other software for analysis, and saves running time when you re-run after adjusting parameters.
  2. --isize_value is the insert size of the discordant reads. It depends on the sequencing library construction method, but 1000 is a suitable value for most second-generation sequencing methods on the market.
  3. --interval_size controls the step size from the breakpoint when calculating the amplified region. 1000 is an empirical value; you can try a larger value to speed up the calculation, or a smaller value to make the boundaries finer.

Finally, at the moment ATACAmp still has limited resolution for single cells and cannot yet analyse abundance, but we will continue to build on this software, with updates to detect variants in conjunction with new single-cell genome-level sequencing technologies.
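As a rough illustration of the two numeric parameters, the Python sketch below shows the general idea only; it is not ATACAmp's implementation. Both functions and their names are hypothetical. It assumes that a read pair counts as discordant by insert size when its absolute template length exceeds the threshold, and that the amplified region is scanned in fixed-size steps starting at a breakpoint.

```python
def is_discordant_by_isize(template_length, isize_value=1000):
    """Assumed rule: a pair is discordant if |insert size| exceeds the threshold."""
    return abs(template_length) > isize_value


def windows_from_breakpoint(breakpoint, limit, interval_size=1000):
    """Walk from a breakpoint towards `limit` in steps of `interval_size`.

    Returns half-open (start, end) windows; the last one is clipped at `limit`.
    """
    if interval_size <= 0:
        raise ValueError("interval_size must be positive")
    windows = []
    start = breakpoint
    while start < limit:
        end = min(start + interval_size, limit)
        windows.append((start, end))
        start = end
    return windows


# A larger step means fewer windows to evaluate (faster, coarser boundaries);
# a smaller step means more windows (slower, finer boundaries).
for step in (500, 1000, 5000):
    n = len(windows_from_breakpoint(10_000, 20_000, interval_size=step))
    print(f"interval_size={step}: {n} windows")
```

Under these assumptions, halving the step size doubles the number of windows over the same span, which is the speed versus boundary-resolution trade-off described above.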