chsmiss / ATAC-amp

Searching for co-amplified regions on the genome from ATAC-seq data
MIT License

Understanding ATAC-amp parameters and interpreting output #3

Open bkinnersley opened 8 months ago

bkinnersley commented 8 months ago

Hello,

Thanks for this very useful package!

I have a few questions after running ATAC-amp on single-cell ATAC-seq libraries generated with the 10X multiome kit. I ran it using the following command:

    python /path/to/software/ATAC-amp/AtacAmp.py \
        --bam \
        --name \
        --isize_value 1000 \
        --interval_size 1000 \
        --mapq 30 \
        --mode 0 \
        --type sc \
        --gtf /path/to/hg38.ncbiRefSeq.gtf \
        --threads 16

I've attached output files from this run in "TEST_output.zip"

I just have a few questions:

  1. Could you suggest how we should interpret the output files of ATAC-amp, and how this can help prioritise identification of genuine ecDNA amplicons? Of those in the attached "TEST_output.zip", do you think there are any promising candidates?
  2. Please could you provide more details on some of the parameter settings:
     a. --mode 0, 1, 2 - what are these different modes, and what is "--discbk"?
     b. --isize_value - what is this value, and is 1000 recommended for both bulk ATAC-Seq and single-cell ATAC-Seq data?
     c. --interval_size - as above, what does this parameter control, and is 1000 recommended for both bulk and single-cell ATAC-Seq data?
  3. Is it possible to estimate the count/abundance of candidate ecDNA amplicons from the output of ATAC-amp?

Thanks very much

Best wishes

Ben

TEST_output.zip

chsmiss commented 8 months ago

Hi Ben,

For your first question, "how we interpret the output files of ATAC-amp, and how this can help prioritise identification of genuine ecDNA amplicons?": for bulk ATAC-seq data in ‘bulk’ mode, ATACAmp produces only one main result, the ‘.result’ file, which lists the candidate ecDNA/HSR-forming regions ordered by score from highest to lowest. For single-cell ATAC data in ‘sc’ mode, this file is slightly different: the last line of each candidate ecDNA/HSR region gives the barcodes of the cells that support that region, for subsequent analyses at the cell-population level.

However, the results of the current ATACAmp analysis are very sensitive to data quality, so for your data I would suggest running QC first and analysing only high-quality reads. As far as I understand, there are not many cases of chrY fragments forming ecDNA, and you can prioritise regions that carry oncogenes and regions larger than 100 kb.
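The prioritisation hints above (chrY candidates are rarely ecDNA, oncogene-carrying regions and regions over 100 kb come first) could be applied with a small sketch like the one below. This is only an illustration and not part of ATACAmp: the `Region` class, the `priority` function, the example oncogene set, and the coordinates are all hypothetical, and a curated oncogene list should replace the placeholder one.

```python
from dataclasses import dataclass

# Illustrative placeholder only; substitute a curated oncogene list.
EXAMPLE_ONCOGENES = frozenset({"MYC", "MYCN", "EGFR", "CDK4", "MDM2", "ERBB2", "CCND1"})


@dataclass
class Region:
    """A candidate region (hypothetical structure, not ATACAmp's own)."""
    chrom: str
    start: int
    end: int
    genes: tuple

    @property
    def size(self):
        return self.end - self.start


def priority(region, oncogenes=EXAMPLE_ONCOGENES, min_size=100_000):
    """Heuristic rank: -1 for chrY, else one point each for oncogene and size."""
    if region.chrom == "chrY":
        return -1
    points = 0
    if oncogenes.intersection(region.genes):
        points += 1
    if region.size > min_size:
        points += 1
    return points


def prioritise(regions):
    """Return regions sorted from most to least promising."""
    return sorted(regions, key=priority, reverse=True)


# Made-up candidates purely for demonstration.
candidates = [
    Region("chrY", 1_000_000, 1_300_000, ("GENE_A",)),
    Region("chr8", 2_000_000, 2_800_000, ("MYC",)),
    Region("chr1", 3_000_000, 3_005_000, ("RC3H1",)),
]
ranked = prioritise(candidates)
```

The ranking is deliberately coarse; it should be combined with the score ordering already present in the ‘.result’ file rather than replace it.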

About the parameters you mentioned:

  1. --mode 0, 1, 2 selects the type of input file used to run ATACAmp. Mode ‘0’ accepts the BAM file, mode ‘1’ accepts the split-read and discordant-read files, and mode ‘2’ accepts the interval file. This makes it possible to use breakpoint information from other software, and to save running time when re-running after adjusting parameters.
  2. --isize_value is the insert size of the discordant reads. It depends on the sequencing library construction method, but 1000 is a suitable value for most second-generation sequencing methods on the market.
  3. --interval_size controls the step size from the breakpoint when calculating the amplified region. 1000 is an empirical value; you can try a larger value to speed up the calculation, or a smaller value to make the boundaries finer.

Finally, at the moment ATACAmp still has limited resolution at the single-cell level and cannot yet estimate abundance, but we will continue to develop this software, with updates to detect variants in conjunction with new single-cell genome-level sequencing technologies.
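To make the two numeric parameters more concrete, here is a minimal Python sketch. It assumes the common definition of a discordant pair (mates on different chromosomes, or farther apart than an insert-size threshold) and pictures the step size as fixed-width windows walking away from a breakpoint. Both functions are hypothetical illustrations and may not match what ATACAmp does internally.

```python
def is_discordant(chrom1, pos1, chrom2, pos2, isize_value=1000):
    """Assumed rule: different chromosomes, or mates farther apart than isize_value."""
    if chrom1 != chrom2:
        return True
    return abs(pos2 - pos1) > isize_value


def windows_from(breakpoint, interval_size=1000, n_steps=3):
    """Fixed-width windows stepping away from a breakpoint (illustrative only)."""
    return [
        (breakpoint + i * interval_size, breakpoint + (i + 1) * interval_size)
        for i in range(n_steps)
    ]


# A pair 300 bp apart is concordant under a 1000 bp threshold;
# a pair 50 kb apart, or spanning two chromosomes, is discordant.
close_pair = is_discordant("chr1", 10_000, "chr1", 10_300)
far_pair = is_discordant("chr1", 10_000, "chr1", 60_000)
split_pair = is_discordant("chr1", 10_000, "chr11", 10_300)

# Five 1000 bp steps from a breakpoint cover a 5000 bp span.
steps = windows_from(121_182_712, interval_size=1000, n_steps=5)
```

Under this picture, a larger --interval_size means fewer windows to evaluate (faster), while a smaller one places region boundaries on a finer grid.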

weihong1991 commented 1 week ago

For bulk ATAC-seq data in ‘bulk’ mode, ATACAmp produces only one main result, the ‘.result’ file, which lists the candidate ecDNA/HSR-forming regions ordered by score from highest to lowest.

I also have a few questions about the tool’s output for bulk data analysis and would appreciate some clarification.

1) I noticed that some interval sets include a main (or “max”) cycle along with several smaller cycles. Could you explain the relationship between these different cycles? Also, for subsequent analyses, would you recommend focusing on the max cycle, or are the smaller cycles equally important to consider?

2) I also observed that some interval sets contain many intervals (see below), though only a subset of these are included in the identified cycles. Could you clarify the relationship between intervals that are part of cycles and those that are not? Understanding this distinction will help me interpret the results more accurately.

Thank you very much for your assistance

GSM7634668.result_amplicon interval sets3: 107,1340,1,1336,143,110,128,1338,144,1342,2837,1333,1334,3074,129    245     180000
107     chr1    121182712       121187712       5000    SRGAP2C,FAM72B
1340    chr11   67627086        67635086        8000    TBX10,NUDT8
1       chr1    174020727       174024727       4000    RC3H1
1336    chr11   63936906        63939906        3000    NAA40
143     chr1    149837418       149844418       7000    H3C14,H2AC19,H2AC18,H3C15
110     chr1    143971238       143975238       4000    SRGAP2D,FAM72C
128     chr1    145092630       145096630       4000    FAM72D,FAM72C,SRGAP2B
1338    chr11   64776822        64779822        3000    SF1
144     chr1    149847968       149856968       9000    H3C14,H2AC19,H3C15,H2BC20P,H2AC18
1342    chr11   65414590        65452590        38000   MIR612,NEAT1
2837    chr20   53567970        53649970        82000   ZNF217,LOC105372672,LOC101927770
1333    chr11   61391251        61394251        3000    TMEM216
1334    chr11   64303709        64307709        4000    ESRRA,CATSPERZ,KCNK4-TEX40
3074    chr3    72445741        72448741        3000    RYBP
129     chr1    144959583       144962583       3000    SRGAP2B
max cycle:107,1,128,110,107
cycle 1: 1,110,107
107,1
107,110
110,1
cycle 2: 1,128,110
110,1
110,128
128,1
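
For reference, the excerpt above can be tabulated with a short Python sketch that separates the intervals appearing in any cycle from those that do not. It only reorganises the numbers pasted above and does not explain how ATACAmp builds cycles; the variable and function names are hypothetical, and the reading of the last header field as a total length is an observation from this one excerpt, not documented behaviour.

```python
# Values copied from the excerpt above.
HEADER_IDS = "107,1340,1,1336,143,110,128,1338,144,1342,2837,1333,1334,3074,129"
HEADER_LAST_FIELD = 180000
INTERVAL_SIZES = {
    107: 5000, 1340: 8000, 1: 4000, 1336: 3000, 143: 7000,
    110: 4000, 128: 4000, 1338: 3000, 144: 9000, 1342: 38000,
    2837: 82000, 1333: 3000, 1334: 4000, 3074: 3000, 129: 3000,
}
CYCLE_LINES = [
    "max cycle:107,1,128,110,107",
    "cycle 1: 1,110,107",
    "cycle 2: 1,128,110",
]


def cycle_members(lines):
    """Collect every interval ID that appears in at least one cycle line."""
    members = set()
    for line in lines:
        _, ids = line.split(":", 1)
        members.update(int(x) for x in ids.split(","))
    return members


all_ids = [int(x) for x in HEADER_IDS.split(",")]
in_cycles = cycle_members(CYCLE_LINES)
not_in_cycles = [i for i in all_ids if i not in in_cycles]

# Summed sizes, overall and restricted to intervals found in cycles.
total_size = sum(INTERVAL_SIZES[i] for i in all_ids)
cycle_size = sum(INTERVAL_SIZES[i] for i in in_cycles)
```

In this excerpt, the summed interval sizes equal the last field of the header line (180000), and only four of the fifteen intervals (1, 107, 110, 128, all on chr1) take part in any cycle.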