How can I identify ecDNAs from PrepareAA results?

AmpliconSuite / AmpliconSuite-pipeline

A quickstart tool for AmpliconArchitect. Performs all preliminary steps (alignment, CNV calling, seed interval detection) required prior to running AmpliconArchitect. Previously called PrepareAA.


How can I identify ecDNAs from PrepareAA results? #1

Closed kerenzhou062 closed 3 years ago

kerenzhou062 commented 4 years ago

Hi Jens,

I ran the PC3_prostate WGS data provided in the Nature paper through PrepareAA on GRCh38. The results can be downloaded from here.

What I am interested in is ecDNA. I've read the File Formats section and, if I understand correctly, the Cycles section in the {out}_amplicon{id}_cycle.txt file represents the predicted ecDNA. Is that correct?

Thanks, Keren

jluebeck commented 4 years ago

Hi Keren,

For a description of the cycles file, please see the AA documentation related to the cycles file. There you will find a description of how to interpret the results, including the "+0" source nodes. The elements in the cycles file represent maximum-weight paths in the breakpoint graph. These can be thought of as plausible reconstructions that explain the copy number and discordant edges present. Cyclic paths are suggestive of ecDNA.

kerenzhou062 commented 4 years ago

Hi Jens, Thank you for your reply. So, suppose that I got results like those below:

Cycle=1;Copy_count=8.41915844112;Segments=66+,75-
Cycle=2;Copy_count=25.0827523522;Segments=0+,35-,54-,50-,18+,0-
Cycle=3;Copy_count=3.44135333252;Segments=0+,55-,50-,20+,59-
Cycle=4;Copy_count=3.10691662412;Segments=21+,2+,10+,4+,64-,0-
Cycle=5;Copy_count=14.7175752734;Segments=0+,56-,0-
Cycle=6;Copy_count=6.99511814997;Segments=47+

According to the description and your explanation, Cycle-1 is the true and complete predicted ecDNA; Cycle-2, Cycle-3 and Cycle-4 may be parts of the true ecDNA; Cycle-5 may not be true ecDNA; and Cycle-6 is the linear one. Is this correct?

And do the strands "+" and "-" indicate that the sequence came from the positive and negative strand of the reference genome, respectively?

Thank you, Keren

jluebeck commented 4 years ago

This interpretation is not quite right. AA does not rank results by how likely they are to be ecDNA. High copy-number rearranged cyclic structures are typically assumed to be ecDNA. However, biologically speaking, ecDNA may be observed integrated into chromosomes as well, so some of the high-CN non-cyclic reconstructions may be ecDNA in origin. It is a subtle distinction and I encourage you to review some of the biology described in the Turner et al., 2017 Nature paper. Cycle-6 is not suggested to be linear. Please see the documentation on interpreting cycles file output, in which the 0+,...,0- notation is used to distinguish non-cyclic reconstructions from cyclic ones. Also, note that the segments listed here may come from overlapping regions of the genome.

The +/- notation accompanying each segment number describes the orientation of that segment of the reference genome.
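To make the 0+,...,0- convention concrete, here is a minimal Python sketch that parses cycle lines like the ones posted above and flags each path as cyclic or non-cyclic. It is an illustration only, not part of AA: the function name `parse_cycle_line` and the returned dictionary layout are hypothetical, and the sketch assumes every line follows the `Cycle=...;Copy_count=...;Segments=...` layout shown in this thread.

```python
# Example lines copied from the thread above.
EXAMPLE_LINES = [
    "Cycle=1;Copy_count=8.41915844112;Segments=66+,75-",
    "Cycle=2;Copy_count=25.0827523522;Segments=0+,35-,54-,50-,18+,0-",
    "Cycle=3;Copy_count=3.44135333252;Segments=0+,55-,50-,20+,59-",
    "Cycle=4;Copy_count=3.10691662412;Segments=21+,2+,10+,4+,64-,0-",
    "Cycle=5;Copy_count=14.7175752734;Segments=0+,56-,0-",
    "Cycle=6;Copy_count=6.99511814997;Segments=47+",
]


def parse_cycle_line(line):
    """Parse one 'Cycle=...;Copy_count=...;Segments=...' line (hypothetical helper)."""
    # Split 'key=value' pairs separated by ';'.
    fields = dict(item.split("=", 1) for item in line.strip().split(";") if item)
    # Each segment token is '<id><orientation>', e.g. '66+' or '75-'.
    segments = [(int(tok[:-1]), tok[-1]) for tok in fields["Segments"].split(",")]
    # Segment 0 is the source node; a path that touches it is non-cyclic.
    is_cyclic = all(seg_id != 0 for seg_id, _ in segments)
    return {
        "cycle": int(fields["Cycle"]),
        "copy_count": float(fields["Copy_count"]),
        "segments": segments,
        "is_cyclic": is_cyclic,
    }


records = [parse_cycle_line(line) for line in EXAMPLE_LINES]
for rec in records:
    kind = "cyclic" if rec["is_cyclic"] else "non-cyclic"
    print(rec["cycle"], kind, rec["copy_count"])
```

Under this reading, Cycle 1 and Cycle 6 are cyclic paths, while Cycles 2 through 5 contain the source node and are non-cyclic.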

kerenzhou062 commented 4 years ago

Hi Jens, Thank you for your explanations.

So all of the cycles in the Cycles section are suggestive of ecDNA or parts of ecDNA, right?

Cycles like 0+,...,0- may arise because the ecDNA has integrated into a chromosome, or because only part of the ecDNA could be detected for some unknown reason. Is that right?

Thanks, Keren

jluebeck commented 4 years ago

Hi Keren, yes that is correct! However, low copy number, non-cyclic elements are likely not ecDNA. So some filtering may be required if you are interested only in the elements suggestive of ecDNA or parts of ecDNA.

And yes, the interpretation of the 0+,...,0- cycles is correct.

kerenzhou062 commented 4 years ago

Hi Jens,

Thank you so much for your reply! I have one more question about the strand, because I want to make sure that my understanding is 100% correct.

Suppose that Cycle-1=66+,75-, where the sequences of segment 66 are AAATTTGGG(+) and CCCAAATTT(-) in the reference genome, while those of segment 75 are CCGGAAT(+) and ATTCCGG(-). So the linear representation of Cycle-1 could be AAATTTGGG ATTCCGG and CCGGAAT CCCAAATTT, right?

Thanks, Keren

jluebeck commented 4 years ago

In this case Cycle 1 (66+,75-) would have the sequence AAATTTGGG ATTCCGG, while the reverse complement (75+,66-) would have the sequence CCGGAAT CCCAAATTT, which is what you wrote out I believe.
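The toy example above can be checked with a short Python sketch. The helper names (`reverse_complement`, `path_sequence`, `reverse_path`) and the `ref` dictionary are hypothetical and exist only for this illustration; the segment sequences are the made-up ones from Keren's question, not real reference sequence.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def path_sequence(segments, ref):
    """Concatenate segment sequences, reverse-complementing '-' segments."""
    parts = []
    for seg_id, strand in segments:
        seq = ref[seg_id]  # '+' strand sequence of the segment
        parts.append(seq if strand == "+" else reverse_complement(seq))
    return "".join(parts)


def reverse_path(segments):
    """Traverse the same path in the opposite direction (order and signs flipped)."""
    flip = {"+": "-", "-": "+"}
    return [(seg_id, flip[strand]) for seg_id, strand in reversed(segments)]


# Toy '+' strand sequences from the question above.
ref = {66: "AAATTTGGG", 75: "CCGGAAT"}
cycle1 = [(66, "+"), (75, "-")]

forward = path_sequence(cycle1, ref)                 # 66+,75-
backward = path_sequence(reverse_path(cycle1), ref)  # 75+,66-

print(forward)   # AAATTTGGGATTCCGG
print(backward)  # CCGGAATCCCAAATTT
```

The two strings are reverse complements of each other, which is why 66+,75- and 75+,66- describe the same structure read from opposite strands.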

kerenzhou062 commented 4 years ago

Hi Jens,

Thank you so much!

I checked the identification results and found that there were actually some Cycle elements with low copy numbers (less than 5, which is the suggested cutoff for seed intervals). What cutoff on copy number would you suggest for non-cyclic amplicons that are likely to be ecDNA?

Thanks, Keren

virajbdeshpande commented 4 years ago

Hello Keren,

It is possible that the amplicon has cycles with low copy numbers for a couple of reasons: 1) The amplicon itself is not very high copy number. 2) The amplicon has a heterogeneous structure and some of its cycles have very low copy number.

However, the low copy number cycles may also arise due to errors: 1) The reconstruction is poor, with too many false edges. Because of this, the cycle decomposition algorithm over-decomposes the amplicon into too many low-copy cycles. 2) The amplicon has a complicated structure with a large number of edges. In this case, it is difficult to fully resolve the amplicon structure accurately, and some true cycles may be assigned low copy numbers.

The copy number threshold is a judgement call. You can look at the SVVIEW (png image) generated by AA. In our paper, we selected 5 as a reasonable threshold for high-copy amplicons because we rarely observed this level of germline amplification in TCGA. Overall, high-copy amplicons had an exponential copy number distribution (with mean 3.16). So it is possible that there are somatic amplicons/ecDNA with average copy numbers smaller than 5. See Fig 2 here: https://www.nature.com/articles/s41467-018-08200-y#Sec2
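As a rough illustration of the filtering discussed in this thread, the Python sketch below keeps only paths whose copy count meets a chosen threshold and reports whether each one is cyclic. The function name `filter_by_copy_count`, the tuple layout, and the default of 5.0 are assumptions made for this example; the threshold remains a judgement call and this is not an AA feature.

```python
# (cycle id, copy count, is_cyclic) for the six example paths posted earlier
# in the thread; is_cyclic is False when the path contains the source node 0.
EXAMPLE_PATHS = [
    (1, 8.41915844112, True),
    (2, 25.0827523522, False),
    (3, 3.44135333252, False),
    (4, 3.10691662412, False),
    (5, 14.7175752734, False),
    (6, 6.99511814997, True),
]


def filter_by_copy_count(paths, min_copy_count=5.0):
    """Keep paths with copy count >= min_copy_count (threshold is an assumption)."""
    return [p for p in paths if p[1] >= min_copy_count]


kept = filter_by_copy_count(EXAMPLE_PATHS)
for cycle_id, copy_count, is_cyclic in kept:
    kind = "cyclic" if is_cyclic else "non-cyclic"
    print(f"Cycle {cycle_id}: copy count {copy_count:.2f}, {kind}")
```

With a threshold of 5, Cycles 1, 2, 5 and 6 are retained. The high-copy non-cyclic paths (2 and 5) may still be ecDNA in origin, so they are worth inspecting in the SVVIEW image rather than discarding automatically.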

kerenzhou062 commented 4 years ago

Thank you so much, Viraj!