Classification - Githubissues

alhafidzhamdan commented 3 years ago

Hi @jluebeck

I'm trying your classification script and one of my amplicons showed up as "complex non cyclic"

E26T_amplicon1_E26T_amplicon1_cycles_amp_class_radar

Is this the same as "heavily-rearranged"? Thanks! A

jluebeck commented 3 years ago

Hi,

"Complex non-cyclic" is our newer terminology for "heavily rearranged". Regarding the example you posted, the right endpoint of the segment on the left, and the left endpoint of the segment on the right are very close in the reference genome. On visual inspection, this sample looks like it is probably ecDNA, however because there is a gap in the two seed regions, it does not close the cycle, and thus AmpliconClassifier calls it complex non-cyclic. Did you run ampified_intervals.py in the AA repo on your CNV seeds prior to running AA? Typically such close regions get merged, so that the region in the middle is included - closing the cycle. You could also re-run AA using a manually merged seed region that includes the absolute left and right bounds from both segments. This will likely improve the classification reliability in the case you posted. Please do not hesitate to let me know if there are more questions.

Best regards, Jens

alhafidzhamdan commented 3 years ago

Hey there, Thanks for getting back. Here's my original bed file

chr7    51530925    54155950    E26T    3   AC005999.2
chr7    54155950    55037285    E26T    20  AC092848.1
chr7    55037285    55060797    E26T    32  EGFR
chr7    55060797    55066800    E26T    18  EGFR
chr7    55066800    55132836    E26T    32  EGFR
chr7    55132836    55135847    E26T    17  EGFR
chr7    55135847    55154336    E26T    4   EGFR
chr7    55154336    55219299    E26T    19  EGFR
chr7    55219523    55220476    E26T    3   EGFR
chr7    55221270    55377914    E26T    19  ELDR
chr7    55377914    55378664    E26T    6   LANCL2
chr7    55378664    55389661    E26T    19  LANCL2
chr7    55389661    55401657    E26T    5   LANCL2
chr7    55401657    55403656    E26T    4   LANCL2
chr7    55403656    55617986    E26T    5   LANCL2
chr7    55617986    55733593    E26T    19  CDC42P2
chr7    55733593    56412202    E26T    3   CICP11

And here's the seed region after running amplified_intervals.py.

chr7    55037285    55389661    22.441628204    /exports/igmm/eddie/WGS/variants/bcbio/E26/E26T/E26-cnvkit.bed

For some reason it left that one segment of CN 20 out of the seed region.

By manually merging the region, do you mean simply altering the start coordinate of the bed file produced by amplified_intervals.py?

Thanks! A

alhafidzhamdan commented 3 years ago

Here's the merged seed version. Seems to have combined them.

Seems like now it's ecdna.

Another question is why are the vertical lines describing coverage/CNs now seem more sparse compared to the original one?

jluebeck commented 3 years ago

Awesome, glad that worked out.

The visualized density of the grey bars describing coverage can indeed vary across different amplicon segment sizes. For the previous, unmerged plot, the density was higher because the segment was smaller. Unfortunately control of the aspects of these plots is not something AA provides a lot of manual user control over.

Best, Jens

alhafidzhamdan commented 3 years ago

No worries- thanks. Just while we are here: I'm struggling to understand what the information detailed in amplicon_graph.txt and amplicon_cycles.txt mean. I wonder if you could help me understand ( this is related to another amplicon of another sample).

cat E28T_amplicon1_graph.txt

SequenceEdge: StartPosition, EndPosition, PredictedCopyCount, AverageCoverage, Size, NumberReadsMapped
sequence    chr7:54441020-  chr7:54541155+  2.778255680525683   14.454116203    100135  14492
sequence    chr7:54541156-  chr7:54549418+  43.71886306210519   174.60766857    8262    14446
sequence    chr7:54549419-  chr7:54952294+  43.97425378122906   220.037355632   402875  887594
sequence    chr7:54952295-  chr7:54953096+  7.528738256400183   44.0841376199   801 354
sequence    chr7:54953097-  chr7:55153073+  43.97425378122925   224.148049516   199976  448809
sequence    chr7:55153074-  chr7:55164464+  12.722492535147941  64.4610094561   11390   7352
sequence    chr7:55164465-  chr7:55166692+  43.974253781229216  233.816875088   2227    5216
sequence    chr7:55166693-  chr7:55167415+  3.6048782924591425  18.9249929158   722 137
sequence    chr7:55167416-  chr7:55219064+  43.9742537812294    223.113493807   51648   115381
sequence    chr7:55219065-  chr7:55221367+  1.429267778718257   7.7193284491    2302    178
sequence    chr7:55221368-  chr7:55391129+  43.97425378122945   216.451433884   169761  367915
sequence    chr7:55391130-  chr7:55491063+  3.033646399649837   14.6162532576   99933   14625
BreakpointEdge: StartPosition->EndPosition, PredictedCopyCount, NumberOfReadPairs, HomologySizeIfAvailable(<0ForInsertions), Homology/InsertionSequence
source  chr7:-1-->chr7:54441020-    2.778255680525678   14  None    None
concordant  chr7:54541155+->chr7:54541156-  2.7782556805256777  10  None    None
source  chr7:-1-->chr7:54549419-    0.25539072622343667 13  None    None
concordant  chr7:54549418+->chr7:54549419-  43.718863062104994  147 None    None
concordant  chr7:54952294+->chr7:54952295-  7.528738256400177   23  None    None
concordant  chr7:54953096+->chr7:54953097-  7.528738256400156   23  None    None
concordant  chr7:55153073+->chr7:55153074-  12.722492535147937  42  None    None
concordant  chr7:55164464+->chr7:55164465-  12.722492535147966  36  None    None
concordant  chr7:55166692+->chr7:55166693-  3.6048782924591265  3   None    None
concordant  chr7:55167415+->chr7:55167416-  3.6048782924591336  24  None    None
concordant  chr7:55219064+->chr7:55219065-  1.4292677787182604  3   None    None
concordant  chr7:55221367+->chr7:55221368-  1.4292677787182575  3   None    None
concordant  chr7:55391129+->chr7:55391130-  3.0336463996498373  3   None    None
source  chr7:-1-->chr7:55491063+    3.033646399649835   19  None    None
discordant  chr7:54953097-->chr7:54952294+  36.44551553192849   130 0   
discordant  chr7:55164465-->chr7:55153073+  31.25176125318087   130 -2  TT
discordant  chr7:55167416-->chr7:55166692+  40.36937549586971   67  None    None
discordant  chr7:55221368-->chr7:55219064+  42.54498600961079   153 15  TGCAATGTGCTTTTA
discordant  chr7:55391129+->chr7:54541156-  40.940607388679155  41  2   GC

what does sequence/concordant/source/discordant indicate?
what does the sequence TGCAATGTGCTTTTA indicate? microhomology?

And

cat E28T_amplicon1_cycles.txt

Interval    1   chr7    54441020    55491063
List of cycle segments
Segment 1   chr7    54441020    54952294
Segment 2   chr7    54541156    54952294
Segment 3   chr7    54541156    55166692
Segment 4   chr7    54953097    55153073
Segment 5   chr7    54953097    55219064
Segment 6   chr7    55164465    55166692
Segment 7   chr7    55167416    55219064
Segment 8   chr7    55221368    55391129
Segment 9   chr7    55221368    55491063
Cycle=1;Copy_count=31.2517612532;Segments=2+,4+,6+,7+,8+
Cycle=2;Copy_count=7.5287382564;Segments=3+,7+,8+
Cycle=3;Copy_count=2.77825568053;Segments=0+,9-,5-,1-,0-

Where do these cycles get represented in the amplicon figure? For example I can't make out any separate cycles and read off the calculated copy counts ie no line at CN of 31.2 for example.
And what are the red vertical lines indicate?

Sorry for all the questions!

jluebeck commented 3 years ago

Hi,

The AA Github has documentation on the _graph.txt file format. Concordant edges are concordant with the reference genome (no SV), discordant edges link two non-adjacent positions in the reference genome, and source edges connect the location given to an unknown destination. Discordant edges are by far the most useful for downstream analysis. You are correct that the microhomology at the location is given, however the microhomology size in the preceding column will be <0 if the sequence represents a novel insertion instead.

The cycle you posted has been decomposed three ways, and the weights of those decompositions is provided as a "copy count". There is more information on the cycles file here, including descriptions of the edge colorings. It is a little hard to interpret such cycles from this text file.

The coloring of the edges relates to the orientation of the read pairs. Red indicates they are in the expected orientation. These are actually small "jumps" across regions of the genome (see the discordant edges in the graph file). These can be due to small deletions in this region.

Because the cycles file itself is not super intuitive to read, we provide a utility called CycleViz, which can visualize AA paths and cycles in Circos-style plots, just using the graph and cycles files. For instance, I ran the following commands on the cycle and graph you provided and got a nice visualization of cycle 1 which I think is most informative for interpreting what's happening here:

# convert cycles file to use same segmentation as graph file
 $CV_SRC/convert_cycles_file.py -c E28T_amplicon1_cycles.txt -g E28T_amplicon1_graph.txt
# run cycleviz with default parameters for gene selection, fontsize, etc.
$CV_SRC/CycleViz.py -g E28T_amplicon1_graph.txt --cycles_file E28T_amplicon1_BPG_converted_cycles.txt --cycle 1 --ref hg19

Looks like a very neat project you have! Please do not hesitate to reach out with more question or if you would like to discuss follow-up things!

Best regards, Jens

alhafidzhamdan commented 3 years ago

Hey there,

Thanks so much for your time! I understand it much better now- thanks for clarying what the red edges are. And the cycleviz.py tool is definitely a must- makes the edges much better to be visualised. I wonder if you could add a bed file as an inner circle? I checked the CycleViz.py script and there seems to be an option to add a bedgraph file. However it's throwing me this error:

CycleViz.py --cycles_file ../../../../WGS/variants/ecdna/AA_results/E28T/E28T_amplicon1_cycles.txt --cycle 1 --graph ../../../../WGS/variants/ecdna/AA_results/E28T/E28T_amplicon1_graph.txt --ref "hg38" --rot 90 --gene_fontsize 8 --bedgraph ../../../../analysis/bed_regions/TF_shared.bed

Bad key text.latex.unicode in file /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle, line 112 ('text.latex.unicode : False # use "ucs" and "inputenc" LaTeX packages for handling')
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.3.2/matplotlibrc.template
or from the matplotlib source distribution

Bad key savefig.frameon in file /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle, line 423 ('savefig.frameon : True')
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.3.2/matplotlibrc.template
or from the matplotlib source distribution

Bad key pgf.debug in file /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle, line 444 ('pgf.debug           : False')
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.3.2/matplotlibrc.template
or from the matplotlib source distribution

Bad key verbose.level in file /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle, line 475 ('verbose.level  : silent      # one of silent, helpful, debug, debug-annoying')
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.3.2/matplotlibrc.template
or from the matplotlib source distribution

Bad key verbose.fileo in file /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle, line 476 ('verbose.fileo  : sys.stdout  # a log filename, sys.stdout or sys.stderr')
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.3.2/matplotlibrc.template
or from the matplotlib source distribution
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle: 
The text.latex.preview rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle: 
The mathtext.fallback_to_cm rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle: Support for setting the 'mathtext.fallback_to_cm' rcParam is deprecated since 3.3 and will be removed two minor releases later; use 'mathtext.fallback : 'cm' instead.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle: 
The validate_bool_maybe_none function was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle: 
The savefig.jpeg_quality rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle: 
The keymap.all_axes rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle: 
The animation.avconv_path rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle: 
The animation.avconv_args rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
hg38
Reading genes
Unaligned fraction cutoff set to 0.016666666666666666
checking adjacency
Traceback (most recent call last):
  File "CycleViz.py", line 438, in <module>
    bed_data = parse_bed(bed)
  File "/gpfs/igmmfs01/eddie/WGS/scripts/AmpliconArchitect/src/CycleViz/bionanoUtil.py", line 215, in parse_bed
    with open(bedfile) as infile:
FileNotFoundError: [Errno 2] No such file or directory: 's'

jluebeck commented 3 years ago

Hi,

Unfortunately CycleViz does not yet accept bedgraph or other feature files for interior plotting. It's on my to-do list, however it will be some weeks before I have time to implement that feature. There are some common features we already know we should support. May I ask what kind of data you would like to plot on the interior, so that I can determine what all we should support?

Best regards, Jens

alhafidzhamdan commented 3 years ago

Hi, just the usual epigenomic data plotting like TF binding/chip-seq data. I'm sure this is already in your mind and i'll wait for your updates! Thanks

alhafidzhamdan commented 3 years ago

Hi @jluebeck, When you created that cycle diagram above i noticed you used the convert cycles script. I tried that script and one of the lines (for another sample) says

Cycle=5;Copy_count=5.27662141446;Length=2938Kbp;Segments=0+,78-,77-,76-,75-,74-,73-,72-,71-,70-,0-;Circular=FALSE;TRIVAL=TRUE

Does this mean this cycle is not considered circular? And what does "TRIVAL" mean?

Thanks! A

jluebeck commented 3 years ago

Sorry for the delay in responding -

That is correct - the path is non-cyclic. If a path is flanked by source edges (0+, 0-), it is not a cyclic path in the graph. 'TRIVAL' [sic] note indicates that the path is trivial - meaning that it has minimal deviation from a sequence of the reference genome. This annotation was used in our AmpliconReconstructor paper when selecting appropriate samples to use in the simulation study.

On another front - we are getting ready to release updates to CycleViz which allow for plotting of these kinds of interior tracks you suggested in your previous comment. We are probably about a week away from having it out and I will drop a comment when it is ready to use.

Best, Jens

alhafidzhamdan commented 3 years ago

Hi there! Thanks a lot. Does that mean cycles annotated with TRIVAL=TRUE should be discarded? aka artifact? As some of my cycles are circular but TRIVAL=TRUE. Congrats on that paper- that's a massive achievement!

I am looking into reconstructing my ecdna breakpoints but I don't have bionano data. Do you think relying on breakpoint graphs generated by AA using short read data is accurate enough for crispr experiments?

I look forward to that release- thank you for all you do and hope you have a lovely christmas period ahead.

Al

jluebeck commented 3 years ago

Cycles which are marked as "trivial" have a simple structure, however they can still participate in ecDNA, etc. This marking is only to help filter complex vs. simple cycles, and does not indicate artifact.

If the structure of the amplicon is relatively simple, the correct structure can often be determined manually or with some minimal automation. I have a script to brute force solve a plausible structure from the AA graph alone, if you are interested in trying it, send me an email (jluebeck ([at]) ucsd.edu. The script is otherwise not released publicly yet.

Likewise to you on the holidays, and I hope to have some updates soon.

Best regards, Jens

AmpliconSuite / AmpliconClassifier

Classification #1