Closed alhafidzhamdan closed 1 year ago
Hi,
"Complex non-cyclic" is our newer terminology for "heavily rearranged". Regarding the example you posted, the right endpoint of the segment on the left, and the left endpoint of the segment on the right are very close in the reference genome. On visual inspection, this sample looks like it is probably ecDNA, however because there is a gap in the two seed regions, it does not close the cycle, and thus AmpliconClassifier calls it complex non-cyclic. Did you run ampified_intervals.py
in the AA repo on your CNV seeds prior to running AA? Typically such close regions get merged, so that the region in the middle is included - closing the cycle. You could also re-run AA using a manually merged seed region that includes the absolute left and right bounds from both segments. This will likely improve the classification reliability in the case you posted. Please do not hesitate to let me know if there are more questions.
Best regards, Jens
Hey there, Thanks for getting back. Here's my original bed file
chr7 51530925 54155950 E26T 3 AC005999.2
chr7 54155950 55037285 E26T 20 AC092848.1
chr7 55037285 55060797 E26T 32 EGFR
chr7 55060797 55066800 E26T 18 EGFR
chr7 55066800 55132836 E26T 32 EGFR
chr7 55132836 55135847 E26T 17 EGFR
chr7 55135847 55154336 E26T 4 EGFR
chr7 55154336 55219299 E26T 19 EGFR
chr7 55219523 55220476 E26T 3 EGFR
chr7 55221270 55377914 E26T 19 ELDR
chr7 55377914 55378664 E26T 6 LANCL2
chr7 55378664 55389661 E26T 19 LANCL2
chr7 55389661 55401657 E26T 5 LANCL2
chr7 55401657 55403656 E26T 4 LANCL2
chr7 55403656 55617986 E26T 5 LANCL2
chr7 55617986 55733593 E26T 19 CDC42P2
chr7 55733593 56412202 E26T 3 CICP11
And here's the seed region after running amplified_intervals.py.
chr7 55037285 55389661 22.441628204 /exports/igmm/eddie/WGS/variants/bcbio/E26/E26T/E26-cnvkit.bed
For some reason it left that one segment of CN 20 out of the seed region.
By manually merging the region, do you mean simply altering the start coordinate of the bed file produced by amplified_intervals.py?
Thanks! A
Here's the merged seed version. Seems to have combined them.
Seems like now it's ecdna.
Another question is why are the vertical lines describing coverage/CNs now seem more sparse compared to the original one?
Awesome, glad that worked out.
The visualized density of the grey bars describing coverage can indeed vary across different amplicon segment sizes. For the previous, unmerged plot, the density was higher because the segment was smaller. Unfortunately control of the aspects of these plots is not something AA provides a lot of manual user control over.
Best, Jens
No worries- thanks. Just while we are here: I'm struggling to understand what the information detailed in amplicon_graph.txt and amplicon_cycles.txt mean. I wonder if you could help me understand ( this is related to another amplicon of another sample).
cat E28T_amplicon1_graph.txt
SequenceEdge: StartPosition, EndPosition, PredictedCopyCount, AverageCoverage, Size, NumberReadsMapped
sequence chr7:54441020- chr7:54541155+ 2.778255680525683 14.454116203 100135 14492
sequence chr7:54541156- chr7:54549418+ 43.71886306210519 174.60766857 8262 14446
sequence chr7:54549419- chr7:54952294+ 43.97425378122906 220.037355632 402875 887594
sequence chr7:54952295- chr7:54953096+ 7.528738256400183 44.0841376199 801 354
sequence chr7:54953097- chr7:55153073+ 43.97425378122925 224.148049516 199976 448809
sequence chr7:55153074- chr7:55164464+ 12.722492535147941 64.4610094561 11390 7352
sequence chr7:55164465- chr7:55166692+ 43.974253781229216 233.816875088 2227 5216
sequence chr7:55166693- chr7:55167415+ 3.6048782924591425 18.9249929158 722 137
sequence chr7:55167416- chr7:55219064+ 43.9742537812294 223.113493807 51648 115381
sequence chr7:55219065- chr7:55221367+ 1.429267778718257 7.7193284491 2302 178
sequence chr7:55221368- chr7:55391129+ 43.97425378122945 216.451433884 169761 367915
sequence chr7:55391130- chr7:55491063+ 3.033646399649837 14.6162532576 99933 14625
BreakpointEdge: StartPosition->EndPosition, PredictedCopyCount, NumberOfReadPairs, HomologySizeIfAvailable(<0ForInsertions), Homology/InsertionSequence
source chr7:-1-->chr7:54441020- 2.778255680525678 14 None None
concordant chr7:54541155+->chr7:54541156- 2.7782556805256777 10 None None
source chr7:-1-->chr7:54549419- 0.25539072622343667 13 None None
concordant chr7:54549418+->chr7:54549419- 43.718863062104994 147 None None
concordant chr7:54952294+->chr7:54952295- 7.528738256400177 23 None None
concordant chr7:54953096+->chr7:54953097- 7.528738256400156 23 None None
concordant chr7:55153073+->chr7:55153074- 12.722492535147937 42 None None
concordant chr7:55164464+->chr7:55164465- 12.722492535147966 36 None None
concordant chr7:55166692+->chr7:55166693- 3.6048782924591265 3 None None
concordant chr7:55167415+->chr7:55167416- 3.6048782924591336 24 None None
concordant chr7:55219064+->chr7:55219065- 1.4292677787182604 3 None None
concordant chr7:55221367+->chr7:55221368- 1.4292677787182575 3 None None
concordant chr7:55391129+->chr7:55391130- 3.0336463996498373 3 None None
source chr7:-1-->chr7:55491063+ 3.033646399649835 19 None None
discordant chr7:54953097-->chr7:54952294+ 36.44551553192849 130 0
discordant chr7:55164465-->chr7:55153073+ 31.25176125318087 130 -2 TT
discordant chr7:55167416-->chr7:55166692+ 40.36937549586971 67 None None
discordant chr7:55221368-->chr7:55219064+ 42.54498600961079 153 15 TGCAATGTGCTTTTA
discordant chr7:55391129+->chr7:54541156- 40.940607388679155 41 2 GC
And
cat E28T_amplicon1_cycles.txt
Interval 1 chr7 54441020 55491063
List of cycle segments
Segment 1 chr7 54441020 54952294
Segment 2 chr7 54541156 54952294
Segment 3 chr7 54541156 55166692
Segment 4 chr7 54953097 55153073
Segment 5 chr7 54953097 55219064
Segment 6 chr7 55164465 55166692
Segment 7 chr7 55167416 55219064
Segment 8 chr7 55221368 55391129
Segment 9 chr7 55221368 55491063
Cycle=1;Copy_count=31.2517612532;Segments=2+,4+,6+,7+,8+
Cycle=2;Copy_count=7.5287382564;Segments=3+,7+,8+
Cycle=3;Copy_count=2.77825568053;Segments=0+,9-,5-,1-,0-
Sorry for all the questions!
Hi,
The AA Github has documentation on the _graph.txt file format. Concordant edges are concordant with the reference genome (no SV), discordant edges link two non-adjacent positions in the reference genome, and source edges connect the location given to an unknown destination. Discordant edges are by far the most useful for downstream analysis. You are correct that the microhomology at the location is given, however the microhomology size in the preceding column will be <0 if the sequence represents a novel insertion instead.
The cycle you posted has been decomposed three ways, and the weights of those decompositions is provided as a "copy count". There is more information on the cycles file here, including descriptions of the edge colorings. It is a little hard to interpret such cycles from this text file.
The coloring of the edges relates to the orientation of the read pairs. Red indicates they are in the expected orientation. These are actually small "jumps" across regions of the genome (see the discordant edges in the graph file). These can be due to small deletions in this region.
Because the cycles file itself is not super intuitive to read, we provide a utility called CycleViz, which can visualize AA paths and cycles in Circos-style plots, just using the graph and cycles files. For instance, I ran the following commands on the cycle and graph you provided and got a nice visualization of cycle 1 which I think is most informative for interpreting what's happening here:
# convert cycles file to use same segmentation as graph file
$CV_SRC/convert_cycles_file.py -c E28T_amplicon1_cycles.txt -g E28T_amplicon1_graph.txt
# run cycleviz with default parameters for gene selection, fontsize, etc.
$CV_SRC/CycleViz.py -g E28T_amplicon1_graph.txt --cycles_file E28T_amplicon1_BPG_converted_cycles.txt --cycle 1 --ref hg19
Looks like a very neat project you have! Please do not hesitate to reach out with more question or if you would like to discuss follow-up things!
Best regards, Jens
Hey there,
Thanks so much for your time! I understand it much better now- thanks for clarying what the red edges are. And the cycleviz.py tool is definitely a must- makes the edges much better to be visualised. I wonder if you could add a bed file as an inner circle? I checked the CycleViz.py script and there seems to be an option to add a bedgraph file. However it's throwing me this error:
CycleViz.py --cycles_file ../../../../WGS/variants/ecdna/AA_results/E28T/E28T_amplicon1_cycles.txt --cycle 1 --graph ../../../../WGS/variants/ecdna/AA_results/E28T/E28T_amplicon1_graph.txt --ref "hg38" --rot 90 --gene_fontsize 8 --bedgraph ../../../../analysis/bed_regions/TF_shared.bed
Bad key text.latex.unicode in file /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle, line 112 ('text.latex.unicode : False # use "ucs" and "inputenc" LaTeX packages for handling')
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.3.2/matplotlibrc.template
or from the matplotlib source distribution
Bad key savefig.frameon in file /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle, line 423 ('savefig.frameon : True')
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.3.2/matplotlibrc.template
or from the matplotlib source distribution
Bad key pgf.debug in file /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle, line 444 ('pgf.debug : False')
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.3.2/matplotlibrc.template
or from the matplotlib source distribution
Bad key verbose.level in file /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle, line 475 ('verbose.level : silent # one of silent, helpful, debug, debug-annoying')
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.3.2/matplotlibrc.template
or from the matplotlib source distribution
Bad key verbose.fileo in file /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle, line 476 ('verbose.fileo : sys.stdout # a log filename, sys.stdout or sys.stderr')
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.3.2/matplotlibrc.template
or from the matplotlib source distribution
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle:
The text.latex.preview rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle:
The mathtext.fallback_to_cm rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle: Support for setting the 'mathtext.fallback_to_cm' rcParam is deprecated since 3.3 and will be removed two minor releases later; use 'mathtext.fallback : 'cm' instead.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle:
The validate_bool_maybe_none function was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle:
The savefig.jpeg_quality rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle:
The keymap.all_axes rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle:
The animation.avconv_path rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
In /exports/igmm/eddie/WGS/anaconda/envs/snakemake/lib/python3.6/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle:
The animation.avconv_args rcparam was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
hg38
Reading genes
Unaligned fraction cutoff set to 0.016666666666666666
checking adjacency
Traceback (most recent call last):
File "CycleViz.py", line 438, in <module>
bed_data = parse_bed(bed)
File "/gpfs/igmmfs01/eddie/WGS/scripts/AmpliconArchitect/src/CycleViz/bionanoUtil.py", line 215, in parse_bed
with open(bedfile) as infile:
FileNotFoundError: [Errno 2] No such file or directory: 's'
Hi,
Unfortunately CycleViz does not yet accept bedgraph or other feature files for interior plotting. It's on my to-do list, however it will be some weeks before I have time to implement that feature. There are some common features we already know we should support. May I ask what kind of data you would like to plot on the interior, so that I can determine what all we should support?
Best regards, Jens
Hi, just the usual epigenomic data plotting like TF binding/chip-seq data. I'm sure this is already in your mind and i'll wait for your updates! Thanks
Hi @jluebeck, When you created that cycle diagram above i noticed you used the convert cycles script. I tried that script and one of the lines (for another sample) says
Cycle=5;Copy_count=5.27662141446;Length=2938Kbp;Segments=0+,78-,77-,76-,75-,74-,73-,72-,71-,70-,0-;Circular=FALSE;TRIVAL=TRUE
Does this mean this cycle is not considered circular? And what does "TRIVAL" mean?
Thanks! A
Sorry for the delay in responding -
That is correct - the path is non-cyclic. If a path is flanked by source edges (0+, 0-), it is not a cyclic path in the graph. 'TRIVAL' [sic] note indicates that the path is trivial - meaning that it has minimal deviation from a sequence of the reference genome. This annotation was used in our AmpliconReconstructor paper when selecting appropriate samples to use in the simulation study.
On another front - we are getting ready to release updates to CycleViz which allow for plotting of these kinds of interior tracks you suggested in your previous comment. We are probably about a week away from having it out and I will drop a comment when it is ready to use.
Best, Jens
Hi there! Thanks a lot. Does that mean cycles annotated with TRIVAL=TRUE should be discarded? aka artifact? As some of my cycles are circular but TRIVAL=TRUE. Congrats on that paper- that's a massive achievement!
I am looking into reconstructing my ecdna breakpoints but I don't have bionano data. Do you think relying on breakpoint graphs generated by AA using short read data is accurate enough for crispr experiments?
I look forward to that release- thank you for all you do and hope you have a lovely christmas period ahead.
Al
Cycles which are marked as "trivial" have a simple structure, however they can still participate in ecDNA, etc. This marking is only to help filter complex vs. simple cycles, and does not indicate artifact.
If the structure of the amplicon is relatively simple, the correct structure can often be determined manually or with some minimal automation. I have a script to brute force solve a plausible structure from the AA graph alone, if you are interested in trying it, send me an email (jluebeck ([at]) ucsd.edu. The script is otherwise not released publicly yet.
Likewise to you on the holidays, and I hope to have some updates soon.
Best regards, Jens
Hi @jluebeck
I'm trying your classification script and one of my amplicons showed up as "complex non cyclic"
Is this the same as "heavily-rearranged"? Thanks! A