MariaNattestad / Assemblytics

Assemblytics is a bioinformatics tool to detect and analyze structural variants from a genome assembly by comparing it to a reference genome.
http://assemblytics.com
MIT License

Understanding output bed file #30

Closed (ViriatoII closed this issue 4 years ago)

ViriatoII commented 4 years ago

Hi!

Thank you very much for your software! I'm trying to understand the output bed file I generate (I'm running Assemblytics through RaGOO), particularly the last 3 columns: method, query_gap_ovlp and ref_gap_ovlp.

1) What does it mean to be between alignments and within alignment?

2) What exactly is the gap overlap? In my case I expected the ref gap overlap to be higher, since my reference has Ns (gaps), but it is always 0. On the other hand, my query is PacBio reads without Ns, yet the query gap overlap varies between 0 and 0.4.

3) What is query ID, is it Chr0_RaGOO in my case?

cat assemblytics_out.Assemblytics_structural_variants.bed

reference   ref_start   ref_stop    ID  size    strand  type    ref_gap_size    query_gap_size  query_coordinates   method  query_gap_ovlp  ref_gap_ovlp
NC_024459.2 11616437    11621053    Assemblytics_b_35   390 +   Repeat_contraction  4616    4226    Chr0_RaGOO:173153-177379:+  between_alignments  0.047326076668244205    0
NC_024467.2 109161822   109163268   Assemblytics_b_77   108 +   Repeat_expansion    1446    1554    Chr0_RaGOO:361597-363151:+  between_alignments  0   0
NC_024467.2 63456481    63457373    Assemblytics_b_85   151 +   Repeat_contraction  892 741 Chr0_RaGOO:381099-381840:+  between_alignments  0   0
NC_024462.2 211455163   211456195   Assemblytics_b_105  117 +   Repeat_contraction  1032    915 Chr0_RaGOO:488221-489136:+  between_alignments  0   0
NC_024468.2 13053122    13053904    Assemblytics_b_112  53  +   Repeat_contraction  782 729 Chr0_RaGOO:524628-525357:+  between_alignments  0   0
NC_024464.2 76662942    76663032    Assemblytics_b_116  124 +   Repeat_expansion    90  214 Chr0_RaGOO:537110-537324:+  between_alignments  0   0
NC_024467.2 126319680   126320771   Assemblytics_b_120  1091    +   
NW_017972161.1  35583   35583   Assemblytics_w_8326 63  +   Insertion   0   63  Chr0_RaGOO:4439211-4439274:+    within_alignment    0   0
NW_017972161.1  36632   36632   Assemblytics_w_8327 67  +   Insertion   0   67  Chr0_RaGOO:4440324-4440391:+    within_alignment    0   0
NW_017972161.1  37228   37292   Assemblytics_w_8328 64  +   Deletion    64  0   Chr0_RaGOO:4440987-4440987:+    within_alignment    0   0
NW_017972161.1  37196   37196   Assemblytics_w_8329 67  +   Insertion   0   67  Chr0_RaGOO:11550635-11550702:+  within_alignment    0   0
NW_017972161.1  38245   38245   Assemblytics_w_8330 63  +   Insertion   0   63  Chr0_RaGOO:11551752-11551815:+  within_alignment    0   0
NW_017972070.1  8591    8766    Assemblytics_w_8331 175 +   Deletion    175 0   Chr0_RaGOO:13749835-13749835:+  within_alignment    0   0

Kind regards, Ricardo

MariaNattestad commented 4 years ago

Hi Ricardo

  1. MUMmer has aligned the query sequences (e.g. chromosomes or contigs) to the reference sequences. If there is more than one alignment between a query and a reference, Assemblytics analyzes the distances between the alignments and calls "between alignments" variants from them. "Within alignment" variants are found inside a single alignment, so these tend to be smaller. The details are described in the paper, specifically in the supplement: https://www.biorxiv.org/content/10.1101/044925v1
  2. Those overlap columns must have been added by RaGOO, since they are not from the original Assemblytics. See an example here: http://assemblytics.com/analysis.php?code=yeast
  3. Yes, query ID is Chr0_RaGOO in your case. It's the name of the query sequence, like the contig ID or chromosome name.

Maria
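As an illustration of point 3, the following is a minimal sketch of how one row of this file could be split into named columns and how the query ID could be pulled out of the query_coordinates field. The column names come from the header line quoted above; the helper names (`parse_row`, `parse_query_coordinates`, `QueryRegion`) are hypothetical, and the sketch assumes whitespace-separated fields with no spaces inside a field.

```python
from typing import NamedTuple

# Column names copied from the header line of the quoted bed file.
COLUMNS = [
    "reference", "ref_start", "ref_stop", "ID", "size", "strand", "type",
    "ref_gap_size", "query_gap_size", "query_coordinates", "method",
    "query_gap_ovlp", "ref_gap_ovlp",
]


class QueryRegion(NamedTuple):
    query_id: str
    start: int
    end: int
    strand: str


def parse_row(line: str) -> dict:
    """Map one data line onto the header names (assumes whitespace-separated fields)."""
    return dict(zip(COLUMNS, line.split()))


def parse_query_coordinates(field: str) -> QueryRegion:
    """Split 'name:start-end:strand'; rsplit keeps a ':' inside the name intact."""
    query_id, span, strand = field.rsplit(":", 2)
    start, end = span.split("-")
    return QueryRegion(query_id, int(start), int(end), strand)


# First data row quoted in this thread.
row = parse_row(
    "NC_024459.2 11616437 11621053 Assemblytics_b_35 390 + Repeat_contraction "
    "4616 4226 Chr0_RaGOO:173153-177379:+ between_alignments 0.047326076668244205 0"
)
region = parse_query_coordinates(row["query_coordinates"])
print(region.query_id, row["method"])
```

Here the query ID is simply the part of query_coordinates before the coordinates, which matches Maria's description of it as the name of the query sequence.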

ViriatoII commented 4 years ago

Thanks!

Have a nice day, Ricardo

ViriatoII commented 4 years ago

Good morning @MariaNattestad ,

Coming back to the output bed file, how do I make sense of the coordinates, sizes and gap sizes?

For example, here I have:

ref_start ref_end size type ref_gap_size query_gap_size
11616437 11621053 390 Repeat_contraction 4616 4226
109161822 109163268 108 Repeat_expansion 1446 1554

So: ref_end - ref_start = ref_gap_size
and
|ref_gap_size - query_gap_size| = size

What does this mean for my called SVs? Where do the SVs actually start and end? And why are these gap sizes in the picture?

Thank you, Ricardo

MariaNattestad commented 4 years ago

The alignments look something like this for repeat contractions and expansions:

[Figure: repeat expansions and contractions diagram, from the Assemblytics paper main text]

The information in the file describes the gap between those two alignments, both from the perspective of the reference (i.e. ref_start, ref_end, and ref_gap_size) and of the query. You ask where your SVs actually start and end: the simplest answer is just ref_start and ref_end. The extra information is there to give you more context.
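The arithmetic discussed in this exchange can be checked on the two rows Ricardo quoted. This is only a sketch: the relations below are the ones observed in those rows (the reference span equals ref_gap_size, and size equals the absolute difference of the two gap sizes), not a statement of how Assemblytics computes them internally. The function name `derived_sizes` is hypothetical.

```python
# The two rows quoted in the question, with the column names used there.
quoted_rows = [
    {"ref_start": 11616437, "ref_end": 11621053, "size": 390,
     "type": "Repeat_contraction", "ref_gap_size": 4616, "query_gap_size": 4226},
    {"ref_start": 109161822, "ref_end": 109163268, "size": 108,
     "type": "Repeat_expansion", "ref_gap_size": 1446, "query_gap_size": 1554},
]


def derived_sizes(row: dict) -> tuple:
    """Recompute the reference span and the variant size from the other columns."""
    ref_span = row["ref_end"] - row["ref_start"]
    # abs() because the query gap is larger than the reference gap for expansions.
    size = abs(row["ref_gap_size"] - row["query_gap_size"])
    return ref_span, size


results = [derived_sizes(r) for r in quoted_rows]
for row, (ref_span, size) in zip(quoted_rows, results):
    # Each line reports whether the recomputed values match the reported columns.
    print(row["type"], ref_span == row["ref_gap_size"], size == row["size"])
```

For both rows the recomputed values agree with the reported columns, so the variant size here is the net difference between the gap as seen on the reference and the gap as seen on the query.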

ViriatoII commented 4 years ago

Thank you once more! I'll read the paper more thoroughly. Have a nice day.