daveuu / baga

Bacterial and Archaeal Genome Analyser
GNU General Public License v3.0

Repeats plot interpretation #2

Closed: pauruihu closed this issue 8 years ago

pauruihu commented 8 years ago

Hi! I've searched the code for the meaning of the different colours used in the repeats plots, but I can't figure it out on my own (I'm still a beginner in this field...). To interpret my data properly, could you tell me what each colour means? I have regions in blue, purple and pink in some pairs. Thank you.

daveuu commented 8 years ago

Hi Paula: good question - you shouldn't have to figure it out yourself! I haven't published the documentation yet, so until I do (which should be in a few days), here is an explanation:

Purple blocks indicate regions that share at least 98% nucleotide identity between the pair of regions in the current plot and that are longer than the average sequencing fragment insert size. Within these regions, variants called from aligned reads are deemed ambiguous with respect to chromosomal location, because the paired-end fragments are too short to resolve which repeat the reads should be aligned to. If you apply the baga repeats filter, variants in these regions will be marked in the VCF files and omitted from further analysis.

Pink blocks indicate regions of at least 98% nucleotide identity that are shorter than the average sequencing fragment insert size. Although these regions are part of repeats, your paired-end reads should be able to resolve where in the chromosome they are, so the reads can be used to call variants in them unambiguously.
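To make the purple/pink rule concrete, here is a minimal sketch in Python. It is not baga's actual code: the function name, its parameters, and the choice to treat a block exactly as long as the insert size as pink are assumptions made for illustration.

```python
def classify_block(identity, length, mean_insert_size, min_identity=0.98):
    """Return the colour a block would get within one pair-wise plot.

    identity          fraction of identical nucleotides in the block (0.0 to 1.0)
    length            block length in bp
    mean_insert_size  average sequencing fragment insert size in bp
    min_identity      identity threshold (98% in the description above)

    Hypothetical helper for illustration only; not part of baga.
    """
    if identity < min_identity:
        return None  # below the identity threshold: no coloured block
    if length > mean_insert_size:
        # Longer than a fragment: read pairs cannot span the repeat,
        # so variant calls inside it are ambiguous.
        return "purple"
    # Short enough for read pairs to anchor in unique flanking sequence.
    # Assumption: a block exactly as long as the insert size counts as pink.
    return "pink"


print(classify_block(0.99, 800, 500))
print(classify_block(0.99, 300, 500))
```

With a 500 bp mean insert size, the 800 bp block comes out purple and the 300 bp block pink.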

Blue blocks indicate regions within which variants called from aligned reads are omitted because of pair-wise nucleotide identity comparisons in other repeat pairs (shown in other plots). They correspond to purple blocks in other figures that feature one of the two repeat regions displayed in the current figure. Because the plots are pair-wise, you should only see blue blocks for 'homologous groups' (regions sharing the same sequence at high identity) with three or more repeats. A sequence present in just two copies produces a single pair and a single plot.
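The reason blue only appears for groups of three or more can be seen by enumerating the pairs. This is a rough sketch rather than baga's implementation; the function names and the repeat labels "A", "B", "C" are made up for the example.

```python
from itertools import combinations


def pairwise_plots(group):
    """All pair-wise plots for one homologous group of repeats."""
    return list(combinations(group, 2))


def possible_blue_sources(group, current_pair):
    """Other pairs that share a repeat with the current plot.

    Purple blocks in any of these pairs could show up as blue blocks
    in the current plot. Hypothetical helper for illustration only.
    """
    current = set(current_pair)
    return [
        pair
        for pair in combinations(group, 2)
        if set(pair) != current and set(pair) & current
    ]


# Two copies: one pair, one plot, nothing that could be drawn in blue.
print(pairwise_plots(["A", "B"]))
print(possible_blue_sources(["A", "B"], ("A", "B")))

# Three copies: three plots; the A-B plot can show blue from A-C and B-C.
print(pairwise_plots(["A", "B", "C"]))
print(possible_blue_sources(["A", "B", "C"], ("A", "B")))
```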

Other aspects of the plot: for each region, the percent nucleotide identity to the other repeat in the same plot is drawn in light grey as a moving window average. Open reading frames are shown in the lower two lanes, labelled with locus ID and, where available, gene name. Large genomic elements such as prophages and genomic islands are shown in the upper-most lane when present. This information is taken from the genome annotation obtained from GenBank or RefSeq by the baga/baga_cli.py CollectData --genomes ACCESSION command.
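For anyone unsure what a moving window average of percent identity looks like, here is a small illustration. It assumes the two repeats are already aligned to equal length and does no special gap handling; the function name and window size are arbitrary choices for the example, not values taken from baga.

```python
def windowed_identity(seq_a, seq_b, window=5):
    """Percent identity in each sliding window over two aligned sequences.

    Illustration only: assumes equal-length, pre-aligned sequences and
    compares characters position by position.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if window < 1 or window > len(seq_a):
        raise ValueError("window must be between 1 and the sequence length")
    # True where the two sequences agree at a position
    matches = [a == b for a, b in zip(seq_a, seq_b)]
    return [
        100.0 * sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]


# One mismatch at the last position lowers only the final window.
print(windowed_identity("AAAAAA", "AAAAAT", window=3))
```

A larger window gives a smoother line at the cost of blurring short stretches of divergence.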

daveuu commented 8 years ago

Do ask again if anything is not explained clearly enough.

pauruihu commented 8 years ago

Crystal-clear! Thank you very much!