Add support for long-read bams (genome)

bcgsc / mavis

Merging, Annotation, Validation, and Illustration of Structural variants

http://mavis.bcgsc.ca

GNU General Public License v3.0

Add support for long-read bams (genome) #210

Open oneillkza opened 4 years ago

oneillkza commented 4 years ago

Reading in vcfs from variant callers that run on long-read bams is only part of the problem. MAVIS still needs bam files for most operations. Long-read bams differ from short-read ("NGS") bams in a few key ways:

Single end rather than paired-end
Variable (and long) read length
Relatively high error rate (5-10%), especially for homopolymers

These properties make long reads very good for detecting large structural variants, especially since they can map through low-complexity regions, but less well suited to smaller variants.
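As a rough illustration of the first two differences, here is a minimal sketch of how a sample of reads could be checked for "long-read-like" properties. It is not MAVIS code: the function name `looks_like_long_read` and the 1000 bp threshold are assumptions made up for this example. The only fixed fact it relies on is that SAM flag bit 0x1 marks a read as part of a paired-end template.

```python
from statistics import mean

PAIRED_FLAG = 0x1  # SAM flag bit: read belongs to a multi-segment (paired-end) template


def looks_like_long_read(reads, min_mean_length=1000):
    """Guess whether sampled reads come from a long-read bam.

    `reads` is an iterable of (sam_flag, query_length) tuples.
    `min_mean_length` is an arbitrary illustrative cutoff, not a MAVIS setting.
    """
    reads = list(reads)
    if not reads:
        raise ValueError('no reads sampled')
    # long-read bams are single end, so no read should carry the paired flag
    any_paired = any(flag & PAIRED_FLAG for flag, _ in reads)
    lengths = [length for _, length in reads]
    return (not any_paired) and mean(lengths) >= min_mean_length
```

In practice the tuples would come from whatever bam reader is in use; plain tuples keep the sketch free of dependencies.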

This ticket is to track work on reading in long-read genome bams.

oneillkza commented 4 years ago

So, the first major design decision is to create a new file type, genome_longread, for long-read genomic bams. This is distinct from genome, which covers short-read, paired-end genomic bams. I'm probably going to be copying a lot of the code that handles the genome bam type, but I think that'll be cleaner than having if statements everywhere.

e.g. in stats I've created compute_genome_longread_bam_stats, which is a modified copy of compute_genome_bam_stats
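To give an idea of why a separate stats function makes sense: insert-size statistics are not meaningful for single-end reads, while the read-length distribution matters more once lengths vary. The sketch below is an assumption about what such a function might summarize, not the actual body of compute_genome_longread_bam_stats. The helper names `read_length_n50` and `summarize_read_lengths` are hypothetical, and it takes a plain list of lengths instead of a bam handle.

```python
from statistics import median


def read_length_n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError('no read lengths given')


def summarize_read_lengths(lengths):
    """Hypothetical long-read summary: read-length distribution, no insert sizes."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError('no read lengths given')
    return {
        'median_read_length': median(lengths),
        'read_length_n50': read_length_n50(lengths),
        'max_read_length': max(lengths),
    }
```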

oneillkza commented 4 years ago

OK, got it as far as being able to do config and setup. Clustering works, but it fails on validate.

ValueError: ('protocol error', 'genome_longread')

This is somewhat unsurprising. Looks like the next step is to create a class in validate/evidence.py, and a case in validate/main.py that maps the genome_longread protocol to it.
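A minimal sketch of that kind of protocol-to-class mapping, to show where the error comes from and what the fix would look like. The classes here are empty stand-ins, `GenomeLongReadEvidence` and `evidence_class_for` are hypothetical names, and the real code in validate/main.py may be structured differently (for example as an if/elif chain).

```python
class GenomeEvidence:
    """Stand-in for the existing short-read, paired-end genome evidence class."""


class GenomeLongReadEvidence:
    """Hypothetical evidence class for the new genome_longread protocol."""


# one entry per supported protocol; adding a protocol means adding one line here
EVIDENCE_BY_PROTOCOL = {
    'genome': GenomeEvidence,
    'genome_longread': GenomeLongReadEvidence,
}


def evidence_class_for(protocol):
    """Return the evidence class registered for a protocol name."""
    try:
        return EVIDENCE_BY_PROTOCOL[protocol]
    except KeyError:
        # two-argument ValueError prints as ('protocol error', '<name>')
        raise ValueError('protocol error', protocol) from None
```

Without the 'genome_longread' entry, the lookup falls through to the ValueError, which is the same shape as the message reported above.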