Genome-Bioinformatics-RadboudUMC / DeNovoCNN

A deep learning approach to de novo variant calling in next generation sequencing data
GNU General Public License v3.0
12 stars 2 forks source link

BAM_CPAD cigar type not supported #11

Closed FlorianErger closed 10 months ago

FlorianErger commented 10 months ago

Dear Gelana,

we are trying to implement DeNovoCNN into our pipeline and are facing some problems. I think I have a working setup, but during execution of the script:

sudo docker run -v /mnt/NFS/hum_workbench/Datenübernahme:/workdir gelana/denovocnn:1.0 /app/apply_denovocnn.sh --workdir=/workdir --child-vcf="/workdir/32796S13(paired)_Identified_Variants_08102023.vcf" --father-vcf="/workdir/32797S14(paired)_Identified_Variants_08102023.vcf" --mother-vcf="/workdir/32798S15(paired)_Identified_Variants_09102023.vcf" --child-bam="/workdir/32796S13(paired)_Read_Mapping_08102023.bam" --father-bam="/workdir/32797S14(paired)_Read_Mapping_08102023.bam" --mother-bam="/workdir/32798S15(paired)_Read_Mapping_09102023.bam" --snp-model=/app/models/snp --in-model=/app/models/ins --del-model=/app/models/del --genome="/workdir/bed_vcf_Referenzdateien/hg19.fa" --output=32796_predictions.csv

a series of errors occur, the first and I think relevant being:

Traceback (most recent call last): File "/app/denovonet/dataset.py", line 236, in load_variant return SingleVariant(str(chromosome), int(start), int(end), bam, reference_genome) File "/app/denovonet/variants.py", line 61, in init self.encode_pileup() File "/app/denovonet/variants.py", line 112, in encode_pileup self._get_read_encoding(read, False) File "/app/denovonet/variants.py", line 161, in _get_read_encoding base_end_position, genome_end_position = self._update_positions( File "/app/denovonet/variants.py", line 254, in _update_positions raise ValueError('Unsupported cigar value: {}'.format(cigar_value)) ValueError: Unsupported cigar value: 6

Our aligner uses the cigar type 6 (BAM_CPAD or "P" in the cigar string) on occasion. The other currently unimplemented types are not used, so for us the problem is only the type 6/"P".

Would it be possible to implement the cigar value 6?

Thanks and best, Florian

gelana commented 10 months ago

Hi @FlorianErger, Thanks for using the tool!

Indeed, the tool breaks right now when encountering an unknown cigar value. A potential quick fix could involve skipping variants (without de novo calling) that have the unknown cigar value.

I understand, that if the cigar value of 6 is prevalent in your dataset, this solution may not be viable. But if this unknown cigar value is rare, this approach could allow the tool to run smoothly across the entire dataset, only omitting a small number of variants.

Please let me know what you think about that.

Best, Gelana

FlorianErger commented 10 months ago

Dear Gelana,

thanks for the suggestion. Indeed, we are currently working around the issue by filtering the bam first and getting rid of the reads (<1% in our alignments). The tool then works fine, but the additional step increases the runtime significantly from ~5 min to ~40 minutes.

If the read skipping could be implemented in your tool, it would be very useful. Most likely, all variants would still be covered sufficiently and no variants would be "lost" at all.

Best, Florian

FlorianErger commented 10 months ago

If anyone else has similar issues, I solved it by changing the encode_pileup function in variants.py as follows:

def encode_pileup(self): """ Iterates over all the reads in the area of interest and encodes every read as 2 numpy arrays: encoded nucleotides and corresponding qualities """ for idx, read in enumerate(self.bam_data.fetch(reference=self.chromosome, start=self.start, end=self.end)): if idx >= IMAGE_HEIGHT: break self.pileup_encoded[idx, :], self.quality_encoded[idx, :] = ( self._get_read_encoding(read, False) )

to

def encode_pileup(self): """ Iterates over all the reads in the area of interest and encodes every read as 2 numpy arrays: encoded nucleotides and corresponding qualities """ count = 0 for read in self.bam_data.fetch(reference=self.chromosome, start=self.start, end=self.end): if count >= IMAGE_HEIGHT: break try: self.pileup_encoded[count, :], self.quality_encoded[count, :] = ( self._get_read_encoding(read, False) ) count += 1 except: continue

This skips bad reads.

I also changed

def start_coverage(self): start_coverage_arrays = self.bam_data.count_coverage(self.chromosome, self.start-1, self.start) return sum([coverage[0] for coverage in start_coverage_arrays])

to

def start_coverage(self): try: start_coverage_arrays = self.bam_data.count_coverage(self.chromosome, self.start-1, self.start) return sum([coverage[0] for coverage in start_coverage_arrays]) except: return 0

but I'm not certain it's necessary (had a crash while tinkering that was fixed by this change, but then made subsequent change to encode_pileup as detailed above which may have also solved the issue)...

Best, Florian