Closed FlorianErger closed 12 months ago
Hi @FlorianErger, Thanks for using the tool!
Indeed, the tool currently breaks when it encounters an unknown CIGAR value. A potential quick fix would be to skip variants (without de novo calling) that have the unknown CIGAR value.
I understand that if CIGAR value 6 is prevalent in your dataset, this solution may not be viable. But if it is rare, this approach would let the tool run across the entire dataset while omitting only a small number of variants.
Please let me know what you think about that.
Best, Gelana
Dear Gelana,
thanks for the suggestion. Indeed, we are currently working around the issue by filtering the BAM first and removing the affected reads (<1% in our alignments). The tool then works fine, but the additional step increases the runtime significantly, from ~5 minutes to ~40 minutes.
If the read skipping could be implemented in your tool, it would be very useful. Most likely, all variants would still be covered sufficiently and no variants would be "lost" at all.
Best, Florian
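For anyone who wants to try the pre-filtering workaround described above, here is a rough sketch of how it could look with pysam. This is not the script used in this thread: the function names (`has_padding`, `drop_padded_reads`, `filter_bam`) are made up for illustration, and it assumes a coordinate-sorted input BAM so that the output can be indexed.

```python
BAM_CPAD = 6  # CIGAR operation "P" (padding) in the SAM/BAM specification


def has_padding(read):
    """Return True if the read's CIGAR contains a padding ("P") operation."""
    # cigartuples is None for reads without a CIGAR (e.g. unmapped reads)
    return any(op == BAM_CPAD for op, _ in (read.cigartuples or []))


def drop_padded_reads(reads):
    """Yield only the reads whose CIGAR has no padding operation."""
    for read in reads:
        if not has_padding(read):
            yield read


def filter_bam(in_path, out_path):
    """Copy in_path to out_path, omitting reads with a "P" operation."""
    import pysam  # imported lazily so the helpers above work without pysam

    with pysam.AlignmentFile(in_path, "rb") as src, \
            pysam.AlignmentFile(out_path, "wb", template=src) as dst:
        # until_eof=True streams every read, including unmapped ones
        for read in drop_padded_reads(src.fetch(until_eof=True)):
            dst.write(read)
    # Assumption: the input was coordinate-sorted, so the output is too
    pysam.index(out_path)
```

Something like `filter_bam("sample.bam", "sample.nopad.bam")` would then produce the file to pass to the tool; both paths are placeholders.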
If anyone else has similar issues, I solved it by changing the encode_pileup function in variants.py as follows:
def encode_pileup(self):
    """
    Iterates over all the reads in the area of interest and
    encodes every read as 2 numpy arrays:
    encoded nucleotides and corresponding qualities
    """
    for idx, read in enumerate(self.bam_data.fetch(reference=self.chromosome, start=self.start, end=self.end)):
        if idx >= IMAGE_HEIGHT:
            break
        self.pileup_encoded[idx, :], self.quality_encoded[idx, :] = (
            self._get_read_encoding(read, False)
        )
to
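def encode_pileup(self):
    """
    Iterates over all the reads in the area of interest and
    encodes every read as 2 numpy arrays:
    encoded nucleotides and corresponding qualities
    """
    # Count only successfully encoded reads so that skipped reads
    # do not leave empty rows in the arrays
    count = 0
    for read in self.bam_data.fetch(reference=self.chromosome, start=self.start, end=self.end):
        if count >= IMAGE_HEIGHT:
            break
        try:
            self.pileup_encoded[count, :], self.quality_encoded[count, :] = (
                self._get_read_encoding(read, False)
            )
            count += 1
        except Exception:
            # Skip reads that cannot be encoded (e.g. unknown CIGAR value);
            # Exception rather than a bare except keeps Ctrl-C working
            continue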
This skips bad reads.
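To illustrate why the index is switched from `enumerate` to a separate counter, here is a small self-contained sketch in plain Python. Everything in it (`fill_rows`, `toy_encode`, the `ValueError`) is a stand-in invented for the example and not part of DeNovoCNN; the list `rows` plays the role of the preallocated pileup array.

```python
def fill_rows(reads, encode, height):
    """Encode reads into consecutive rows, skipping reads that fail to encode."""
    rows = [None] * height  # stands in for the preallocated pileup array
    count = 0               # number of rows filled so far
    for read in reads:
        if count >= height:
            break
        try:
            rows[count] = encode(read)
        except ValueError:  # hypothetical error type for an unsupported read
            continue        # count is not advanced, so no empty row is left
        count += 1
    return rows, count


def toy_encode(read):
    """Toy encoder that rejects CIGAR strings containing a "P" operation."""
    if "P" in read:
        raise ValueError(f"unsupported CIGAR: {read}")
    return read.lower()


rows, used = fill_rows(["10M", "5M1P5M", "8M2I", "12M"], toy_encode, height=3)
print(rows, used)  # ['10m', '8m2i', '12m'] 3
```

With `enumerate`, the skipped read would still consume index 1, so row 1 would stay empty and the fourth read would be cut off by the height limit.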
I also changed
def start_coverage(self):
    start_coverage_arrays = self.bam_data.count_coverage(self.chromosome, self.start-1, self.start)
    return sum([coverage[0] for coverage in start_coverage_arrays])
to
def start_coverage(self):
    try:
        start_coverage_arrays = self.bam_data.count_coverage(self.chromosome, self.start-1, self.start)
        return sum([coverage[0] for coverage in start_coverage_arrays])
    except Exception:
        # Treat positions where coverage cannot be computed as uncovered
        return 0
but I'm not certain it's necessary. I hit a crash while tinkering that this change fixed, but I then made the change to encode_pileup described above, which may have solved the same issue.
Best, Florian
Dear Gelana,
we are trying to implement DeNovoCNN into our pipeline and are facing some problems. I think I have a working setup, but during execution of the script:
a series of errors occur, the first and I think relevant being:
Our aligner uses the cigar type 6 (BAM_CPAD or "P" in the cigar string) on occasion. The other currently unimplemented types are not used, so for us the problem is only the type 6/"P".
Would it be possible to implement the cigar value 6?
Thanks and best, Florian
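For context on what type 6 means: in the SAM specification, "P" (padding) consumes neither query nor reference bases, so dropping it from a CIGAR leaves both spans unchanged. The sketch below shows this with plain `(operation, length)` tuples in the layout pysam uses for `cigartuples`. The helper names are made up for illustration, and whether simply ignoring "P" is appropriate for DeNovoCNN's read encoding is an assumption, not something confirmed in this thread.

```python
# CIGAR operation letters from the SAM specification; index == BAM code
CIGAR_OPS = "MIDNSHP=X"
BAM_CPAD = CIGAR_OPS.index("P")  # 6

CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def spans(cigartuples):
    """Return (query_length, reference_length) covered by a CIGAR."""
    query = sum(n for op, n in cigartuples if CIGAR_OPS[op] in CONSUMES_QUERY)
    ref = sum(n for op, n in cigartuples if CIGAR_OPS[op] in CONSUMES_REFERENCE)
    return query, ref


def strip_padding(cigartuples):
    """Return the CIGAR without padding ("P") operations."""
    return [(op, n) for op, n in cigartuples if op != BAM_CPAD]


cigar = [(0, 5), (6, 1), (1, 2), (0, 3)]  # 5M1P2I3M
print(spans(cigar))                 # (10, 8)
print(spans(strip_padding(cigar)))  # (10, 8)
```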