PacificBiosciences / pb-CpG-tools

Collection of tools for the analysis of CpG data
BSD 3-Clause Clear License
62 stars 5 forks source link

Supplementary reads not processed - aligned_bam_to_cpg_scores #46

Closed Audald closed 1 year ago

Audald commented 1 year ago

CRAM/BAM files with supplementary reads cannot be processed by aligned_bam_to_cpg_scores (both for version 2.2 and 2.3). The job is immediately stopped with the error thread '<unnamed>' panicked at 'Read sequence is too short for MM tag in record: {qname}'.

The BED and BigWig files are successfully generated if the secondary and supplementary alignments are filtered out from the CRAM/BAM files with -F 2308.

I am reporting this issue in case you want to consider filtering as a user option.

ctsa commented 1 year ago

Thanks @Audald -

I suppose this must be from supplementary reads with hard-clipping? If so, the bam spec doesn't provide much on how hard-clip and methylation tags should interact. By my reading, once hard-clipping is introduced there's no way the MM/ML tags could be interpretable unless they were clipped as well. Can I ask what read mapper and settings you're using for this case?

ctsa commented 1 year ago

More discussion on this topic here:

https://github.com/samtools/hts-specs/issues/646

Looks like to avoid this issue you need to add the -Y flag if you're using minimap2, or use pbmm2. I will update the README to note this.

Audald commented 1 year ago

Thanks @ctsa for looking into this. The hts-specs discussion is very helpful and the suggested flag in minimap2 should be enough. Thanks for indicating it also in the documentation!

ctsa commented 1 year ago

Sounds good. Updated readme:

https://github.com/PacificBiosciences/pb-CpG-tools#input-alignment-file

Closing this issue as explained.