adamewing / tldr

Identify and annotate TE-mediated insertions in long-read sequence data
MIT License
40 stars 4 forks source link

Difficulty recognizing BAM file? #22

Closed malonematt closed 2 years ago

malonematt commented 2 years ago

I was using tldr to interrogate some output BAM files from the methylation caller Megalodon, I'm using the BAM files Megalodon (a methylation caller that uses sequence alignment to improve its methylation calling, or at least that is my understanding) can give as outputs (so these files should include methylation data, which might change how the data is formatted?). I wanted to use these rather than from a typical aligner like minimap2 to preserve the methylation information on them, but was wondering if that was causing my issue. Anyways here is the error output:

Traceback (most recent call last): File "/home1/malonema/.local/bin/tldr", line 2128, in main(args) File "/home1/malonema/.local/bin/tldr", line 1807, in main bam = pysam.AlignmentFile(bams[0]) File "pysam/libcalignmentfile.pyx", line 751, in pysam.libcalignmentfile.AlignmentFile.cinit File "pysam/libcalignmentfile.pyx", line 1000, in pysam.libcalignmentfile.AlignmentFile._open ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False

and here is the script I'm running

tldr -b $bamList -r $reference -o $outBaseName -e none --detail_output --extend_consensus 100 --procs 32

adamewing commented 2 years ago

Yes I've noticed that the .bams output by megalodon don't seem to work directly with tldr despite their being generated internally via minimap2. If you sort and index them it resolves the error but yields no output. I'll investigate further.

adamewing commented 2 years ago

So I've tracked this down, the megalodon docs here: https://nanoporetech.github.io/megalodon/file_formats.html#modified-base-mapping, the "reads" in mod_mappings.bam aren't actually the nanopore reads but are sequences from the reference genome. This is a bit unfortunate as it makes mod_mappings.bam unsuitable for most downstream applications dealing with sequence variation (variant detection, phasing, etc). The other bam file, mappings.bam may be suitable for TLDR but I'm not sure as it seems to involve mappy, which to my understanding doesn't output soft-clipped reads. You're probably better off re-aligning the .fastq output by megalodon with minimap2 and using that output for TLDR.

adamewing commented 2 years ago

closing due to likely resolution