ft extract and pysam return different molecular/reference coordinates for the same read

MariosEft97 commented 5 months ago

Hello! Thank you for developing such an amazing tool!

I noticed that for the same reads, the molecular/reference coordinates for nucleosomes and MSPs in the ft extract bed are different from the coordinates returned when retrieving the tags with pysam.

Environment:

OS: Rocky Linux release 8.9 (Green Obsidian)
fibertools-rs: 0.4.2
pysam: 0.22.0

Commands used:

!samtools view -b -h --threads $cores $pacbio_bam chr1:226672641-226673664 > view.bam

!ft add-nucleosomes --threads $cores view.bam nuc_msp.bam

!ft fire --threads $cores nuc_msp.bam nuc_msp_fire.bam

!samtools index -b nuc_msp_fire.bam

!ft extract --simplify --threads $cores nuc_msp_fire.bam --all ft_tag.bed.gz

tag_bed_df = pd.read_csv(
    "ft_tag.bed.gz",
    sep='\t',
    usecols=['#ct', 'st', 'en', 'fiber', 'nuc_starts', 'nuc_lengths', 'ref_nuc_starts', 'ref_nuc_lengths', 'msp_starts', 'msp_lengths', 'ref_msp_starts', 'ref_msp_lengths', 'fire']
)

print(f"\nNUC molecular coordinates: {tag_bed_df[tag_bed_df['fiber'] == 'm64279e_230927_104521/147325792/ccs']['nuc_starts'].item()}")
print(f"\nNUC reference coordinates: {tag_bed_df[tag_bed_df['fiber'] == 'm64279e_230927_104521/147325792/ccs']['ref_nuc_starts'].item()}")
print(f"\nMSP molecular coordinates: {tag_bed_df[tag_bed_df['fiber'] == 'm64279e_230927_104521/147325792/ccs']['msp_starts'].item()}")
print(f"\nMSP reference coordinates: {tag_bed_df[tag_bed_df['fiber'] == 'm64279e_230927_104521/147325792/ccs']['ref_msp_starts'].item()}")

import pysam
import numpy as np

bamfile = pysam.AlignmentFile('nuc_msp_fire.bam', 'rb')

for index, read in enumerate(bamfile.fetch(region='chr1:226672641-226673664')):

    if read.query_name == 'm64279e_230927_104521/147325792/ccs':
        print(f'read start: {read.query_alignment_start}')
        print(f'read end: {read.query_alignment_end}')
        print(f'reference start: {read.reference_start}')
        print(f'reference end: {read.reference_end}')
        if read.has_tag('ns'):
            print(f"\nNUC molecular coordinates: {list(np.array(read.get_tag('ns')))}")
            print(f"\nNUC reference coordinates: {list(np.array(read.get_tag('ns')) + read.reference_start)}")

        if read.has_tag('as'):
            print(f"\nMSP molecular coordinates: {list(np.array(read.get_tag('as')))}")
            print(f"\nMSP reference coordinates: {list(np.array(read.get_tag('as')) + read.reference_start)}")

Screenshots:

view.bam can be downloaded with:

wget -O view.bam "https://drive.google.com/uc?export=download&id=1h3Zn_neSSnKGJKsZXTLVig8JrgpAFOJs"

mrvollger commented 5 months ago

Re the reference coordinates, the following logic is oversimplified for a couple of reasons

list(np.array(read.get_tag('ns')) + read.reference_start

but primarily because a read can align with insertions and deletions which creates an inconsistent offset which ft accounts for.

Re the molecular coordinates ft extract --all reports the positions of m6A methylation relative to the sequence reported in ft extract --all, and this sequence will be reverse complemented when aligned to the reverse strand so as to have everything in reference orientation. However, the tags contain the positions of the methylations in the original read with no consideration to reference (really, the only valid way to store them that can survive realignment). So, I'd expect a difference on reverse-strand-aligned reads and no difference on forward-aligned reads.

If you want a python interface with the results of fibertools I'd recommend: https://py-ft.readthedocs.io/en/latest/index.html though it is still very early days.

mrvollger commented 5 months ago

Please reopen if you find something that is not addressed here.

fiberseq / fibertools-rs

ft extract and pysam return different molecular/reference coordinates for the same read #53