fiberseq / fibertools-rs

Tools for fiberseq data written in rust.
https://fiberseq.github.io/fibertools/fibertools.html
42 stars 5 forks source link

py-ft fiber.cpg and ft extract --cpg producing different number of blocks #34

Closed StephanieBohaczuk closed 9 months ago

StephanieBohaczuk commented 9 months ago

I noticed that I'm getting a lot more cpgs on the same read with py-ft fiber.cpg.reference_starts (version 0.1.10) as compared to ft extract --reference --cpg (version 0.3.6). Any idea what could be happening?

For example,


ft extract --reference --cpg - \
/mmfs1/gscratch/stergachislab/bohaczuk/analysis/hbg_lieber/ft_reruns/hg38/split_by_edit/hg38_edited2x_55_m6afil.bam | \
head -n 2| cut -f 4,10

m54329U_221022_135447/43450907/ccs      86
m54329U_221022_135447/88932698/ccs      59

#but in python:
edited_bam_f="/mmfs1/gscratch/stergachislab/bohaczuk/analysis/hbg_lieber/ft_reruns/hg38/split_by_edit/hg38_edited2x_55_m6afil.bam"
edited_fiberbam=pyft.Fiberbam(edited_bam_f)
for fiber in edited_fiberbam.fetch ("chr11", 5249905, 5249911):
        print (fiber.qname + '\t' + str(len(fiber.cpg.reference_starts)))

m54329U_221022_135447/43450907/ccs      126
m54329U_221022_135447/88932698/ccs      97
mrvollger commented 9 months ago

In the python list I expect that you will find that many of the values are None. That's because the same index in starts and reference_starts reference the same base mod but sometimes there isn't a lift over so that value has to be none to keep the index matched.

StephanieBohaczuk commented 9 months ago

I modified my code to check for that and the numbers are still puzzling to me. Maybe I'm missing something?

It does look like all of these coordinates point to C or G in the reference, so that's good.


from pyfaidx import Fasta

hg38_fasta="/mmfs1/gscratch/stergachislab/assemblies/hg38.analysisSet.fa"
edited_bam_f="/mmfs1/gscratch/stergachislab/bohaczuk/analysis/hbg_lieber/ft_reruns/hg38/split_by_edit/hg38_edited2x_55_m6afil.bam"

ref=Fasta(hg38_fasta)

edited_fiberbam=pyft.Fiberbam(edited_bam_f)

for fiber in edited_fiberbam.fetch ("chr11", 5249905, 5249911):
    # Count the number of CpG sites that are not None
    num_counts=0
    cpg_labeled_base_genome=''
    for cpg in fiber.cpg.reference_starts:
        if type(cpg)==int:
            num_counts+=1
            # Get the base identity at the CpG coordinate from hg38
            base=ref["chr11"][cpg:cpg+1]
            cpg_labeled_base_genome += str(base)

    # Get the base identity at the CpG coordinate from the read
    cpg_liftref=fiber.lift_query_positions([x for x in fiber.cpg.reference_starts if type(x)==int])
    cpg_labeled_base_read=""
    for coordinate in cpg_liftref:
        cpg_labeled_base_read += fiber.seq[coordinate]

    print ("all_cpg_count:" + fiber.qname + '\t' + str(len(fiber.cpg.reference_starts)))
    print("int_cpg_count:" + fiber.qname + "\t" + str(num_counts))
    print(fiber.cpg.reference_starts)
    print("genome_bases: ", cpg_labeled_base_genome)
    print ("read_bases: ", cpg_labeled_base_read)

#Output:
all_cpg_count:m54329U_221022_135447/43450907/ccs        126
int_cpg_count:m54329U_221022_135447/43450907/ccs        126
[5234010, 5234372, 5234493, 5234695, 5235238, 5235616, 5235913, 5235957, 5236017, 5236022, 5236228, 5236413, 5236426, 5236468, 5237025, 5237034, 5237104, 5237108, 5237198, 5237236, 5237780, 5238043, 5238071, 5238103, 5238222, 5238706, $238787, 5238981, 5240035, 5240320, 5240547, 5240710, 5240860, 5240921, 5241029, 5241480, 5241759, 5241803, 5241897, 5242201, 5242234, 5242505, 5242640, 5242665, 5242776, 5243083, 5243125, 5243183, 5243203, 5243382, 5243432, 5243777, 52$3914, 5243924, 5243984, 5244035, 5244196, 5244228, 5244230, 5244268, 5244292, 5244316, 5244402, 5244409, 5244456, 5244458, 5244462, 5244754, 5244761, 5244889, 5244907, 5244916, 5244961, 5245294, 5245482, 5245565, 5246212, 5246236, 5246$29, 5246493, 5246652, 5246659, 5246689, 5246774, 5247009, 5247079, 5247252, 5247286, 5247309, 5247491, 5248036, 5248077, 5248444, 5248462, 5248734, 5248750, 5248772, 5248774, 5248776, 5248778, 5248780, 5249027, 5249186, 5249325, 524980$, 5249839, 5249850, 5249905, 5249908, 5249969, 5250017, 5250107, 5250201, 5250324, 5250475, 5250537, 5250677, 5250808, 5250882, 5250923, 5251089, 5251289, 5251432, 5251451, 5251545, 5251618]
genome_bases:  CCCCCccccccCCCctcctcCcccccccCCCCCCCCCCCCCCCCCCCCCCCCCCCCcccccccccccCCCCCCCCCccccccccCCccccCCCCCccccccCCCCCCCCTCCCCCCCCCTCcccCc
read_bases:  CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
all_cpg_count:m54329U_221022_135447/88932698/ccs        97
int_cpg_count:m54329U_221022_135447/88932698/ccs        97
[5237780, 5238043, 5238071, 5238103, 5238222, 5238706, 5238787, 5238981, 5240035, 5240320, 5240547, 5240710, 5240860, 5240921, 5241029, 5241480, 5241759, 5241803, 5241897, 5242201, 5242234, 5242505, 5242640, 5242665, 5242776, 5243083, $243125, 5243183, 5243203, 5243382, 5243432, 5243777, 5243914, 5243924, 5243984, 5244035, 5244196, 5244228, 5244230, 5244268, 5244292, 5244316, 5244402, 5244409, 5244456, 5244458, 5244462, 5244754, 5244761, 5244889, 5244907, 5244916, 52$4961, 5245294, 5245482, 5245565, 5246212, 5246236, 5246329, 5246493, 5246652, 5246659, 5246689, 5246774, 5247009, 5247079, 5247252, 5247286, 5247309, 5247491, 5248036, 5248077, 5248444, 5248462, 5248734, 5248750, 5248772, 5248774, 5248776, 5248778, 5248780, 5249027, 5249186, 5249325, 5249806, 5249839, 5249850, 5249905, 5249908, 5249969, 5250017, 5250107, 5250201, 5250324, 5250475, 5250537, 5250677]
genome_bases:  CcccccccCCCCCCCCCCCCCCCCCCCCCCCCCCCCcccccccccccCCCCCCCCCccccccccCCccccCCCCCccccccCCCCCCCCTCCCCCCC
read_bases:  CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
mrvollger commented 9 months ago

I checked and confirmed multiple ways your ft extract calculations. So something may be wrong with pyft...

Can you do another Python check for me? I am wondering if maybe I am encoding those as '-1's and not Nones.

mrvollger commented 9 months ago

Actually, I can do it, you gave me all the paths

mrvollger commented 9 months ago

I figured it out, when running ft extract there is a default ML threshold of 125

image

So If I set that threshold to zero then I get the right number.

image

I need to think about how to handle this to be more consistent. But I believe the same quality scores are available to you to filter on in pyft now?

mrvollger commented 9 months ago

Nevermind, the qual scores are not exposed in pyft... This is an issue I need to fix. Thanks for the report.

StephanieBohaczuk commented 9 months ago

Thanks for looking into it! Let me know if I can run anything to help.

mrvollger commented 9 months ago

Oops, I missed it initially, but it is there to use: https://py-ft.readthedocs.io/en/latest/#pyft.Basemods There is an fiber.cpg.ml that shares the index of fiber.cpg.reference_starts.