Closed StephanieBohaczuk closed 9 months ago
In the python list I expect that you will find that many of the values are None. That's because the same index in starts and reference_starts reference the same base mod but sometimes there isn't a lift over so that value has to be none to keep the index matched.
I modified my code to check for that and the numbers are still puzzling to me. Maybe I'm missing something?
It does look like all of these coordinates point to C or G in the reference, so that's good.
from pyfaidx import Fasta
hg38_fasta="/mmfs1/gscratch/stergachislab/assemblies/hg38.analysisSet.fa"
edited_bam_f="/mmfs1/gscratch/stergachislab/bohaczuk/analysis/hbg_lieber/ft_reruns/hg38/split_by_edit/hg38_edited2x_55_m6afil.bam"
ref=Fasta(hg38_fasta)
edited_fiberbam=pyft.Fiberbam(edited_bam_f)
for fiber in edited_fiberbam.fetch ("chr11", 5249905, 5249911):
# Count the number of CpG sites that are not None
num_counts=0
cpg_labeled_base_genome=''
for cpg in fiber.cpg.reference_starts:
if type(cpg)==int:
num_counts+=1
# Get the base identity at the CpG coordinate from hg38
base=ref["chr11"][cpg:cpg+1]
cpg_labeled_base_genome += str(base)
# Get the base identity at the CpG coordinate from the read
cpg_liftref=fiber.lift_query_positions([x for x in fiber.cpg.reference_starts if type(x)==int])
cpg_labeled_base_read=""
for coordinate in cpg_liftref:
cpg_labeled_base_read += fiber.seq[coordinate]
print ("all_cpg_count:" + fiber.qname + '\t' + str(len(fiber.cpg.reference_starts)))
print("int_cpg_count:" + fiber.qname + "\t" + str(num_counts))
print(fiber.cpg.reference_starts)
print("genome_bases: ", cpg_labeled_base_genome)
print ("read_bases: ", cpg_labeled_base_read)
#Output:
all_cpg_count:m54329U_221022_135447/43450907/ccs 126
int_cpg_count:m54329U_221022_135447/43450907/ccs 126
[5234010, 5234372, 5234493, 5234695, 5235238, 5235616, 5235913, 5235957, 5236017, 5236022, 5236228, 5236413, 5236426, 5236468, 5237025, 5237034, 5237104, 5237108, 5237198, 5237236, 5237780, 5238043, 5238071, 5238103, 5238222, 5238706, $238787, 5238981, 5240035, 5240320, 5240547, 5240710, 5240860, 5240921, 5241029, 5241480, 5241759, 5241803, 5241897, 5242201, 5242234, 5242505, 5242640, 5242665, 5242776, 5243083, 5243125, 5243183, 5243203, 5243382, 5243432, 5243777, 52$3914, 5243924, 5243984, 5244035, 5244196, 5244228, 5244230, 5244268, 5244292, 5244316, 5244402, 5244409, 5244456, 5244458, 5244462, 5244754, 5244761, 5244889, 5244907, 5244916, 5244961, 5245294, 5245482, 5245565, 5246212, 5246236, 5246$29, 5246493, 5246652, 5246659, 5246689, 5246774, 5247009, 5247079, 5247252, 5247286, 5247309, 5247491, 5248036, 5248077, 5248444, 5248462, 5248734, 5248750, 5248772, 5248774, 5248776, 5248778, 5248780, 5249027, 5249186, 5249325, 524980$, 5249839, 5249850, 5249905, 5249908, 5249969, 5250017, 5250107, 5250201, 5250324, 5250475, 5250537, 5250677, 5250808, 5250882, 5250923, 5251089, 5251289, 5251432, 5251451, 5251545, 5251618]
genome_bases: CCCCCccccccCCCctcctcCcccccccCCCCCCCCCCCCCCCCCCCCCCCCCCCCcccccccccccCCCCCCCCCccccccccCCccccCCCCCccccccCCCCCCCCTCCCCCCCCCTCcccCc
read_bases: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
all_cpg_count:m54329U_221022_135447/88932698/ccs 97
int_cpg_count:m54329U_221022_135447/88932698/ccs 97
[5237780, 5238043, 5238071, 5238103, 5238222, 5238706, 5238787, 5238981, 5240035, 5240320, 5240547, 5240710, 5240860, 5240921, 5241029, 5241480, 5241759, 5241803, 5241897, 5242201, 5242234, 5242505, 5242640, 5242665, 5242776, 5243083, $243125, 5243183, 5243203, 5243382, 5243432, 5243777, 5243914, 5243924, 5243984, 5244035, 5244196, 5244228, 5244230, 5244268, 5244292, 5244316, 5244402, 5244409, 5244456, 5244458, 5244462, 5244754, 5244761, 5244889, 5244907, 5244916, 52$4961, 5245294, 5245482, 5245565, 5246212, 5246236, 5246329, 5246493, 5246652, 5246659, 5246689, 5246774, 5247009, 5247079, 5247252, 5247286, 5247309, 5247491, 5248036, 5248077, 5248444, 5248462, 5248734, 5248750, 5248772, 5248774, 5248776, 5248778, 5248780, 5249027, 5249186, 5249325, 5249806, 5249839, 5249850, 5249905, 5249908, 5249969, 5250017, 5250107, 5250201, 5250324, 5250475, 5250537, 5250677]
genome_bases: CcccccccCCCCCCCCCCCCCCCCCCCCCCCCCCCCcccccccccccCCCCCCCCCccccccccCCccccCCCCCccccccCCCCCCCCTCCCCCCC
read_bases: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
I checked and confirmed multiple ways your ft extract
calculations. So something may be wrong with pyft...
Can you do another Python check for me? I am wondering if maybe I am encoding those as '-1's and not None
s.
Actually, I can do it, you gave me all the paths
I figured it out, when running ft extract
there is a default ML threshold of 125
So If I set that threshold to zero then I get the right number.
I need to think about how to handle this to be more consistent. But I believe the same quality scores are available to you to filter on in pyft now?
Nevermind, the qual scores are not exposed in pyft... This is an issue I need to fix. Thanks for the report.
Thanks for looking into it! Let me know if I can run anything to help.
Oops, I missed it initially, but it is there to use:
https://py-ft.readthedocs.io/en/latest/#pyft.Basemods
There is an fiber.cpg.ml
that shares the index of fiber.cpg.reference_starts
.
I noticed that I'm getting a lot more cpgs on the same read with py-ft fiber.cpg.reference_starts (version 0.1.10) as compared to ft extract --reference --cpg (version 0.3.6). Any idea what could be happening?
For example,