Closed haotianteng closed 1 year ago
Hi, you're supposed to get more than 213 mapped sites. Did you map the transcript coordinates after running dataprep? How many sites did you get from the transcript-mapped reads alone?
I got >50K records in data.result.csv, then used the following code from a previous issue to map the transcriptomic coordinates to genomic coordinates:
def t2g(tx_id, fasta_dict, gtf_dict):
    """Map each transcript position of tx_id to (chromosome, gene_id, genomic_position)."""
    t2g_dict = {}
    if tx_id not in fasta_dict:
        return t2g_dict  # empty dict, so every call returns the same type
    tx_seq = fasta_dict[tx_id]
    tx_contig = gtf_dict[tx_id]['chr']
    g_id = gtf_dict[tx_id]['g_id']
    if tx_seq is None:
        return t2g_dict
    for exon_num in range(len(gtf_dict[tx_id]['exon'])):
        g_interval = gtf_dict[tx_id]['exon'][exon_num]
        tx_interval = gtf_dict[tx_id]['tx_exon'][exon_num]
        # Walk every base of the exon, both ends included.
        for g_pos in range(g_interval[0], g_interval[1] + 1):
            dis_from_start = g_pos - g_interval[0]
            if gtf_dict[tx_id]['strand'] == "+":
                tx_pos = tx_interval[0] + dis_from_start
            elif gtf_dict[tx_id]['strand'] == "-":
                tx_pos = tx_interval[1] - dis_from_start
            # The k-mer is computed but never stored; the rims of exons get a placeholder.
            if (g_interval[0] <= g_pos < g_interval[0]+2) or (g_interval[1]-2 < g_pos <= g_interval[1]):
                kmer = 'XXXXX'
            else:
                kmer = tx_seq[tx_pos-2:tx_pos+3]
            t2g_dict[tx_pos] = (tx_contig, g_id, g_pos)  # tx_contig is the chromosome.
    return t2g_dict
def readFasta(transcript_fasta_paths_or_urls):
    # Read the whole file and split it into one chunk per FASTA record.
    with open(transcript_fasta_paths_or_urls, "r") as fasta:
        entries = fasta.read().split(">")
    seq_dict = {}
    for entry in entries:
        entry = entry.split("\n")
        if len(entry[0].split()) > 0:
            # Keep the ID before the first space and drop its version suffix.
            tx_id = entry[0].split()[0].split(".")[0]
            seq = "".join(entry[1:])
            seq_dict[tx_id] = seq
    return seq_dict
def readGTF(gtf_path_or_url):
    dict={}
    gene_transcript_dict={}
    with open(gtf_path_or_url,"r") as gtf:
        for ln in gtf:
            if not ln.startswith("#"):
                ln=ln.strip("\n").split("\t")
                if ln[2] == "transcript" or ln[2] in ("exon", "start_codon", "stop_codon", "CDS",
                                                      "three_prime_utr", "five_prime_utr"):
                    chr,type,start,end=ln[0],ln[2],int(ln[3]),int(ln[4])
                    # Parse the attribute column into key/value pairs.
                    attrList=ln[-1].split(";")
                    attrDict={}
                    for k in attrList:
                        p=k.strip().split(" ")
                        if len(p) == 2:
                            attrDict[p[0]]=p[1].strip('\"')
                    tx_id = attrDict["transcript_id"]
                    g_id = attrDict["gene_id"]
                    # Note: this keeps only the last transcript seen for each gene.
                    gene_transcript_dict[g_id] = tx_id
                    if tx_id not in dict:
                        dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
                        if type not in dict[tx_id]:
                            if type == "transcript":
                                dict[tx_id][type]=(start,end)
                    else:
                        if type == 'CDS':
                            info = (start, end, int(attrDict['exon_number']))
                        else:
                            info = (start, end)
                        if type not in dict[tx_id]:
                            dict[tx_id][type]=[info]
                        else:
                            dict[tx_id][type].append(info)
    #convert genomic positions to tx positions
    for id in dict:
        tx_pos,tx_start=[],0
        for pair in dict[id]["exon"]:
            tx_end=pair[1]-pair[0]+tx_start
            tx_pos.append((tx_start,tx_end))
            tx_start=tx_end+1
        dict[id]['tx_exon']=tx_pos
    return dict,gene_transcript_dict
if __name__ == "__main__":
    import os
    import pandas as pd
    import pickle
    SCRATCH = os.environ['SCRATCH']
    ref_folder = os.path.join(SCRATCH, 'NA12878_RNA_IVT/GRCh38_transcript_ensembel')
    print("Reading reference fasta files...")
    fasta = readFasta(os.path.join(ref_folder,'Homo_sapiens.GRCh38.cds.all.fa'))
    print("Reading reference gtf files...")
    gtf,id_map = readGTF(os.path.join(ref_folder,'Homo_sapiens.GRCh38.109.gtf'))
    print("Reading transcript names...")
    df = pd.read_csv(f"{SCRATCH}/Xron_Project/m6A_site_m6Anet_DRACH_HEK293T.csv")
    gene_ids = df['gene_id'].unique()
    tx_names = [id_map[gene_id] for gene_id in gene_ids if gene_id in id_map]
    print("Creating t2g dictionary...")
    t2g_dict = {tx:t2g(tx, fasta, gtf) for tx in tx_names}
    print("Saving t2g dictionary...")
    with open(f"{SCRATCH}/Xron_Project/Benchmark/HEK293T/t2g_dict.pkl", "wb+") as f:
        pickle.dump(t2g_dict, f)
Then I compared the genomic coordinates with this file: m6A_site_m6Anet_DRACH_HEK293T_subset500genes.csv
This is a subset file containing 500 genes; after comparing and keeping only the sites with matching genomic coordinates, I only get 213 sites.
m6A_site_m6Anet_DRACH_HEK293T.csv: this is the HEK293T record file from your paper.
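For readers following along, the comparison step described above can be sketched roughly as follows. This is only an illustration, not the code actually used in this thread: the function names `map_to_genome` and `keep_labelled` are hypothetical, and it assumes that `t2g_dict` has the shape `{transcript_id: {transcript_position: (chromosome, gene_id, genomic_position)}}` and that the label file can be reduced to a set of `(gene_id, genomic_position)` pairs.

```python
def map_to_genome(pred_rows, t2g_dict):
    """Translate transcript-level predictions to genomic coordinates.

    pred_rows: iterable of (transcript_id, transcript_position, probability).
    t2g_dict:  assumed shape {transcript_id: {transcript_position:
               (chromosome, gene_id, genomic_position)}}.
    Returns (mapped, unmapped) so that dropped sites can be counted.
    """
    mapped, unmapped = [], []
    for tx_id, tx_pos, prob in pred_rows:
        positions = t2g_dict.get(tx_id)
        # Treat anything that is not a dict (e.g. an empty placeholder) as unmapped.
        hit = positions.get(tx_pos) if isinstance(positions, dict) else None
        if hit is None:
            unmapped.append((tx_id, tx_pos))
        else:
            _chrom, gene_id, g_pos = hit
            mapped.append((gene_id, g_pos, prob))
    return mapped, unmapped


def keep_labelled(mapped, labelled_sites):
    """Keep only mapped sites whose (gene_id, genomic_position) has a label."""
    return [row for row in mapped if (row[0], row[1]) in labelled_sites]


# Toy data, made up purely for illustration.
toy_t2g = {"ENST_A": {0: ("1", "ENSG_A", 100), 1: ("1", "ENSG_A", 101)}}
toy_preds = [("ENST_A", 1, 0.9), ("ENST_A", 7, 0.2), ("ENST_B", 0, 0.4)]
toy_mapped, toy_unmapped = map_to_genome(toy_preds, toy_t2g)
toy_kept = keep_labelled(toy_mapped, {("ENSG_A", 101)})
```

Counting the `unmapped` list separately from the sites that map but have no label helps show whether sites are being lost at the coordinate conversion or at the comparison with the label file.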
Hi @haotianteng!
Did you start from the raw fast5 data and run Guppy, or did you download their fastq? Could you please share how you generated your data? I also tried to reproduce the m6anet data, but my result is very different from their supplementary data. It looks like your data is much closer to theirs. I only got ~3k sites in my raw data.
Thank you!
Hi @haotianteng, I tried to map your dataset using the reference that I used for the HEK293T record, but I could not because some of the transcript positions are not in the t2g_dict that I created. Have you tried using the reference from the SG-NEx GitHub? Try using Homo_sapiens.GRCh38.cdna.ncrna.fa and Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf.
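A quick way to see why transcript positions are missing from a t2g_dict is to tally the reasons. The sketch below is an assumption-laden illustration, not part of m6anet: `explain_missing` is a hypothetical helper, and it assumes the same `{transcript_id: {transcript_position: ...}}` shape as the script earlier in this thread. It distinguishes transcripts that are absent from the reference, IDs that differ only by an Ensembl-style version suffix, and positions that fall outside the annotated exons.

```python
from collections import Counter


def explain_missing(sites, t2g_dict):
    """Tally why (transcript_id, transcript_position) pairs fail to map.

    sites:    iterable of (transcript_id, transcript_position).
    t2g_dict: assumed shape {transcript_id: {transcript_position: ...}}.
    """
    reasons = Counter()
    for tx_id, tx_pos in sites:
        positions = t2g_dict.get(tx_id)
        if positions is None:
            stripped = tx_id.split(".")[0]
            if stripped != tx_id and stripped in t2g_dict:
                # Same transcript, but the prediction carries a version suffix.
                reasons["versioned_id"] += 1
            else:
                reasons["transcript_absent"] += 1
        elif not isinstance(positions, dict) or tx_pos not in positions:
            # Transcript is known, but this position is not covered.
            reasons["position_absent"] += 1
        else:
            reasons["mapped"] += 1
    return reasons


# Toy data, made up purely for illustration.
toy_t2g = {"ENST_A": {0: ("1", "ENSG_A", 100)}}
toy_sites = [("ENST_A", 0), ("ENST_A", 50), ("ENST_A.2", 0), ("ENST_B", 0)]
toy_reasons = explain_missing(toy_sites, toy_t2g)
```

A large `transcript_absent` or `position_absent` count would be consistent with the predictions and the t2g_dict having been built from different reference files.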
Thanks for the reply! My reference genome was downloaded from Ensembl; okay, I can try again with those reference files. Would you be able to share your t2g_dict with me, maybe by exporting it to a TOML file?
@kwonej0617 I downloaded the fastq from the repository https://www.ebi.ac.uk/ena/browser/view/PRJEB40872 and used the wild-type dataset, rep1. Basecalling was done using Guppy with the high-accuracy configuration, followed by alignment with minimap2.
Hi @chrishendra93, I only found Homo_sapiens.GRCh38.91.gtf in the SG-NEx GitHub. Is that the correct GTF file to use?
Hi @haotianteng, yeah, I used the BAM file from SG-NEx, so it should be mapped using that GTF. Personally I used https://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf.gz but I think the extra annotations in that file should not affect it.
I got 0.812 AUC-ROC with 94,818 entries; I guess that's okay? It's still slightly lower than the reported 0.834 AUC-ROC, though.
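For context on the number being compared, AUC-ROC can be computed from binary labels and site-level probabilities as the chance that a randomly chosen positive site scores higher than a randomly chosen negative one. The following is a minimal rank-based sketch (the Mann-Whitney formulation with average ranks for ties); the function name `auc_roc` is hypothetical, and this is not necessarily how the value in the paper or in this thread was computed.

```python
def auc_roc(labels, scores):
    """Rank-based AUC-ROC for binary labels (1 = modified, 0 = unmodified)."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at index i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied scores share the mean of their 1-based ranks (i+1 .. j).
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    # Subtract the smallest possible rank sum, then normalise by the number of pairs.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy data, made up purely for illustration.
toy_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because the metric depends on which sites are included, two evaluations only become comparable once they use the same set of labelled genomic positions.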
Hi @haotianteng, that's better than before. Did you try averaging the probability modified per genomic position as well?
I just used the data.result output, which I think is already site-level probability data. Can you elaborate on how to average the probability?
Hi @haotianteng, the probability is site-level, but it is at the transcriptomic level. The label data is at the genomic level, which means that multiple transcript positions that map to the same genomic position will be assigned the same label. In the paper, we group those positions, i.e., we group on the gene ID and genomic position and take the average of the probability.
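The grouping described above can be sketched as follows. This is an illustrative reading of that description rather than the authors' actual code: the helper name `average_per_genomic_site` is hypothetical, and the input is assumed to be transcript-level predictions that have already been converted to `(gene_id, genomic_position, probability)` rows.

```python
from collections import defaultdict


def average_per_genomic_site(mapped_rows):
    """Average the modification probability per (gene_id, genomic_position).

    mapped_rows: iterable of (gene_id, genomic_position, probability), one row
                 per transcript position that maps to that genomic site.
    """
    grouped = defaultdict(list)
    for gene_id, g_pos, prob in mapped_rows:
        grouped[(gene_id, g_pos)].append(prob)
    return {site: sum(probs) / len(probs) for site, probs in grouped.items()}


# Toy data, made up purely for illustration: two isoforms of the same gene
# cover genomic position 101, so their probabilities are averaged.
toy_rows = [
    ("ENSG_A", 101, 0.25),
    ("ENSG_A", 101, 0.75),
    ("ENSG_A", 250, 0.10),
]
toy_site_probs = average_per_genomic_site(toy_rows)
```

With the rows held in a pandas DataFrame instead, the same step would be a `groupby` on the gene ID and genomic position columns followed by `mean()` on the probability column; the exact column names depend on how the mapped table was built.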
Dear m6anet authors, I am having difficulty reproducing the HEK293T result from Figure 1D. I ran m6anet on the HEK293T-WT-rep1 data and mapped the transcript coordinates to genomic coordinates, but was only able to get 213 mapped sites, and I can't get the same AUC-ROC from these 213 sites. What part could be wrong? I am using the GRCh38 human reference genome from Ensembl.
Best, Teng