Reproducing HEK293T result

haotianteng commented 1 year ago

Dear m6anet author, I have difficulty reproducing the HEK293T result from Figure 1D, I run the m6anet on HEK293T-WT-rep1 data, and map the transcript coordinate to the genomic coordinate, but was only able to get 213 mapped sites. And I can't get the same AUCROC from these 213 sites. What part could be wrong? I am using the GRCh38 human reference genome from Ensembl.

Best, Teng

chrishendra93 commented 1 year ago

Hi, you're supposed to get more than 213 mapped sites Did you map the transcript coordinate after running dataprep? How many sites did you get from the transcript-mapped reads alone?

haotianteng commented 1 year ago

I got >50K records in data.result.csv, then I used the following code from previous issue to map the transcriptomic coordinate to the genomics coordinate:

def t2g(tx_id, fasta_dict, gtf_dict):
    t2g_dict = {}
    if tx_id not in fasta_dict.keys():
        return [], []
    tx_seq = fasta_dict[tx_id]
    tx_contig = gtf_dict[tx_id]['chr']
    g_id = gtf_dict[tx_id]['g_id']
    if tx_seq is None:
        return [], []

    for exon_num in range(len(gtf_dict[tx_id]['exon'])):
        g_interval = gtf_dict[tx_id]['exon'][exon_num]
        tx_interval = gtf_dict[tx_id]['tx_exon'][exon_num]
        for g_pos in range(g_interval[0], g_interval[1] + 1): # Exclude the rims of exons.
            dis_from_start = g_pos - g_interval[0]
            if gtf_dict[tx_id]['strand'] == "+":
                tx_pos = tx_interval[0] + dis_from_start
            elif gtf_dict[tx_id]['strand'] == "-":
                tx_pos = tx_interval[1] - dis_from_start
            if (g_interval[0] <= g_pos < g_interval[0]+2) or (g_interval[1]-2 < g_pos <= g_interval[1]): 
                kmer = 'XXXXX'
            else:
                kmer = tx_seq[tx_pos-2:tx_pos+3]
            t2g_dict[tx_pos] = (tx_contig, g_id, g_pos) # tx.contig is chromosome.
    return t2g_dict

def readFasta(transcript_fasta_paths_or_urls):
    fasta=open(transcript_fasta_paths_or_urls,"r")
    entries=""
    for ln in fasta:
        entries+=ln
    entries=entries.split(">")
    dict={}
    for entry in entries:
        entry=entry.split("\n")

        if len(entry[0].split()) > 0:
            id=entry[0].split()[0].split(".")[0]
            seq="".join(entry[1:])
            dict[id]=seq
    return dict

def readGTF(gtf_path_or_url):
    gtf=open(gtf_path_or_url,"r")
    dict={}
    gene_transcript_dict={}
    for ln in gtf:
        if not ln.startswith("#"):
            ln=ln.strip("\n").split("\t")
            if ln[2] == "transcript" or ln[2] in ("exon", "start_codon", "stop_codon", "CDS", 
                                                  "three_prime_utr", "five_prime_utr"):
                chr,type,start,end=ln[0],ln[2],int(ln[3]),int(ln[4])
                attrList=ln[-1].split(";")
                attrDict={}
                for k in attrList:
                    p=k.strip().split(" ")
                    if len(p) == 2:
                        attrDict[p[0]]=p[1].strip('\"')
                tx_id = attrDict["transcript_id"]
                g_id = attrDict["gene_id"]
                gene_transcript_dict[g_id] = tx_id
                if tx_id not in dict:
                    dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
                    if type not in dict[tx_id]:
                        if type == "transcript":
                            dict[tx_id][type]=(start,end)
                else:
                    if type == 'CDS':
                        info = (start, end, int(attrDict['exon_number']))
                    else:
                        info = (start, end)
                    if type not in dict[tx_id]:
                        dict[tx_id][type]=[info]
                    else:
                        dict[tx_id][type].append(info)

    #convert genomic positions to tx positions
    for id in dict:
        tx_pos,tx_start=[],0
        for pair in dict[id]["exon"]:
            tx_end=pair[1]-pair[0]+tx_start
            tx_pos.append((tx_start,tx_end))
            tx_start=tx_end+1
        dict[id]['tx_exon']=tx_pos
    return dict,gene_transcript_dict

if __name__ == "__main__":
    import os
    import pandas as pd
    import pickle
    SCRATCH = os.environ['SCRATCH']
    ref_folder = os.path.join(SCRATCH, 'NA12878_RNA_IVT/GRCh38_transcript_ensembel')
    print("Reading reference fasta files...")
    fasta = readFasta(os.path.join(ref_folder,'Homo_sapiens.GRCh38.cds.all.fa'))
    print("Reading reference gtf files...")
    gtf,id_map = readGTF(os.path.join(ref_folder,'Homo_sapiens.GRCh38.109.gtf'))
    print("Reading transcript names...")
    df = pd.read_csv(f"{SCRATCH}/Xron_Project/m6A_site_m6Anet_DRACH_HEK293T.csv")
    gene_ids = df['gene_id'].unique()
    tx_names = [id_map[gene_id] for gene_id in gene_ids if gene_id in id_map]
    print("Creating t2g dictionary...")
    t2g_dict = {tx:t2g(tx, fasta, gtf) for tx in tx_names}
    print("Saving t2g dictionary...")
    with open(f"{SCRATCH}/Xron_Project/Benchmark/HEK293T/t2g_dict.pkl", "wb+") as f:
        pickle.dump(t2g_dict, f)

data.result.csv

Then comparing the genomics coordinate with this file: m6A_site_m6Anet_DRACH_HEK293T_subset500genes.csv

This is a subset file containing 500 genes; after comparing and selecting the same genomics coordinate, I only get 213 sites.

haotianteng commented 1 year ago

m6A_site_m6Anet_DRACH_HEK293T.csv This is the HEK239T record file from your paper.

kwonej0617 commented 1 year ago

Hi @haotianteng!

Did you start it from running guppy with raw fast5 data, or did you download their fastq? Could you please share how you generate your data? I also tried to reproduce m6anet data but my result is very different from their supplementary data. It looks like you data is much close to theirs. I only got ~3k sites in my raw data.

Thank you!

chrishendra93 commented 1 year ago

hi @haotianteng, I was trying to map your dataset using the reference that I used for the HEK293T record and I could not because some of the transcript position is not in the t2g_dict that I created. Have you tried using the reference from the SG-next gihub? Try using Homo_sapiens.GRCh38.cdna.ncrna.fa and Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf")

haotianteng commented 1 year ago

Thanks for the reply! My reference genome is downloaded from ensembl, okay I can have a try with the genome files. Would you be able to share your t2g_dict with me? Maybe by exporting into a toml file? @kwonej0617 I download the fastq from the repository https://www.ebi.ac.uk/ena/browser/view/PRJEB40872 Used the wild type dataset rep1. Basecall using Guppy with high-ac configuration. And then aligned with minimap2

haotianteng commented 1 year ago

Hi @chrishendra93 I only found Homo_sapiens.GRCh38.91.gtf in SG-next github. Is that the correct gtf file to use?

chrishendra93 commented 1 year ago

hi @haotianteng, yeah I used the bam file from sg-nex so it should be mapped using that gtf. Personally I used https://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf.gz but I think the extra annotations from that file should not affect it

haotianteng commented 1 year ago

I got 0.812 AUC-ROC with 94818 entry, I guess it's okay? Although it's still slightly lower than the report 0.834 AUC-ROC.

chrishendra93 commented 1 year ago

hi @haotianteng , that's better than before. Did you try averaging the probability modified per genomic position as well?

haotianteng commented 1 year ago

I just used the data.result output, I think it's already a site-level probability data, can you elaborate on how to average the probability?

chrishendra93 commented 1 year ago

hi @haotianteng, the probability is site-level but it is on the transcriptomic level. The label data is on the genomic level, which means that multiple transcript positions that are mapped to the same genomic position will be assigned the same label. In the paper, we group those positions, i.e, by grouping on the gene id and genomic position and taking the average of the probability.

GoekeLab / m6anet

Reproducing HEK293T result #96