GoekeLab / m6anet

Detection of m6A from direct RNA-Seq data
https://m6anet.readthedocs.io/
MIT License
100 stars 17 forks source link

Reproducing Fig 2d,e; request for transcriptome2genome.py for dummies #125

Open hengjwj opened 11 months ago

hengjwj commented 11 months ago

Hi @chrishendra93,

I'm trying to reproduce part of Fig. 2d and 2e using the HCT116 and VIRC models provided in Git to verify that my installation of m6anet is intact and that I can run it as intended. Can I ask if I'm on the right track?

I've downloaded BAM and BLOW5 files for SGNex_Hek293T_directRNA_replicate1_run1 and SGNex_Hct116_directRNA_replicate2_run1 to do this and plan to compare the m6anet inference output on these datasets to the results found in supplementary table 3 and 4 (test on HCT116 and HEK293T respectively), filtering for the 500 genes that have been marked "Test" under set_type.

So far, the different steps seem to have run fine without returning any error or warning messages. m6anet inference on SGNex_Hct116_directRNA_replicate2_run1 returned 16,112 records in data.site_proba.csv. Of these, 2,538 records corresponded to the 500 genes marked for testing. I'm not familiar with Python so I used the R code mentioned in #104 for transcriptome-to-genome (t2g) conversion. However, I found only four records that had the same genomic positions as that found in supplementary table 3 and I'm not sure what exactly is the problem (e.g. could be the R code, could be my installation of m6anet, etc.). Do you have something like a transcriptome2genome.py script for dummies (I imagine the input will be the data from m6anet inference and the FASTA and GTF files required) so I can rule out the t2g conversion? Or if you have other suggestions, please let me know!

Joel

hengjwj commented 11 months ago

To add, I tried the code you wrote in #76 but got a KeyError from the first term like so:

>>> fasta_dict = readFasta("/mnt/c/Users/USER/Desktop/m6anet_data/Homo_sapiens.GRCh38.cdna.ncrna.fa")

>>> gtf_dict, _ = readGTF("/mnt/c/Users/USER/Desktop/m6anet_data/Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf")

>>> all_transcripts = readalltranscript("/mnt/c/Users/USER/Desktop/m6anet_data/SGNex_Hct116_directRNA_replicate2_run1/data.site_proba.csv")

>>> t2g_dict = {tx: t2g(tx, fasta_dict, gtf_dict) for tx in all_transcripts}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <dictcomp>
  File "<stdin>", line 4, in t2g
KeyError: 'ENST00000389680.2,103,81,0.4684189260005951,GAACA,0.3333333333333333'

This is the first line of all_transcripts:

>>> all_transcripts[0]
'ENST00000389680.2,103,81,0.4684189260005951,GAACA,0.3333333333333333'

I checked that the key can be found in fasta_dict and gtf_dict:

>>> fasta_dict['ENST00000389680']
'AATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAAACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGCGGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGACTACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGATACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAACACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGGAGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATACCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAGACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAGAAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAGAGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTCAAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGTCGTAACATGGTAAGTGTACTGGAAAGTGCACTTGGACGAAC'

>>> gtf_dict['ENST00000389680']
{'chr': 'MT', 'g_id': 'ENSG00000211459', 'strand': '+', 'transcript': [(648, 1601)], 'exon': [(648, 1601)], 'tx_exon': [(0, 953)]}

Is there something missing/wrong in t2g? Below is what I copied. I suspect, but I'm not sure, that the problem lies in the tx_id.split lines:

def t2g(tx_id, fasta_dict, gtf_dict):
    t2g_dict = {}
    tx_seq = fasta_dict[tx_id.split(".")[0]]
    tx_contig = gtf_dict[tx_id]['chr']
    g_id = gtf_dict[tx_id]['g_id']
    if tx_seq is None:
        return [], []
    for exon_num in range(len(gtf_dict[tx_id]['exon'])):
        g_interval = gtf_dict[tx_id]['exon'][exon_num]
        tx_interval = gtf_dict[tx_id]['tx_exon'][exon_num]
        for g_pos in range(g_interval[0], g_interval[1] + 1): # Exclude the rims of exons.
            dis_from_start = g_pos - g_interval[0]
            if gtf_dict[tx_id]['strand'] == "+":
                tx_pos = tx_interval[0] + dis_from_start
            elif gtf_dict[tx_id]['strand'] == "-":
                tx_pos = tx_interval[1] - dis_from_start
            if (g_interval[0] <= g_pos < g_interval[0]+2) or (g_interval[1]-2 < g_pos <= g_interval[1]): 
                kmer = 'XXXXX'
            else:
                kmer = tx_seq[tx_pos-2:tx_pos+3]
            t2g_dict[tx_pos] = g_pos # tx.contig is chromosome.
    return t2g_dict
hengjwj commented 11 months ago

Hi @chrishendra93,

I modified the code you wrote in #76. I made one change to the t2g function and its input to fix the KeyError I got earlier, added a function to append the genomic position to the original m6anet output, and saved the output in CSV. Can I trouble you to help verify if it looks correct?

def t2g(tx_id, fasta_dict, gtf_dict):
    t2g_dict = {}
    tx_seq = fasta_dict[tx_id]
    tx_contig = gtf_dict[tx_id]['chr']
    g_id = gtf_dict[tx_id]['g_id']
    if tx_seq is None:
        return [], []
    for exon_num in range(len(gtf_dict[tx_id]['exon'])):
        g_interval = gtf_dict[tx_id]['exon'][exon_num]
        tx_interval = gtf_dict[tx_id]['tx_exon'][exon_num]
        for g_pos in range(g_interval[0], g_interval[1] + 1): # Exclude the rims of exons.
            dis_from_start = g_pos - g_interval[0]
            if gtf_dict[tx_id]['strand'] == "+":
                tx_pos = tx_interval[0] + dis_from_start
            elif gtf_dict[tx_id]['strand'] == "-":
                tx_pos = tx_interval[1] - dis_from_start
            if (g_interval[0] <= g_pos < g_interval[0]+2) or (g_interval[1]-2 < g_pos <= g_interval[1]): 
                kmer = 'XXXXX'
            else:
                kmer = tx_seq[tx_pos-2:tx_pos+3]
            t2g_dict[tx_pos] = g_pos # tx.contig is chromosome.
    return t2g_dict

def appendt2g(all_transcripts, t2g_dict, gtf_dict):
    all_transcripts_with_genomic = []
    for transcript_entry in all_transcripts:
        transcript_id, transcriptome_position, n_reads, probability_modified, kmer, mod_ratio = transcript_entry.split(',')
        genomic_position = t2g_dict[transcript_entry][int(transcriptome_position) - 1]
        g_id = gtf_dict[transcript_id.split(".")[0]]['g_id']
        updated_entry = f"{transcript_id},{transcriptome_position},{n_reads},{probability_modified},{kmer},{mod_ratio},{genomic_position},{g_id}"
        all_transcripts_with_genomic.append(updated_entry)
    return all_transcripts_with_genomic

import csv

fasta_dict = readFasta("m6anet_data/Homo_sapiens.GRCh38.cdna.ncrna.fa")
gtf_dict, _ = readGTF("m6anet_data/Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf")
all_transcripts = readalltranscript("m6anet_data/SGNex_Hct116_directRNA_replicate2_run1/data.site_proba.csv")
t2g_dict = {tx: t2g(tx.split(".")[0], fasta_dict, gtf_dict) for tx in all_transcripts}
all_transcripts_with_genomic = appendt2g(all_transcripts, t2g_dict, gtf_dict)

with open('m6anet_data/SGNex_Hct116_directRNA_replicate2_run1/m6anet_t2g.csv', 'w') as file:
    writer = csv.writer(file)
    for row in all_transcripts_with_genomic:
        writer.writerow(row.split(','))