genomic position - Githubissues

acarmas1 commented 1 year ago

Hello,

I have used m6anet to identified the m6a modification in direct RNA reads of honey bees, and I got I table like this:

I added the column id_pos.

I want to see the distribution of my m6A sites across 5' UTR CDS and 3' UTR, I found the code used on your article, which is in the codeocean clapsule. The table used to create the figure needs a transcript annotations file and a .csv file with genomic position. However, when running m6anet the .csv file only has transcript_positions.

Is there anyway to obtain the genomic positions as well?, and also how can I create the transcript annotation file that contains the end of the 5' utr end of CDS and end tx.?

Thank you and I'm looking forward to hearing from you.

Best, Camila

chrishendra93 commented 1 year ago

hi @acarmas1 , sorry for the delay in my reply. I believe you have replied to the issue in [here](https://github.com/GoekeLab/m6anet/issues/54#issuecomment-1413163032 and you encountered an error which I am not quite sure about unless you share with me the function that you used and the full error log?

Thanks!

acarmas1 commented 1 year ago

Hi Chris, no problem

Also sorry for the late response. On this folder I added the code I run (gen_pos.sbatch), the error I got (out_genomic_position) and the files I used as inputs, which are in .fasta .gtf and .txt format. https://1drv.ms/u/s!AokqkR3muxL0jvIA82csbixN-ANRcQ?e=KWig3S

I'll appreciate your help and thank you.

chrishendra93 commented 1 year ago

hi @acarmas1 seems like a syntax error on gen_pos.py, specifically the last line, your dictionary is missing one curly braces at the end

acarmas1 commented 1 year ago

Thank you,

By adding the } at the end I was able to run it, but now I'm getting this error:

I think something is wrong with the exon_number argument of the GTF file.

chrishendra93 commented 1 year ago

hi @acarmas1, may I know which gtf file you are using? It might be formatted differently

acarmas1 commented 1 year ago

Sure, on this link you can find the gtf file I'm using https://1drv.ms/u/s!AokqkR3muxL0jvIA82csbixN-ANRcQ?e=KWig3S

acarmas1 commented 1 year ago

Please can someone help me with this

chrishendra93 commented 1 year ago

hi @acarmas1, sorry for the delay

I looked at your GTF file and it seems that the problem arises because your gtf file format is a bit different than what we used to deal with. I modified your gen_pos.py, try this out, t2g_dict will give transcript_position -> genomic_position for every transcript in your all_transcripts.txt file

`#!/usr/bin/env python

def t2g(tx_id, fasta_dict, gtf_dict):
    t2g_dict = {}
    tx_seq = fasta_dict[tx_id.split(".")[0]]
    tx_contig = gtf_dict[tx_id]['chr']
    g_id = gtf_dict[tx_id]['g_id']
    if tx_seq is None:
        return [], []

    for exon_num in range(len(gtf_dict[tx_id]['exon'])):
        g_interval = gtf_dict[tx_id]['exon'][exon_num]
        tx_interval = gtf_dict[tx_id]['tx_exon'][exon_num]
        for g_pos in range(g_interval[0], g_interval[1] + 1): # Exclude the rims of exons.
            dis_from_start = g_pos - g_interval[0]
            if gtf_dict[tx_id]['strand'] == "+":
                tx_pos = tx_interval[0] + dis_from_start
            elif gtf_dict[tx_id]['strand'] == "-":
                tx_pos = tx_interval[1] - dis_from_start
            if (g_interval[0] <= g_pos < g_interval[0]+2) or (g_interval[1]-2 < g_pos <= g_interval[1]): 
                kmer = 'XXXXX'
            else:
                kmer = tx_seq[tx_pos-2:tx_pos+3]
            t2g_dict[tx_pos] = g_pos # tx.contig is chromosome.
    return t2g_dict

def readFasta(infasta):
    fasta=open(infasta,"r")
    entries=""
    for ln in fasta:
        entries+=ln
    entries=entries.split(">")
    dict={}
    for entry in entries:
        entry=entry.split("\n")

        if len(entry[0].split()) > 0:
            id=entry[0].split()[0].split(".")[0]
            seq="".join(entry[1:])
            dict[id]=seq
    return dict

def readGTF(gtf_or_gff):
    gtf=open(gtf_or_gff,"r")
    dict,is_gff={},0
    for ln in gtf:
        if not ln.startswith("#"):
            ln=ln.strip("\n").split("\t")
            if is_gff == 0:
                if ln[-1].startswith("ID") or ln[-1].startswith("Parent"):
                    is_gff = 1
                else:
                    is_gff = -1

            if is_gff < 0:
                if ln[2] == "transcript" or ln[2] == "exon":
                    chr,type,start,end=ln[0],ln[2],int(ln[3]),int(ln[4])
                    attrList=ln[-1].split(";")
                    attrDict={}
                    for k in attrList:
                        p=k.strip().split(" ")
                        if len(p) == 2:
                            attrDict[p[0]]=p[1].strip('\"')
                    ##tx_id=ln[-1].split('; transcript_id "')[1].split('";')[0]
                    ##g_id=ln[-1].split('gene_id "')[1].split('";')[0]
                    tx_id = attrDict["transcript_id"]
                    g_id = attrDict["gene_id"]

                    if tx_id not in dict:
                        dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
                        if type not in dict[tx_id]:
                            if type in ("transcript", "exon"):
                                dict[tx_id][type]=[(start,end)]

                    else:
                        if type == "exon":
                            if type not in dict[tx_id]:
                                dict[tx_id][type]=[(start,end)]
                            else:
                                dict[tx_id][type].append((start,end))

            if is_gff > 0:
                if ln[2] == "exon" or ln[2] == "mRNA":
                    chr,type,start,end=ln[0],ln[2],int(ln[3]),int(ln[4])
                    tx_id=ln[-1].split('transcript:')[1].split(';')[0]
                    if ln[2] == "mRNA":
                        type="transcript"
                    if tx_id not in dict:
                        dict[tx_id]={'chr':chr,'strand':ln[6]}
                        if type == "transcript":
                            dict[tx_id][type]=(start,end)
                        if type == "exon":
                            dict[tx_id][type]=[(start,end)]
                    else:
                        if type == "transcript" and type not in dict[tx_id]:
                            dict[tx_id][type]=(start,end)
                        if type == "exon":
                            if type not in dict[tx_id]:
                                dict[tx_id][type]=[(start,end)]
                            else:
                                dict[tx_id][type].append((start,end))
    #convert genomic positions to tx positions
    if is_gff < 0:
        for id in dict:
            tx_pos,tx_start=[],0
            if 'exon' not in dict[id]:
                print(id)
                continue
            for pair in dict[id]["exon"]:
                tx_end=pair[1]-pair[0]+tx_start
                tx_pos.append((tx_start,tx_end))
                tx_start=tx_end+1
            dict[id]['tx_exon']=tx_pos
    else:
        for id in dict:
            tx_pos,tx_start=[],0
            if dict[id]["strand"] == "-":
                dict[id]["exon"].sort(key=lambda tup: tup[0], reverse=True)
            for pair in dict[id]["exon"]:
                tx_end=pair[1]-pair[0]+tx_start
                tx_pos.append((tx_start,tx_end))
                tx_start=tx_end+1
            dict[id]['tx_exon']=tx_pos
    return (dict,is_gff)

def readalltranscript(intranscript):
    all_transcripts = []
    i = 0
    with open(intranscript,"r") as f:
        for line in f:
            if i == 0: # Skip the first line
                i+=1
                continue
            else:
                all_transcripts.append(line.strip("\n"))
    return all_transcripts  

fasta_dict = readFasta("/projects/dsn001/camila/m6A/genomes/ApisMellifera_RNA.fasta")
gtf_dict, _ = readGTF("/projects/dsn001/camila/m6A/genomes/GCF_003254395.2_Amel_HAv3.1_genomic.gtf")
all_transcripts = readalltranscript("/projects/dsn001/camila/m6A/m6anet_analysis/pooled_rep/Rfiltered_data/trans2.txt")
t2g_dict = {tx: t2g(tx, fasta_dict, gtf_dict) for tx in all_transcripts

`

acarmas1 commented 1 year ago

Thank you, I tried to run the code, but I got this error

I saw this error EOF implies that the interpreter has reached the end of our program before executing all the code.

chrishendra93 commented 1 year ago

hi @acarmas1 , yeah you encountered that error because the t2g dict does not have a closing bracket. Try

t2g_dict = {tx: t2g(tx, fasta_dict, gtf_dict) for tx in all_transcripts}

acarmas1 commented 1 year ago

By running it with the closing brackets I didn't get any errors, but I also didn't get any new files or results.

chrishendra93 commented 1 year ago

hi @acarmas1 , the script gives you a Python dictionary that provides a mapping between each transcriptomic position to a genomic position in t2g_dict variable. Will you be able to use this to map the position in your data yourself? I'm a bit packed with other projects recently so I might take a while to help with code implementation

acarmas1 commented 1 year ago

The problem is that after running the script I didn't get any t2g_dict file. Therefore, I was wondering if you could help me please. After getting the dictionary, yes run the mapping by myself.

chrishendra93 commented 1 year ago

yeah the dictionary is not saved as a file, you can use it directly to map your position. If you want to save it, then you have to save the t2g_dict within that script into a file

chrishendra93 commented 1 year ago

hi @acarmas1 if you're more familiar with R, you can also check this discussion thread

acarmas1 commented 1 year ago

Yes, I'm more familiar with R. I'll give it a try. Thanks for sharing the discussion thread.

GoekeLab / m6anet

genomic position #76