Closed acarmas1 closed 1 year ago
hi @acarmas1 , sorry for the delay in my reply. I believe you have replied to the issue in [here](https://github.com/GoekeLab/m6anet/issues/54#issuecomment-1413163032 and you encountered an error which I am not quite sure about unless you share with me the function that you used and the full error log?
Thanks!
Hi Chris, no problem
Also sorry for the late response. On this folder I added the code I run (gen_pos.sbatch), the error I got (out_genomic_position) and the files I used as inputs, which are in .fasta .gtf and .txt format. https://1drv.ms/u/s!AokqkR3muxL0jvIA82csbixN-ANRcQ?e=KWig3S
I'll appreciate your help and thank you.
hi @acarmas1 seems like a syntax error on gen_pos.py, specifically the last line, your dictionary is missing one curly braces at the end
Thank you,
By adding the } at the end I was able to run it, but now I'm getting this error:
I think something is wrong with the exon_number argument of the GTF file.
hi @acarmas1, may I know which gtf file you are using? It might be formatted differently
Sure, on this link you can find the gtf file I'm using https://1drv.ms/u/s!AokqkR3muxL0jvIA82csbixN-ANRcQ?e=KWig3S
Please can someone help me with this
hi @acarmas1, sorry for the delay
I looked at your GTF file and it seems that the problem arises because your gtf file format is a bit different than what we used to deal with. I modified your gen_pos.py, try this out, t2g_dict will give transcript_position -> genomic_position for every transcript in your all_transcripts.txt file
`#!/usr/bin/env python
def t2g(tx_id, fasta_dict, gtf_dict):
t2g_dict = {}
tx_seq = fasta_dict[tx_id.split(".")[0]]
tx_contig = gtf_dict[tx_id]['chr']
g_id = gtf_dict[tx_id]['g_id']
if tx_seq is None:
return [], []
for exon_num in range(len(gtf_dict[tx_id]['exon'])):
g_interval = gtf_dict[tx_id]['exon'][exon_num]
tx_interval = gtf_dict[tx_id]['tx_exon'][exon_num]
for g_pos in range(g_interval[0], g_interval[1] + 1): # Exclude the rims of exons.
dis_from_start = g_pos - g_interval[0]
if gtf_dict[tx_id]['strand'] == "+":
tx_pos = tx_interval[0] + dis_from_start
elif gtf_dict[tx_id]['strand'] == "-":
tx_pos = tx_interval[1] - dis_from_start
if (g_interval[0] <= g_pos < g_interval[0]+2) or (g_interval[1]-2 < g_pos <= g_interval[1]):
kmer = 'XXXXX'
else:
kmer = tx_seq[tx_pos-2:tx_pos+3]
t2g_dict[tx_pos] = g_pos # tx.contig is chromosome.
return t2g_dict
def readFasta(infasta):
fasta=open(infasta,"r")
entries=""
for ln in fasta:
entries+=ln
entries=entries.split(">")
dict={}
for entry in entries:
entry=entry.split("\n")
if len(entry[0].split()) > 0:
id=entry[0].split()[0].split(".")[0]
seq="".join(entry[1:])
dict[id]=seq
return dict
def readGTF(gtf_or_gff):
gtf=open(gtf_or_gff,"r")
dict,is_gff={},0
for ln in gtf:
if not ln.startswith("#"):
ln=ln.strip("\n").split("\t")
if is_gff == 0:
if ln[-1].startswith("ID") or ln[-1].startswith("Parent"):
is_gff = 1
else:
is_gff = -1
if is_gff < 0:
if ln[2] == "transcript" or ln[2] == "exon":
chr,type,start,end=ln[0],ln[2],int(ln[3]),int(ln[4])
attrList=ln[-1].split(";")
attrDict={}
for k in attrList:
p=k.strip().split(" ")
if len(p) == 2:
attrDict[p[0]]=p[1].strip('\"')
##tx_id=ln[-1].split('; transcript_id "')[1].split('";')[0]
##g_id=ln[-1].split('gene_id "')[1].split('";')[0]
tx_id = attrDict["transcript_id"]
g_id = attrDict["gene_id"]
if tx_id not in dict:
dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
if type not in dict[tx_id]:
if type in ("transcript", "exon"):
dict[tx_id][type]=[(start,end)]
else:
if type == "exon":
if type not in dict[tx_id]:
dict[tx_id][type]=[(start,end)]
else:
dict[tx_id][type].append((start,end))
if is_gff > 0:
if ln[2] == "exon" or ln[2] == "mRNA":
chr,type,start,end=ln[0],ln[2],int(ln[3]),int(ln[4])
tx_id=ln[-1].split('transcript:')[1].split(';')[0]
if ln[2] == "mRNA":
type="transcript"
if tx_id not in dict:
dict[tx_id]={'chr':chr,'strand':ln[6]}
if type == "transcript":
dict[tx_id][type]=(start,end)
if type == "exon":
dict[tx_id][type]=[(start,end)]
else:
if type == "transcript" and type not in dict[tx_id]:
dict[tx_id][type]=(start,end)
if type == "exon":
if type not in dict[tx_id]:
dict[tx_id][type]=[(start,end)]
else:
dict[tx_id][type].append((start,end))
#convert genomic positions to tx positions
if is_gff < 0:
for id in dict:
tx_pos,tx_start=[],0
if 'exon' not in dict[id]:
print(id)
continue
for pair in dict[id]["exon"]:
tx_end=pair[1]-pair[0]+tx_start
tx_pos.append((tx_start,tx_end))
tx_start=tx_end+1
dict[id]['tx_exon']=tx_pos
else:
for id in dict:
tx_pos,tx_start=[],0
if dict[id]["strand"] == "-":
dict[id]["exon"].sort(key=lambda tup: tup[0], reverse=True)
for pair in dict[id]["exon"]:
tx_end=pair[1]-pair[0]+tx_start
tx_pos.append((tx_start,tx_end))
tx_start=tx_end+1
dict[id]['tx_exon']=tx_pos
return (dict,is_gff)
def readalltranscript(intranscript):
all_transcripts = []
i = 0
with open(intranscript,"r") as f:
for line in f:
if i == 0: # Skip the first line
i+=1
continue
else:
all_transcripts.append(line.strip("\n"))
return all_transcripts
fasta_dict = readFasta("/projects/dsn001/camila/m6A/genomes/ApisMellifera_RNA.fasta")
gtf_dict, _ = readGTF("/projects/dsn001/camila/m6A/genomes/GCF_003254395.2_Amel_HAv3.1_genomic.gtf")
all_transcripts = readalltranscript("/projects/dsn001/camila/m6A/m6anet_analysis/pooled_rep/Rfiltered_data/trans2.txt")
t2g_dict = {tx: t2g(tx, fasta_dict, gtf_dict) for tx in all_transcripts
`
Thank you, I tried to run the code, but I got this error
I saw this error EOF implies that the interpreter has reached the end of our program before executing all the code.
hi @acarmas1 , yeah you encountered that error because the t2g dict does not have a closing bracket. Try
t2g_dict = {tx: t2g(tx, fasta_dict, gtf_dict) for tx in all_transcripts}
By running it with the closing brackets I didn't get any errors, but I also didn't get any new files or results.
hi @acarmas1 , the script gives you a Python dictionary that provides a mapping between each transcriptomic position to a genomic position in t2g_dict variable. Will you be able to use this to map the position in your data yourself? I'm a bit packed with other projects recently so I might take a while to help with code implementation
The problem is that after running the script I didn't get any t2g_dict file. Therefore, I was wondering if you could help me please. After getting the dictionary, yes run the mapping by myself.
yeah the dictionary is not saved as a file, you can use it directly to map your position. If you want to save it, then you have to save the t2g_dict within that script into a file
hi @acarmas1 if you're more familiar with R, you can also check this discussion thread
Yes, I'm more familiar with R. I'll give it a try. Thanks for sharing the discussion thread.
Hello,
I have used m6anet to identified the m6a modification in direct RNA reads of honey bees, and I got I table like this:
I added the column id_pos.
I want to see the distribution of my m6A sites across 5' UTR CDS and 3' UTR, I found the code used on your article, which is in the codeocean clapsule. The table used to create the figure needs a transcript annotations file and a .csv file with genomic position. However, when running m6anet the .csv file only has transcript_positions.
Is there anyway to obtain the genomic positions as well?, and also how can I create the transcript annotation file that contains the end of the 5' utr end of CDS and end tx.?
Thank you and I'm looking forward to hearing from you.
Best, Camila