Closed Mahnaz-Shemirani closed 10 months ago
Hi there. riboWaltz uses standard GTF files where, to my knowledge, the UTRs field are actually labelled with 3UTR and 5UTR. Moreover, the riboWaltz function _createannotation is based on makeTxDbFromGFF, fiveUTRsByTranscript and threeUTRsByTranscriptfrom (and others) from the GenomicFeatures package, which also take as input standard GTFs. Can you please send me the GTF you passed as input, or at least some rows containing the UTRs field?
Best Fabio
Hi, Thanks for replying. Here is an example of gtf file:
Lalb_Chr01 genome five_prime_UTR 1107 1156 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome start_codon 1157 1159 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome exon 1157 1169 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome CDS 1157 1169 . + 0 transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome stop_codon 1170 1172 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome three_prime_UTR 1173 1222 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
I also tried , Lalb_Chr01 genome 5UTR 1107 1156 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome start_codon 1157 1159 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome exon 1157 1169 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome CDS 1157 1169 . + 0 transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome stop_codon 1170 1172 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome 3UTR 1173 1222 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Best Mahnaz
Hi Mahnaz, I was wrong, both the 5' UTR and 3' UTR fields are labeled with just "UTR", as you can see in the following chunk from one GTF I downloaded from GENCODE here
chr13 HAVANA transcript 100124852 100137690 . + . gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000022147.14"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-201"; level 2; protein_id "ENSMUSP00000022147.8"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS26725.1"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084239.1";
chr13 HAVANA exon 100124852 100124965 . + . gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000022147.14"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-201"; exon_number 1; exon_id "ENSMUSE00000732564.1"; level 2; protein_id "ENSMUSP00000022147.8"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS26725.1"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084239.1";
chr13 HAVANA CDS 100124894 100124965 . + 0 gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000022147.14"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-201"; exon_number 1; exon_id "ENSMUSE00000732564.1"; level 2; protein_id "ENSMUSP00000022147.8"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS26725.1"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084239.1";
chr13 HAVANA start_codon 100124894 100124896 . + 0 gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000022147.14"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-201"; exon_number 1; exon_id "ENSMUSE00000732564.1"; level 2; protein_id "ENSMUSP00000022147.8"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS26725.1"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084239.1";
chr13 HAVANA CDS 100135689 100135733 . + 0 gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000091321.5"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-202"; exon_number 2; exon_id "ENSMUSE00001246674.1"; level 2; protein_id "ENSMUSP00000088871.5"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084241.1";
chr13 HAVANA stop_codon 100135734 100135736 . + 0 gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000091321.5"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-202"; exon_number 2; exon_id "ENSMUSE00001246674.1"; level 2; protein_id "ENSMUSP00000088871.5"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084241.1";
chr13 HAVANA exon 100137381 100137480 . + . gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000091321.5"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-202"; exon_number 3; exon_id "ENSMUSE00000810281.1"; level 2; protein_id "ENSMUSP00000088871.5"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084241.1";
chr13 HAVANA UTR 100135734 100135739 . + . gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000091321.5"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-202"; exon_number 2; exon_id "ENSMUSE00001246674.1"; level 2; protein_id "ENSMUSP00000088871.5"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084241.1";
chr13 HAVANA UTR 100137381 100137480 . + . gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000091321.5"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-202"; exon_number 3; exon_id "ENSMUSE00000810281.1"; level 2; protein_id "ENSMUSP00000088871.5"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084241.1";
chr13 HAVANA transcript 100124924 100131174 . + . gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000143937.1"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-207"; level 2; protein_id "ENSMUSP00000119381.1"; transcript_support_level "5"; mgi_id "MGI:109257"; tag "not_organism_supported"; tag "mRNA_start_NF"; tag "mRNA_end_NF"; tag "cds_start_NF"; tag "cds_end_NF"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084246.2";
At this point I can only guess that the length of the UTRs and the choice between 5' UTR and 3' UTR is based on their position in the chromosome compared with the position of exons, coding sequences, stop and start codons etc.
Let's try to use "UTR" and see what happens.
Best Fabio
Hi Fabio, Thanks a lot. Sure! I checked it, Still the annotation table consists of '0's for both 5UTR and 3UTR. As I showed below. My Bioconductor is V 3.17. I was wondering is RiboWaltz compatible with this version?
print(annotation) transcript l_tr l_utr5 l_cds l_utr3 1: mRNA:Lalb_Chr00c01g0403611 292 0 292 0 2: mRNA:Lalb_Chr00c01g0403631 1180 0 1180 0 3: mRNA:Lalb_Chr00c01g0403641 852 0 852 0 4: mRNA:Lalb_Chr00c01g0403661 211 0 211 0 5: mRNA:Lalb_Chr00c01g0403671 220 0 220 0
It should be. What I can suggest is to run the following code, which is riboWaltz-independent (note: replace PATH_TO_GTF)
library(data.table)
txdbanno <- GenomicFeatures::makeTxDbFromGFF(file = PATH_TO_GTF, format = "gtf", dataSource = NA, organism = NA)
utr5<- suppressWarnings(GenomicFeatures::fiveUTRsByTranscript(txdbanno,use.names=T))
utr5 <- as.data.table(utr5[unique(names(utr5))])
l_utr5 <- utr5[, list(l_utr5 = sum(width)), by = list(transcript = group_name)]
utr3<- suppressWarnings(GenomicFeatures::threeUTRsByTranscript(txdbanno,use.names=T))
utr3 <-as.data.table(utr3[unique(names(utr3))])
l_utr3 <- utr3[, list(l_utr3 = sum(width)), by = list(transcript = group_name)]
These are the very functions used to retrieve the length of the UTRs starting from the GTF. If something is still wrong, then most likely it has something to do with your GTF rather than with riboWaltz itself, but let's see.
Well, I did and printed each one separately:
print(txdbanno) TxDb object: Db type: TxDb Supporting package: GenomicFeatures Data source: /mnt/picea/home/mshemirani/Git/lupinMicroPeptide/results/non_overlap/Lalbus_CDS_annot_rowexpand.gtf rganism: NA Taxonomy ID: NA miRBase build ID: NA Genome: NA Nb of transcripts: 38258 Db created by: GenomicFeatures package from Bioconductor Creation time: 2023-09-04 11:31:10 +0000 (Mon, 04 Sep 2023) GenomicFeatures version at creation time: 1.52.1 RSQLite version at creation time: 2.3.1 DBSCHEMAVERSION: 1.2 print(l_utr5) Empty data.table (0 rows and 2 cols): transcript,l_utr5 print(l_utr3) Empty data.table (0 rows and 2 cols): transcript,l_utr3
Mmmm so it seems is not riboWaltz the culprit. GenomicFeatures finds your transcripts (38258). However, for some reason it doesn't like the UTR rows. Even if it's not riboWaltz-related anymore, I'll think about it and get back to you asap. If you want, in the meanwhile you can send me the GTF (or part of it) so I can perform some tests on it.
Best Fabio
Sure! how can I send it to you? I cannot send it to you here. I read the script for annotation. I'm not familiar with GenomicsFeatures, I was wondering how the length of 5UTR/3UTR is calculated?
Best Mahnaz
Here is the script I used for generating gtf file for your reference:
''' row1 = [chrom, 'EuGene','5UTR', start_codon - 50, start_codon - 1,'.', strand, '.', 'transcript_id mRNA:%s; gene_id gene:%s' % (transcript_id, gene_id)] row2 = [chrom, 'EuGene', 'start_codon', start_codon, start_codon + 2, '.', strand, '.', 'transcript_id mRNA:%s; gene_id gene:%s' %(transcript_id, gene_id)] row3 = [chrom, 'EuGene', 'exon', start_codon, stop_codon, '.', strand, '.', 'transcript_id mRNA:%s; gene_id gene:%s' % (transcript_id, gene_id)] row4 = [chrom, 'EuGene', 'CDS', start_codon, stop_codon, '.', strand, score, 'transcript_id mRNA:%s; gene_id gene:%s' % (transcript_id, gene_id)] row5 = [chrom, 'EuGene', 'stop_codon', stop_codon - 2, stop_codon, '.', strand, '.', 'transcript_id mRNA:%s; gene_id gene:%s' % (transcript_id, gene_id)] row6 = [chrom, 'EuGene', '3UTR', stop_codon + 1, stop_codon + 50, '.', strand, '.', 'transcript_id mRNA:%s; gene_id gene:%s' % (transcript_id, gene_id)]
return [row1, row2, row3, row4, row5, row6]
'''
Hi Mahnaz, sorry for the super-late reply. At least I'm bringing a potential solution for you issue (even if you changed approach in the meanwhile, it might be useful for the future).
To understand what was going on, I introduced several changes to you original GTF i.e.
Lalb_Chr01 genome five_prime_UTR 1107 1156 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome start_codon 1157 1159 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome exon 1157 1169 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome CDS 1157 1169 . + 0 transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome stop_codon 1170 1172 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome three_prime_UTR 1173 1222 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
to test whether one modification or a combination of them would have led to a different result.
After a while, I realized that the exon(s) (only one here) listed in this chunk of GTF didn't include the whole transcript. Not only it didn't include the stop codon but, most importantly for you issue, it didn't include the UTRs. This is why the GenomicFeatures
-based script didn't work as expected.
I changed the chunk above as follows, where only the first and last nucleotide of the exon are different to make the exon span from the beginning of the transcript to its end.
Lalb_Chr01 genome five_prime_UTR 1107 1156 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome start_codon 1157 1159 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome exon 1107 1222 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome CDS 1157 1169 . + 0 transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome stop_codon 1170 1172 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome three_prime_UTR 1173 1222 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Now, putting only these six lines in what I called test.gtf and running
create_annotation("C:/.../test.gtf")
I obtained
transcript l_tr l_utr5 l_cds l_utr3
1: mRNA:Lalb_Chr01fs1157 116 50 16 50
This is exactly what it should be.
With the goal of covering the whole transcript with exon objects, another option is not to stretch the existing exon but to add some of them to fill the gaps.
This is it. Let me know if I can do anything else for you related to this topic. If not, you can close this issue and feel free to open new ones whenever you need help.
Best Fabio
Hi there, I'm closing this issue due to inactivity. Please re-open if required.
Best Fabio
Hi, I have used a gtf file for annotation of the script. When I print the created annotation file from this file I get a table with the following information:
as you can see '0' is donated to 5UTR and 3UTR. This is for all transcripts. I have generated gtf file myself and I'll get the same with both format of 'five_prime_UTR'/'three_prime_UTR' and '5UTR'/'3UTR'. I was wondering how a gft file should look like in order to be used by RioWaltz