annotation table format

Mahnaz-Shemirani commented 1 year ago

Hi, I have used a gtf file for annotation of the script. When I print the created annotation file from this file I get a table with the following information:

print(annotation) transcript l_tr l_utr5 l_cds l_utr3 1: mRNA:Lalb_Chr00c01g0403611 292 0 292 0 2: mRNA:Lalb_Chr00c01g0403631 1180 0 1180 0 3: mRNA:Lalb_Chr00c01g0403641 852 0 852 0 4: mRNA:Lalb_Chr00c01g0403661 211 0 211 0 5: mRNA:Lalb_Chr00c01g0403671 220 0 220 0

as you can see '0' is donated to 5UTR and 3UTR. This is for all transcripts. I have generated gtf file myself and I'll get the same with both format of 'five_prime_UTR'/'three_prime_UTR' and '5UTR'/'3UTR'. I was wondering how a gft file should look like in order to be used by RioWaltz

fabiolauria commented 1 year ago

Hi there. riboWaltz uses standard GTF files where, to my knowledge, the UTRs field are actually labelled with 3UTR and 5UTR. Moreover, the riboWaltz function _createannotation is based on makeTxDbFromGFF, fiveUTRsByTranscript and threeUTRsByTranscriptfrom (and others) from the GenomicFeatures package, which also take as input standard GTFs. Can you please send me the GTF you passed as input, or at least some rows containing the UTRs field?

Best Fabio

Mahnaz-Shemirani commented 1 year ago

Hi, Thanks for replying. Here is an example of gtf file:

Lalb_Chr01 genome five_prime_UTR 1107 1156 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome start_codon 1157 1159 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome exon 1157 1169 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome CDS 1157 1169 . + 0 transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome stop_codon 1170 1172 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome three_prime_UTR 1173 1222 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;

I also tried , Lalb_Chr01 genome 5UTR 1107 1156 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome start_codon 1157 1159 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome exon 1157 1169 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome CDS 1157 1169 . + 0 transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome stop_codon 1170 1172 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157; Lalb_Chr01 genome 3UTR 1173 1222 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;

Best Mahnaz

fabiolauria commented 1 year ago

Hi Mahnaz, I was wrong, both the 5' UTR and 3' UTR fields are labeled with just "UTR", as you can see in the following chunk from one GTF I downloaded from GENCODE here

chr13   HAVANA  transcript  100124852   100137690   .   +   .   gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000022147.14"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-201"; level 2; protein_id "ENSMUSP00000022147.8"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS26725.1"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084239.1";
chr13   HAVANA  exon    100124852   100124965   .   +   .   gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000022147.14"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-201"; exon_number 1; exon_id "ENSMUSE00000732564.1"; level 2; protein_id "ENSMUSP00000022147.8"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS26725.1"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084239.1";
chr13   HAVANA  CDS 100124894   100124965   .   +   0   gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000022147.14"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-201"; exon_number 1; exon_id "ENSMUSE00000732564.1"; level 2; protein_id "ENSMUSP00000022147.8"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS26725.1"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084239.1";
chr13   HAVANA  start_codon 100124894   100124896   .   +   0   gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000022147.14"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-201"; exon_number 1; exon_id "ENSMUSE00000732564.1"; level 2; protein_id "ENSMUSP00000022147.8"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS26725.1"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084239.1";
chr13   HAVANA  CDS 100135689   100135733   .   +   0   gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000091321.5"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-202"; exon_number 2; exon_id "ENSMUSE00001246674.1"; level 2; protein_id "ENSMUSP00000088871.5"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084241.1";
chr13   HAVANA  stop_codon  100135734   100135736   .   +   0   gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000091321.5"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-202"; exon_number 2; exon_id "ENSMUSE00001246674.1"; level 2; protein_id "ENSMUSP00000088871.5"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084241.1";
chr13   HAVANA  exon    100137381   100137480   .   +   .   gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000091321.5"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-202"; exon_number 3; exon_id "ENSMUSE00000810281.1"; level 2; protein_id "ENSMUSP00000088871.5"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084241.1";
chr13   HAVANA  UTR 100135734   100135739   .   +   .   gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000091321.5"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-202"; exon_number 2; exon_id "ENSMUSE00001246674.1"; level 2; protein_id "ENSMUSP00000088871.5"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084241.1";
chr13   HAVANA  UTR 100137381   100137480   .   +   .   gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000091321.5"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-202"; exon_number 3; exon_id "ENSMUSE00000810281.1"; level 2; protein_id "ENSMUSP00000088871.5"; transcript_support_level "1"; mgi_id "MGI:109257"; tag "basic"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084241.1";
chr13   HAVANA  transcript  100124924   100131174   .   +   .   gene_id "ENSMUSG00000021645.16"; transcript_id "ENSMUST00000143937.1"; gene_type "protein_coding"; gene_name "Smn1"; transcript_type "protein_coding"; transcript_name "Smn1-207"; level 2; protein_id "ENSMUSP00000119381.1"; transcript_support_level "5"; mgi_id "MGI:109257"; tag "not_organism_supported"; tag "mRNA_start_NF"; tag "mRNA_end_NF"; tag "cds_start_NF"; tag "cds_end_NF"; havana_gene "OTTMUSG00000033496.2"; havana_transcript "OTTMUST00000084246.2";

At this point I can only guess that the length of the UTRs and the choice between 5' UTR and 3' UTR is based on their position in the chromosome compared with the position of exons, coding sequences, stop and start codons etc.

Let's try to use "UTR" and see what happens.

Best Fabio

Mahnaz-Shemirani commented 1 year ago

Hi Fabio, Thanks a lot. Sure! I checked it, Still the annotation table consists of '0's for both 5UTR and 3UTR. As I showed below. My Bioconductor is V 3.17. I was wondering is RiboWaltz compatible with this version?

print(annotation) transcript l_tr l_utr5 l_cds l_utr3 1: mRNA:Lalb_Chr00c01g0403611 292 0 292 0 2: mRNA:Lalb_Chr00c01g0403631 1180 0 1180 0 3: mRNA:Lalb_Chr00c01g0403641 852 0 852 0 4: mRNA:Lalb_Chr00c01g0403661 211 0 211 0 5: mRNA:Lalb_Chr00c01g0403671 220 0 220 0

fabiolauria commented 1 year ago

It should be. What I can suggest is to run the following code, which is riboWaltz-independent (note: replace PATH_TO_GTF)

library(data.table)
txdbanno <- GenomicFeatures::makeTxDbFromGFF(file = PATH_TO_GTF, format = "gtf", dataSource = NA, organism = NA)

utr5<- suppressWarnings(GenomicFeatures::fiveUTRsByTranscript(txdbanno,use.names=T))
utr5 <- as.data.table(utr5[unique(names(utr5))])
l_utr5 <- utr5[, list(l_utr5 = sum(width)), by = list(transcript = group_name)]

utr3<- suppressWarnings(GenomicFeatures::threeUTRsByTranscript(txdbanno,use.names=T))
utr3 <-as.data.table(utr3[unique(names(utr3))])
l_utr3 <- utr3[, list(l_utr3 = sum(width)), by = list(transcript = group_name)]

These are the very functions used to retrieve the length of the UTRs starting from the GTF. If something is still wrong, then most likely it has something to do with your GTF rather than with riboWaltz itself, but let's see.

Mahnaz-Shemirani commented 1 year ago

Well, I did and printed each one separately:

print(txdbanno) TxDb object: Db type: TxDb Supporting package: GenomicFeatures Data source: /mnt/picea/home/mshemirani/Git/lupinMicroPeptide/results/non_overlap/Lalbus_CDS_annot_rowexpand.gtf rganism: NA Taxonomy ID: NA miRBase build ID: NA Genome: NA Nb of transcripts: 38258 Db created by: GenomicFeatures package from Bioconductor Creation time: 2023-09-04 11:31:10 +0000 (Mon, 04 Sep 2023) GenomicFeatures version at creation time: 1.52.1 RSQLite version at creation time: 2.3.1 DBSCHEMAVERSION: 1.2 print(l_utr5) Empty data.table (0 rows and 2 cols): transcript,l_utr5 print(l_utr3) Empty data.table (0 rows and 2 cols): transcript,l_utr3

fabiolauria commented 1 year ago

Mmmm so it seems is not riboWaltz the culprit. GenomicFeatures finds your transcripts (38258). However, for some reason it doesn't like the UTR rows. Even if it's not riboWaltz-related anymore, I'll think about it and get back to you asap. If you want, in the meanwhile you can send me the GTF (or part of it) so I can perform some tests on it.

Best Fabio

Mahnaz-Shemirani commented 1 year ago

Sure! how can I send it to you? I cannot send it to you here. I read the script for annotation. I'm not familiar with GenomicsFeatures, I was wondering how the length of 5UTR/3UTR is calculated?

Best Mahnaz

Here is the script I used for generating gtf file for your reference:

''' row1 = [chrom, 'EuGene','5UTR', start_codon - 50, start_codon - 1,'.', strand, '.', 'transcript_id mRNA:%s; gene_id gene:%s' % (transcript_id, gene_id)] row2 = [chrom, 'EuGene', 'start_codon', start_codon, start_codon + 2, '.', strand, '.', 'transcript_id mRNA:%s; gene_id gene:%s' %(transcript_id, gene_id)] row3 = [chrom, 'EuGene', 'exon', start_codon, stop_codon, '.', strand, '.', 'transcript_id mRNA:%s; gene_id gene:%s' % (transcript_id, gene_id)] row4 = [chrom, 'EuGene', 'CDS', start_codon, stop_codon, '.', strand, score, 'transcript_id mRNA:%s; gene_id gene:%s' % (transcript_id, gene_id)] row5 = [chrom, 'EuGene', 'stop_codon', stop_codon - 2, stop_codon, '.', strand, '.', 'transcript_id mRNA:%s; gene_id gene:%s' % (transcript_id, gene_id)] row6 = [chrom, 'EuGene', '3UTR', stop_codon + 1, stop_codon + 50, '.', strand, '.', 'transcript_id mRNA:%s; gene_id gene:%s' % (transcript_id, gene_id)]

return [row1, row2, row3, row4, row5, row6]

'''

fabiolauria commented 1 year ago

Hi Mahnaz, sorry for the super-late reply. At least I'm bringing a potential solution for you issue (even if you changed approach in the meanwhile, it might be useful for the future).

To understand what was going on, I introduced several changes to you original GTF i.e.

Lalb_Chr01 genome five_prime_UTR 1107 1156 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome start_codon 1157 1159 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome exon 1157 1169 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome CDS 1157 1169 . + 0 transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome stop_codon 1170 1172 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome three_prime_UTR 1173 1222 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;

to test whether one modification or a combination of them would have led to a different result.

After a while, I realized that the exon(s) (only one here) listed in this chunk of GTF didn't include the whole transcript. Not only it didn't include the stop codon but, most importantly for you issue, it didn't include the UTRs. This is why the GenomicFeatures-based script didn't work as expected.

I changed the chunk above as follows, where only the first and last nucleotide of the exon are different to make the exon span from the beginning of the transcript to its end.

Lalb_Chr01 genome five_prime_UTR 1107 1156 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome start_codon 1157 1159 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome exon 1107 1222 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome CDS 1157 1169 . + 0 transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome stop_codon 1170 1172 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;
Lalb_Chr01 genome three_prime_UTR 1173 1222 . + . transcript_id mRNA:Lalb_Chr01fs1157; gene_id gene:Lalb_Chr01fs1157;

Now, putting only these six lines in what I called test.gtf and running

create_annotation("C:/.../test.gtf")

I obtained

               transcript   l_tr  l_utr5  l_cds  l_utr3
1: mRNA:Lalb_Chr01fs1157    116      50     16      50

This is exactly what it should be.

With the goal of covering the whole transcript with exon objects, another option is not to stretch the existing exon but to add some of them to fill the gaps.

This is it. Let me know if I can do anything else for you related to this topic. If not, you can close this issue and feel free to open new ones whenever you need help.

Best Fabio

fabiolauria commented 10 months ago

Hi there, I'm closing this issue due to inactivity. Please re-open if required.

Best Fabio

LabTranslationalArchitectomics / riboWaltz

annotation table format #74