Closed fairliereese closed 2 years ago
Hi Fairlie!
Thanks for trying out TAMA!
I think all we need here is a convertor from TALON GTF to TAMA BED. Sorry but could you send a copy paste of a few lines from the TALON GTF you are using as input?
I have some GTF/GFF to BED convertors so one of them might work already or I can modify one to work with TALON.
Thank you, Richard
Here's a snippet of a GTF that is output from TALON:
chr1 HAVANA gene 3206523 3215632 . - . gene_id "ENSMUSG00000051951.5"; gene_name "Xkr4"; gene_status "KNOWN"; gene_type "protein_coding"; talon_gene "3"; havana_gene "OTTMUSG00000026353.2"; level "2";
chr1 HAVANA transcript 3206523 3215632 . - . gene_id "ENSMUSG00000051951.5"; transcript_id "ENSMUST00000159265.1"; gene_name "Xkr4"; gene_status "KNOWN"; gene_type "protein_coding"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Xkr4-202"; talon_gene "3"; talon_transcript "4"; havana_transcript "OTTMUST00000086624.1"; level "2"; source "HAVANA"; transcript_support_level "1";
chr1 HAVANA exon 3213439 3215632 . - . gene_id "ENSMUSG00000051951.5"; transcript_id "ENSMUST00000159265.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Xkr4"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Xkr4-202"; exon_number "1"; exon_id "ENSMUSE00000863980.1"; talon_gene "3"; talon_transcript "4"; talon_exon "6"; exon_status "KNOWN"; level "2"; source "HAVANA";
chr1 HAVANA exon 3206523 3207317 . - . gene_id "ENSMUSG00000051951.5"; transcript_id "ENSMUST00000159265.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Xkr4"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Xkr4-202"; exon_number "2"; exon_id "ENSMUSE00000867897.1"; talon_gene "3"; talon_transcript "4"; talon_exon "8"; exon_status "KNOWN"; level "2"; source "HAVANA";
chr1 HAVANA gene 4492457 4493735 . - . gene_id "ENSMUSG00000025902.13"; gene_name "Sox17"; gene_status "KNOWN"; gene_type "protein_coding"; talon_gene "20"; havana_gene "OTTMUSG00000050014.7"; level "2";
chr1 HAVANA transcript 4492457 4493604 . - . gene_id "ENSMUSG00000025902.13"; transcript_id "ENSMUST00000192505.1"; gene_name "Sox17"; gene_status "KNOWN"; gene_type "protein_coding"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "Sox17-205"; talon_gene "20"; talon_transcript "30"; havana_transcript "OTTMUST00000127268.1"; level "2"; source "HAVANA"; transcript_support_level "NA";
chr1 HAVANA exon 4492457 4493604 . - . gene_id "ENSMUSG00000025902.13"; transcript_id "ENSMUST00000192505.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Sox17"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "Sox17-205"; exon_number "1"; exon_id "ENSMUSE00001339744.1"; talon_gene "20"; talon_transcript "30"; talon_exon "130"; exon_status "KNOWN"; level "2"; source "HAVANA";d
It's pretty close the the GENCODE-style GTFs which we use as our annotation.
As a follow up, I'm sure that following the exact steps (ie converting TALON GTF -> TAMA bed -> FASTA) would be better, but will I still get trustable results by just converting the TALON GTF straight to FASTA as I have already tried?
Thanks!
Ok so you should be able to use the "tama_format_gtf_to_bed12_stringtie.py" script to convert the TALON gtf to TAMA BED12. However, it will remove all information except the gene_id and transcript_id. Also I suspect that you will not be able to convert the GTF to BED12 and use it as input to step 5 (tama_cds_regions_bed_add.py) given that the fasta headers probably do not match.
Could you copy paste a few lines of the input files for step 5 (orf and fasta files)?
If I can see this info, I should be able to modify the "tama_cds_regions_bed_add.py" script so that you can just convert the original GTF to BED12 using "tama_format_gtf_to_bed12_stringtie.py" and then use that as input in step 5.
Note that you will lose the other info from the TALON GTF in the output BED12 file because of the way I have the info organized for the CDS predictions. However, I can create another script for you to combine all the info in the end.
Cheers, Richard
I'll give the stringtie GTF converter a try when I get the chance.
My transcript fasta looks like this, where each header is the transcript ID.
>ENCODEHT000142197
TTACTAAATGTTTATAAATACATCTAGACAACCTGGCCCTTTCTAATCCCTAAGGACTTTGTATGGCTGA
TCTCATTTTTCAGAGCCATGTAGAGTCCCCAATTCAGCCCTGGTGAGGCCAACTGATCCTCTGCTGGACA
ATAAGTAACAGGTATTATTAAGATGAGATTTGGAATGGTTTAACTGGGTTTTTATACCATTTCTAACACA
ACATCTTTGCAACAGAAGAATGTGGAATGGTGTTTCTGCTCAGAACTCTTGAGCTGATTCATCTCCATAT
TATAAATGTATTTTAAATGTTTCAGTGAAGTGCTTGTTCAAAGGCTTCAGCACCATATTGAAATATGGCT
ACTTGCATAAATAGGATACACAGATGTGAGCTCCACAAAGGCTTTAAAATGTGGAGAGATGGTAGGAAAA
ACAACCTCAGCGAAGATCCTGTACAACTATTTCAAAACAATGTCTATTTCCTTGCCTAAATGACTAAACT
AAGTAGGGTTATAAACACTTGTAATTTTTGAGTCATAGAGTCAATATTATAAAATAAGAATGCACCATTG
AACTGTGCTAACAGTAGGGGAAATATGTGAATGATGTGGCTCCATTCAGTACTCTTATTTAATAACTAAA
My ORF fasta as output from TAMA looks like this:
>ENCODEHT000142197:F1:1548:1772:517:590:74:M
MKDTDTKRKIFYFLKELFPSLFMSTILLIYYLHKTVFMNLMSNSSFRPHLKPLSDQWGAVENSIVSLNENLAST
>ENCODEHT000142197:F3:1856:2011:619:669:51:M
MQRDRYEDGFLFSVRRDFLFSVREPEMTIFFEAKQYSGFKAKTNMLLKMRE
>ENCODEHT000142197:F2:1264:1413:422:470:49:M
MVQVPTTCHMHRAAQSNIILLGGKRVESLFSGFYSTWNPLEPSMKIKVY
I'm not too worried about losing information as it should be easily obtainable as long as I have the transcript IDs.
Could you copy paste the output file from step 4?
It should look something like this:
G1;G1.1 F1 0 719 1 239 R4GL70 90_match 99 32 G1;G1.2 F1 369 596 124 198 R4GIW7 50_match 77 66 G1;G1.3 F1 0 386 1 128 H9KZN5 90_match 91 65
Here are the first 10 lines:
ENSMUST00000159265.1 F1 603 830 202 276 none no_hit 0 0
ENSMUST00000192505.1 F1 198 572 67 190 ENSMUSP00000141674.1 50_match 68 91
ENSMUST00000191647.1 F1 93 215 32 71 none no_hit 0 0
ENSMUST00000191939.1 F3 2 205 1 67 none no_hit 0 0
ENSMUST00000130201.7 F1 0 647 1 215 ENSMUSP00000114649.1 full_match 100 100
ENSMUST00000156816.6 F3 2 949 1 315 ENSMUSP00000115512.1 full_match 100 100
ENSMUST00000146665.2 F3 2 568 1 188 ENSMUSP00000141204.1 full_match 100 100
ENSMUST00000027036.10 F2 1 783 1 260 ENSMUSP00000027036.4 full_match 100 100
ENSMUST00000115529.7 F1 0 593 1 197 ENSMUSP00000111191.1 full_match 100 100
Just closing this thread as we came up with a solution via email.
Basically just use the "tama_format_gtf_to_bed12_ensembl.py" script to convert the TALON GTF to TAMA Bed12.
Hi there,
I'm trying to use the TAMA ORF and NMD tools on TALON output. I've successfully run everything on the manual up through step 4. I had to substitute step 1 however for converting my TALON GTF into a FASTA file, as opposed to the TAMA bed file. This means that I do not have the aforementioned bed file to use as input for step 5 or
tama_cds_regions_bed_add.py
.I'm wondering if I can either get some guidance on if 1) this was the right approach and / or 2) if there's some way I can get step 5 running from the data that I currently have.
Below is the code I've used to run TAMA so far: