Running TAMA ORF + NMD prediction from non-TAMA output

fairliereese commented 2 years ago

Hi there,

I'm trying to use the TAMA ORF and NMD tools on TALON output. I've successfully run everything on the manual up through step 4. I had to substitute step 1 however for converting my TALON GTF into a FASTA file, as opposed to the TAMA bed file. This means that I do not have the aforementioned bed file to use as input for step 5 or tama_cds_regions_bed_add.py.

I'm wondering if I can either get some guidance on if 1) this was the right approach and / or 2) if there's some way I can get step 5 running from the data that I currently have.

Below is the code I've used to run TAMA so far:

gtf=~/mortazavi_lab/data/old_mousewg/brain/talon/brain_talon_observedOnly.gtf
opref=~/mortazavi_lab/data/old_mousewg/brain/tama/brain
fasta=${opref}.fa
ref=~/mortazavi_lab/ref/mm10/mm10_sirv4.fasta

# convert gtf to fasta
gffdir=~/mortazavi_lab/bin/gffread/
${gffdir}./gffread  \
  -w $fasta \
  -g $ref \
  $gtf

# call orfs / nmd with tama
tamadir=~/mortazavi_lab/bin/tama/tama_go/orf_nmd_predictions/
orf=${opref}_orf.fa
python ${tamadir}tama_orf_seeker.py \
  -f ${fasta} \
  -o ${orf}

# blast protein sequences against known protein sequences
blastdir=~/mortazavi_lab/bin/ncbi-blast-2.12.0+/bin/
ref_pc=~/mortazavi_lab/ref/gencode.vM21/gencode.vM21.pc_translations
out=${opref}_blastp.out
${blastdir}./blastp \
  -evalue 1e-10 \
  -num_threads 32 \
  -db ${ref_pc} \
  -query ${orf} > \
  ${out}

blastp=$out
out=${opref}_blastp_parsed.tsv

# parse blastp output
python ${tamadir}tama_orf_blastp_parser.py \
  -b ${blastp} \
  -o ${out}

GenomeRIK commented 2 years ago

Hi Fairlie!

Thanks for trying out TAMA!

I think all we need here is a convertor from TALON GTF to TAMA BED. Sorry but could you send a copy paste of a few lines from the TALON GTF you are using as input?

I have some GTF/GFF to BED convertors so one of them might work already or I can modify one to work with TALON.

Thank you, Richard

fairliereese commented 2 years ago

Here's a snippet of a GTF that is output from TALON:

chr1    HAVANA  gene    3206523 3215632 .   -   .   gene_id "ENSMUSG00000051951.5"; gene_name "Xkr4"; gene_status "KNOWN"; gene_type "protein_coding"; talon_gene "3"; havana_gene "OTTMUSG00000026353.2"; level "2";
chr1    HAVANA  transcript  3206523 3215632 .   -   .   gene_id "ENSMUSG00000051951.5"; transcript_id "ENSMUST00000159265.1"; gene_name "Xkr4"; gene_status "KNOWN"; gene_type "protein_coding"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Xkr4-202"; talon_gene "3"; talon_transcript "4"; havana_transcript "OTTMUST00000086624.1"; level "2"; source "HAVANA"; transcript_support_level "1";
chr1    HAVANA  exon    3213439 3215632 .   -   .   gene_id "ENSMUSG00000051951.5"; transcript_id "ENSMUST00000159265.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Xkr4"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Xkr4-202"; exon_number "1"; exon_id "ENSMUSE00000863980.1"; talon_gene "3"; talon_transcript "4"; talon_exon "6"; exon_status "KNOWN"; level "2"; source "HAVANA";
chr1    HAVANA  exon    3206523 3207317 .   -   .   gene_id "ENSMUSG00000051951.5"; transcript_id "ENSMUST00000159265.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Xkr4"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Xkr4-202"; exon_number "2"; exon_id "ENSMUSE00000867897.1"; talon_gene "3"; talon_transcript "4"; talon_exon "8"; exon_status "KNOWN"; level "2"; source "HAVANA";
chr1    HAVANA  gene    4492457 4493735 .   -   .   gene_id "ENSMUSG00000025902.13"; gene_name "Sox17"; gene_status "KNOWN"; gene_type "protein_coding"; talon_gene "20"; havana_gene "OTTMUSG00000050014.7"; level "2";
chr1    HAVANA  transcript  4492457 4493604 .   -   .   gene_id "ENSMUSG00000025902.13"; transcript_id "ENSMUST00000192505.1"; gene_name "Sox17"; gene_status "KNOWN"; gene_type "protein_coding"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "Sox17-205"; talon_gene "20"; talon_transcript "30"; havana_transcript "OTTMUST00000127268.1"; level "2"; source "HAVANA"; transcript_support_level "NA";
chr1    HAVANA  exon    4492457 4493604 .   -   .   gene_id "ENSMUSG00000025902.13"; transcript_id "ENSMUST00000192505.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Sox17"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "Sox17-205"; exon_number "1"; exon_id "ENSMUSE00001339744.1"; talon_gene "20"; talon_transcript "30"; talon_exon "130"; exon_status "KNOWN"; level "2"; source "HAVANA";d

It's pretty close the the GENCODE-style GTFs which we use as our annotation.

As a follow up, I'm sure that following the exact steps (ie converting TALON GTF -> TAMA bed -> FASTA) would be better, but will I still get trustable results by just converting the TALON GTF straight to FASTA as I have already tried?

Thanks!

GenomeRIK commented 2 years ago

Ok so you should be able to use the "tama_format_gtf_to_bed12_stringtie.py" script to convert the TALON gtf to TAMA BED12. However, it will remove all information except the gene_id and transcript_id. Also I suspect that you will not be able to convert the GTF to BED12 and use it as input to step 5 (tama_cds_regions_bed_add.py) given that the fasta headers probably do not match.

Could you copy paste a few lines of the input files for step 5 (orf and fasta files)?

If I can see this info, I should be able to modify the "tama_cds_regions_bed_add.py" script so that you can just convert the original GTF to BED12 using "tama_format_gtf_to_bed12_stringtie.py" and then use that as input in step 5.

Note that you will lose the other info from the TALON GTF in the output BED12 file because of the way I have the info organized for the CDS predictions. However, I can create another script for you to combine all the info in the end.

Cheers, Richard

fairliereese commented 2 years ago

I'll give the stringtie GTF converter a try when I get the chance.

My transcript fasta looks like this, where each header is the transcript ID.

>ENCODEHT000142197
TTACTAAATGTTTATAAATACATCTAGACAACCTGGCCCTTTCTAATCCCTAAGGACTTTGTATGGCTGA
TCTCATTTTTCAGAGCCATGTAGAGTCCCCAATTCAGCCCTGGTGAGGCCAACTGATCCTCTGCTGGACA
ATAAGTAACAGGTATTATTAAGATGAGATTTGGAATGGTTTAACTGGGTTTTTATACCATTTCTAACACA
ACATCTTTGCAACAGAAGAATGTGGAATGGTGTTTCTGCTCAGAACTCTTGAGCTGATTCATCTCCATAT
TATAAATGTATTTTAAATGTTTCAGTGAAGTGCTTGTTCAAAGGCTTCAGCACCATATTGAAATATGGCT
ACTTGCATAAATAGGATACACAGATGTGAGCTCCACAAAGGCTTTAAAATGTGGAGAGATGGTAGGAAAA
ACAACCTCAGCGAAGATCCTGTACAACTATTTCAAAACAATGTCTATTTCCTTGCCTAAATGACTAAACT
AAGTAGGGTTATAAACACTTGTAATTTTTGAGTCATAGAGTCAATATTATAAAATAAGAATGCACCATTG
AACTGTGCTAACAGTAGGGGAAATATGTGAATGATGTGGCTCCATTCAGTACTCTTATTTAATAACTAAA

My ORF fasta as output from TAMA looks like this:

>ENCODEHT000142197:F1:1548:1772:517:590:74:M
MKDTDTKRKIFYFLKELFPSLFMSTILLIYYLHKTVFMNLMSNSSFRPHLKPLSDQWGAVENSIVSLNENLAST
>ENCODEHT000142197:F3:1856:2011:619:669:51:M
MQRDRYEDGFLFSVRRDFLFSVREPEMTIFFEAKQYSGFKAKTNMLLKMRE
>ENCODEHT000142197:F2:1264:1413:422:470:49:M
MVQVPTTCHMHRAAQSNIILLGGKRVESLFSGFYSTWNPLEPSMKIKVY

I'm not too worried about losing information as it should be easily obtainable as long as I have the transcript IDs.

GenomeRIK commented 2 years ago

Could you copy paste the output file from step 4?

It should look something like this:

G1;G1.1 F1 0 719 1 239 R4GL70 90_match 99 32 G1;G1.2 F1 369 596 124 198 R4GIW7 50_match 77 66 G1;G1.3 F1 0 386 1 128 H9KZN5 90_match 91 65

fairliereese commented 2 years ago

Here are the first 10 lines:

ENSMUST00000159265.1    F1      603     830     202     276     none    no_hit  0       0
ENSMUST00000192505.1    F1      198     572     67      190     ENSMUSP00000141674.1    50_match        68      91
ENSMUST00000191647.1    F1      93      215     32      71      none    no_hit  0       0
ENSMUST00000191939.1    F3      2       205     1       67      none    no_hit  0       0
ENSMUST00000130201.7    F1      0       647     1       215     ENSMUSP00000114649.1    full_match      100     100
ENSMUST00000156816.6    F3      2       949     1       315     ENSMUSP00000115512.1    full_match      100     100
ENSMUST00000146665.2    F3      2       568     1       188     ENSMUSP00000141204.1    full_match      100     100
ENSMUST00000027036.10   F2      1       783     1       260     ENSMUSP00000027036.4    full_match      100     100
ENSMUST00000115529.7    F1      0       593     1       197     ENSMUSP00000111191.1    full_match      100     100

GenomeRIK commented 2 years ago

Just closing this thread as we came up with a solution via email.

Basically just use the "tama_format_gtf_to_bed12_ensembl.py" script to convert the TALON GTF to TAMA Bed12.

GenomeRIK / tama

Running TAMA ORF + NMD prediction from non-TAMA output #63