Closed noor-albader closed 2 years ago
Hello Noor,
Thank you for using TAMA!
I am not sure I completely understand your question so correct me if I am wrong in my interpretation.
The gene start and end info in the "prefix_gene_report.txt" file shows the outer limits of the gene region as defined across all transcript models within that gene. So it takes the the most upstream start and the most downstream end if that makes sense? This is why no transcripts in the model have those exact coordinates.
If you want to compare the presence/absence of genes between samples you just need to look at the 4th column/field in the "prefix_gene_report.txt" file. From your example above, these samples have presence of gene G1 "0h-Leaf,0h-Rhizome,12h-Leaf,24h-Root,72h-Leaf,72h-Rhizome,72h-Root". All the omitted samples do not have that gene present.
Let me know if this makes sense or if you have more questions about this.
Thank you, Richard
Hello Richard, Thanks for the quick reply! I see that the gene models in gene report doesn't represent a gene model per say but just the outer most limits of any given transcript model to show absence/presence of any given loci, got it! I will use the 5th, 6th, 7th, and 1st column to create a bed file of genes.
I had a couple of small technical questions about the TAMA_GO ORF and NMD predictions pipeline: I was also wondering why ORF finder (tama_orf_seeker.py) increases the number of total transcripts I have? Is this because of different frame shifts per transcript model?
Does the blastp output format matter for the tama_orf_blastp_parser.py?
Does the blast parser ( tama_orf_blastp_parser.py) choose the top hit per transcript frame created by the orf finder? The last question is important so users can use the unique transcript IDs to the source sample.
Thank you, Noor
Hello Noor,
I am not sure I understand why you are trying to make a bed file for the genes? I am assuming you mean a typical 4 column bed and not the bed12 format? If you want to compare on gene level then you already have the information you need in the gene report file. Is there something else you are trying to see?
Regarding you other questions:
I was also wondering why ORF finder (tama_orf_seeker.py) increases the number of total transcripts I have? Is this because of different frame shifts per transcript model?
Yes the output file represents the different frames that are possible for each transcript.
Does the blastp output format matter for the tama_orf_blastp_parser.py?
Yes this matters a lot. The standard "human readable" output is necessary because it is the only one that contains the actual alignment which TAMA uses to run it's own assessment on hit scoring.
Does the blast parser ( tama_orf_blastp_parser.py) choose the top hit per transcript frame created by the orf finder? The last question is important so users can use the unique transcript IDs to the source sample.
Yes that is how it works.
Cheers, Richard
Hi Richard, Thank you for your replies. You can close the ticket. Best, Noor
Hi Richard,
Thank you for creating TAMA tools for ISO-Seq analyses!
I'm interested in the gene models reported in the output file prefix_gene_report.txt (after running the tama_merge tool). Im wondering if its possible to obtain just the gene models reported in prefix_gene_report.txt in a bed12 or fasta file.
I'll use G1 as an example why I would need a bed12 for the gene models reported in prefix.gene_report.txt G1 in my study has 5 transcript models that come from different samples; G1 rows in my prefix.bed:
In my prefix_gene_report.txt, the start and end coordinated reported was:
Note how G1 on Chrk01 starts from 538 and ends on 4439 which doesn't align to any given transcript model in my bed file, which is fine except for the fact that I would like to investigate the gene models from gene_report.txt to compare the presence/absence of genes between samples of my study (prior to investigating transcript models).
any ideas? Thank you, Noor