GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0
128 stars 25 forks source link

Can TAMA create a bed12 file for just gene models represented gene_report.txt? #64

Closed noor-albader closed 2 years ago

noor-albader commented 2 years ago

Hi Richard,

Thank you for creating TAMA tools for ISO-Seq analyses!

I'm interested in the gene models reported in the output file prefix_gene_report.txt (after running the tama_merge tool). Im wondering if its possible to obtain just the gene models reported in prefix_gene_report.txt in a bed12 or fasta file.

I'll use G1 as an example why I would need a bed12 for the gene models reported in prefix.gene_report.txt G1 in my study has 5 transcript models that come from different samples; G1 rows in my prefix.bed:

ChrK01  538     4406    G1;G1.1 40      +       538     4406    200,0,255       6       508,233,72,86,78,1036   0,757,1104,1687,2621,2832
ChrK01  588     4439    G1;G1.2 40      +       588     4439    255,200,0       6       458,233,72,86,78,1069   0,707,1054,1637,2571,2782
ChrK01  1236    4406    G1;G1.3 40      +       1236    4406    200,0,255       5       292,72,86,78,1036       0,406,989,1923,2134
ChrK01  2425    4406    G1;G1.4 40      +       2425    4406    200,0,255       2       468,1247        0,734
ChrK01  3001    4406    G1;G1.5 40      +       3001    4406    200,0,255       2       236,1036        0,369

In my prefix_gene_report.txt, the start and end coordinated reported was:

gene_id num_clusters    num_final_trans sources chrom   start   end     source_genes    source_summary
G1      9       5       0h-Leaf,0h-Rhizome,12h-Leaf,24h-Root,72h-Leaf,72h-Rhizome,72h-Root      ChrK01  538     4439    0h-Leaf_G1,0h-Rhizome_G1,12h-Leaf_G1,24h-Root_G1,72h-Leaf_G1,72h-Rhizome_G1,72h-Root_G1 0h-Leaf:1,0h-Rhizome:1,12h-Leaf:1,24h-Root:1,72h-Leaf:1,72h-Rhizome:1,72h-Root:1

Note how G1 on Chrk01 starts from 538 and ends on 4439 which doesn't align to any given transcript model in my bed file, which is fine except for the fact that I would like to investigate the gene models from gene_report.txt to compare the presence/absence of genes between samples of my study (prior to investigating transcript models).

any ideas? Thank you, Noor

GenomeRIK commented 2 years ago

Hello Noor,

Thank you for using TAMA!

I am not sure I completely understand your question so correct me if I am wrong in my interpretation.

The gene start and end info in the "prefix_gene_report.txt" file shows the outer limits of the gene region as defined across all transcript models within that gene. So it takes the the most upstream start and the most downstream end if that makes sense? This is why no transcripts in the model have those exact coordinates.

If you want to compare the presence/absence of genes between samples you just need to look at the 4th column/field in the "prefix_gene_report.txt" file. From your example above, these samples have presence of gene G1 "0h-Leaf,0h-Rhizome,12h-Leaf,24h-Root,72h-Leaf,72h-Rhizome,72h-Root". All the omitted samples do not have that gene present.

Let me know if this makes sense or if you have more questions about this.

Thank you, Richard

noor-albader commented 2 years ago

Hello Richard, Thanks for the quick reply! I see that the gene models in gene report doesn't represent a gene model per say but just the outer most limits of any given transcript model to show absence/presence of any given loci, got it! I will use the 5th, 6th, 7th, and 1st column to create a bed file of genes.

I had a couple of small technical questions about the TAMA_GO ORF and NMD predictions pipeline: I was also wondering why ORF finder (tama_orf_seeker.py) increases the number of total transcripts I have? Is this because of different frame shifts per transcript model?

Does the blastp output format matter for the tama_orf_blastp_parser.py?

Does the blast parser ( tama_orf_blastp_parser.py) choose the top hit per transcript frame created by the orf finder? The last question is important so users can use the unique transcript IDs to the source sample.

Thank you, Noor

GenomeRIK commented 2 years ago

Hello Noor,

I am not sure I understand why you are trying to make a bed file for the genes? I am assuming you mean a typical 4 column bed and not the bed12 format? If you want to compare on gene level then you already have the information you need in the gene report file. Is there something else you are trying to see?

Regarding you other questions:

I was also wondering why ORF finder (tama_orf_seeker.py) increases the number of total transcripts I have? Is this because of different frame shifts per transcript model?

Yes the output file represents the different frames that are possible for each transcript.

Does the blastp output format matter for the tama_orf_blastp_parser.py?

Yes this matters a lot. The standard "human readable" output is necessary because it is the only one that contains the actual alignment which TAMA uses to run it's own assessment on hit scoring.

Does the blast parser ( tama_orf_blastp_parser.py) choose the top hit per transcript frame created by the orf finder? The last question is important so users can use the unique transcript IDs to the source sample.

Yes that is how it works.

Cheers, Richard

noor-albader commented 2 years ago

Hi Richard, Thank you for your replies. You can close the ticket. Best, Noor