GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0
125 stars 24 forks source link

How to get "primary genes" #29

Closed palfalvi closed 4 years ago

palfalvi commented 4 years ago

Hi,

First, thank you for the tool, it is a pleasure to use with both ONT and PB data.

I have generated ~29,000 genes with ~130k transcripts in total from multiple datasets using the TAMA pipeline, including ORF/NMD prediction. Now I have the bed (and gtf) files for the annotations and for most purposes (RNA-seq mapping, promoter analysis) it is perfect.

However, I would like to have a "primary transcript" list, both mRNA and CDS/AA for e.g. gene family tree building. I searched every corner of the internet, and although it sounds easy, there is no definitive answer or tool. The closest I got was using gffread with options -M -K, but it still did not merge all isoforms. I just wonder if you have any solutions for this, it may be helpful for other people, too to have readily available in tama-go (e.g. longest isoform, most abundant isoform or based on the ORF prediction).

Cheers, Gergo

GenomeRIK commented 4 years ago

Hi Gergo!

Thank you for using TAMA and for the kind words!

For the "primary transcript" list, are you looking to have 1 representative transcript model per gene? Or 1 transcript model per unique CDS? I think if I can understand exactly what you are looking for I can advise with something that may already exist in TAMA or something I can add. I actually have a lot of additional "TAMA" tools that I have not uploaded to the repo yet so I could have something that already does what you are asking for.

Thank you! Richard

palfalvi commented 4 years ago

Hey Richard,

Thanks for the quick answer! I am looking for 1 representative transcript (and its corresponding CDS) per gene.

GenomeRIK commented 4 years ago

Hi Gergo,

Ok I was discussing this with someone else and this is not quite a trivial thing to do correctly. So basically no matter how you do it, the selection of the representative transcript will be somewhat arbitrary.

I can, however, modify the "tama_format_id_filter.py" tool: https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Formatting

To allow you to keep only 1 transcript model per gene from your ORF/NMD output bed file using a ranking system based on CDS prediction and protein hit. However, I just want to warn that this does not mean the best transcript model will be selected for each gene as "best" is somewhat subjective. Even Ensembl and NCBI do not reduce their transcriptomes in this way.

So I can make this update for you but I think it would be better to explore how you are building the gene family tree since doing this filtering will likely impart unintended biases to your analyses. I think maybe the issue you are dealing with is how to handle multiple isoforms per gene when building gene family trees. There are ways to do this which I think will likely be a better solution. If you can share more about your end goal with this analysis, I can make suggestions with how to proceed using multiple isoform information.

You can email me to keep your project information private. Please use the email address on the main page of the TAMA repo.

Hope that helps! Richard

palfalvi commented 4 years ago

Hi Richard,

Really appreciate your answer. I understand, that any selection of one transcript per gene is biased, let it be longest, most abundant or having the best hit. Using all transcripts is preferable and luckily a lot of tools can handle it (thinking of RNA-seq or similar), but there are applications, when 1, one might want to perform gene level analysis of the sequence, e.g for evolutionary context between species or 2, computational time and performance could be significantly higher without actual biological significance in the result.

A good example is orthology prediction, e.g with OrthoFinder. They recommend to use the longest transcript and I just found they have a script for that. It can, however run with transcripts, but then have to think the N species x ~30,000 genes vs ~100,000 transcript situations for computation. Also, there will be probably no or way much less “single copy orthologs” for species tree reconstruction and so.

Altogether it is a simple sounding but rather complicated question and I would appreciate any input on this topic.

Cheers, Gergo

PS.: Thank you for the generous offer for email discussion, however I do not think it is necessary at this point as I don’t need to share much details and others could have similar questions so it can be better to leave it for the public.

GenomeRIK commented 4 years ago

Hi Gergo,

I am not sure how OrthoFinder works exactly but one issue I can see happening with taking the longest transcript model as the representative is that the longest model could have erroneous splice junctions calls leading to frame shifts in the ORF. If OrthoFinder relies on the ORF for predictions then this could cause issues. For the case of simplifying Ensembl annotations this is not an issue as all models are verified for the ORF. However, for the long read based annotations (without extensive curation) there can still be many models with this issue depending on how the analysis was performed.

So this is why I suggested using the ORF/NMD outputs to select best transcripts as there is information in there to identify if the representative model contains a true ORF.

However, it could be that OrthoFinder does not use the predicated peptide sequence for their analysis. In this case, the longest transcript model would probably be the best one to use.

Do you know if OrthoFinder uses the predicted peptide sequences?

If they do, then I can make the modifications to the filtering/format tool to allow you to get the "best" transcript model representative per gene.

Cheers, Richard

palfalvi commented 4 years ago

Hi Richard,

OrthoFinder needs only one proteome.fasta file per species as input.

I tried to extract it from the gtf file generated with TAMA using gffread, and select the longest ones per gene, but then there is the issue you mentioned. There were around 180 proteins with internal stop codons (at least with a "." in the AA sequence, I assume that is a stop). If you could provide a script which selects true ORFs after ORF/NMD, maybe with a secondary selection of longest, that could be a better representation of the proteome.

Thank you very much again, it is a lot of help to discuss this issue.

Cheers, Gergo

GenomeRIK commented 4 years ago

Hi Gergo,

Ok I think I understand now.

I made a new tool for you to use: tama_filter_primary_transcripts_orf.py

You can find instructions here: https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Transcript-Filtering

It's in the filter_transcript_models folder.

Let me know if it works for you and if it is doing what you wanted.

Thank you, Richard

palfalvi commented 4 years ago

Hi Richard,

Sorry for the late answer. I used the tama_filter_primary_transcripts_orf.py script and it's working nicely, thank you.

Best, Gergo