GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0

How is it possible to split gene models created from adjacent, slightly overlapping transcripts? #25

Closed. Opened by HegedusB, closed 4 years ago.

HegedusB commented 4 years ago

Hi Richard,

You mentioned in one of your answers that it is possible to split the gene models created from adjacent, slightly overlapping transcripts. Can you explain it in more detail? I am really curious about it!

Thanks a lot! Cheers, Botond

GenomeRIK commented 4 years ago

Hi Botond,

Thank you for starting this thread. I will probably use parts of this thread for my blog if that's OK.

So basically you use the ORF/NMD pipeline to match each transcript model to known proteins. If you have a model species with a decent annotation, you could also match to the CDS predicted by that annotation, which would provide better biological information.

https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-ORF-and-NMD-predictions

So if there is an Ensembl annotation for a species that is somewhat close to the species you are working with, you can download the predicted peptide sequences from Ensembl and then use those as the protein database for the Blastp step.
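For example, a minimal sketch of that Blastp step (the file names and parameters here are placeholders/assumptions, not required settings of the pipeline) could look like this:

```
# Build a protein BLAST database from the Ensembl peptide FASTA (placeholder file names)
makeblastdb -in ensembl_peptides.fa -dbtype prot -out ensembl_pep_db

# Blastp the predicted ORF amino acid sequences against it, keeping the default
# pairwise output (which, as discussed further down, is what the parser expects)
blastp -query transcript_orfs.fa -db ensembl_pep_db -evalue 1e-10 -num_threads 8 -out transcript_orfs_blastp.out
```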

After you run the ORF/NMD pipeline you should have a bed file with the ORF/NMD annotations added to the 4th column.

You then need to use the "tama_format_id_filter.py" tool to pull the gene and transcript IDs from the Ensembl annotation. https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Formatting

You can also do this with any source protein database; you just need to reformat the 4th column using the tool's custom settings for field parsing.

What this reformatting does is make the protein database IDs the primary gene IDs. This means that in the situations with overlapping genes that you were talking about, you can differentiate the genes based on their coding sequences.

The ORF/NMD pipeline will also give you information that you can use to filter out erroneous models due to mis-alignment.

There is a lot to this so if you have more questions just let me know and I will try to explain in more detail.

Thank you, Richard

HegedusB commented 4 years ago

Hi Richard,

This is a really interesting approach, thanks for sharing it with me. I tried it out with a small test BLAST database (it only contained the examined strain's reference proteome) because I do not know how long the BLAST will take (I have lots of isoforms). I reached the formatting step, but there I got stuck. My reference proteome's protein IDs are far from the Ensembl standard, for example: Copci_AmutBmut1_323571 (StrainName_proteinID). Can you explain in more detail how I can use the "tama_format_id_filter.py" program in such a case?

And a question: what do you think, how can this approach handle the cases where the reference protein was produced badly by fusing two adjacent genes? In that case it is worse to use the BLAST-derived gene ID than the original TAMA gene ID. I attached an example. Could it be better to only separate those gene IDs where the BLAST result indicates it (multiple different perfect hits)?

[attached image: example of the fused gene model case described above]

Thanks for your help again! Botond

GenomeRIK commented 4 years ago

Hi Botond,

Apologies for the delayed response.

For the formatting you can use the custom parameters of "tama_format_id_filter.py" (https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Formatting). So for your bed file you can do this:

python tama_format_id_filter.py -b bed_file -o output_file -s custom -r 3,2,1 -d ";"

So that would make the protein ID the gene ID and retain the TAMA transcript ID.
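Purely as an illustration of what that custom parsing does (the field order shown here is an assumption for the sketch, not necessarily the actual layout of your ORF/NMD bed 4th column), reordering a ";"-delimited field string as 3,2,1 behaves roughly like this:

```
# Hypothetical 4th-column value: gene_id;transcript_id;protein_id (assumed order, for illustration only)
echo 'G1;G1.1;Copci_AmutBmut1_323571' | awk -F';' -v OFS=';' '{print $3, $2, $1}'
# prints: Copci_AmutBmut1_323571;G1.1;G1
```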

I need to make changes to that code to allow for more advanced filtering. Similar to what you said, I can make it so that the protein IDs are only inherited if the match is at a certain level. So this would solve the issue you see in that picture.

However, I do not understand why the protein database you are using has the same IDs for different peptide sequences. Where did you get that from? If the IDs are not unique to the peptide sequences then it makes it difficult to work with the data. Basically you would have to reassign unique IDs.
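If it did turn out that peptide IDs were duplicated, a quick sketch (not a TAMA tool, and the file names are placeholders) for making them unique before building the BLAST database could be:

```
# Append a running count to any repeated FASTA header ID so every peptide ID becomes unique
awk '/^>/ {count[$1]++; if (count[$1] > 1) $1 = $1 "_" count[$1]} {print}' proteins.fa > proteins_unique_ids.fa
```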

Cheers, Richard

HegedusB commented 4 years ago

Hi Richard,

Thanks for the tama_format_id_filter.py command, I will try it. In the database I used, the protein IDs are unique. Can you tell me why you think the IDs are redundant?

Just a suggestion: maybe the blastp tabular format output would be a better choice because it is a more compact format, easier to check and smaller in file size.

I am looking forward to trying out the new version. Cheers, Botond

GenomeRIK commented 4 years ago

Hi Botond,

> In the database I used, the protein IDs are unique. Can you tell me why you think the IDs are redundant?

Sorry, maybe I misunderstood the picture. I thought the blue models at the top were from your protein database? For those blue models it looks like the two long models have the same ID despite being different transcript models. Is this the case?

> Just a suggestion: maybe the blastp tabular format output would be a better choice because it is a more compact format, easier to check and smaller in file size.

Yeah, I wanted to use the tab format, but with the tab format there is no way to get detailed alignment information. So in order to do more advanced alignment assessment I need the full alignment information from the standard output. It is annoying that there isn't an output format that is both compact and has the detailed alignment information. However, if you know of an output format option that provides this, I could definitely rewrite the parser for it.
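For reference, the two Blastp output styles being compared here (using standard NCBI BLAST+ options; file names are placeholders) are:

```
# Tabular output: compact, one summary line per hit, but no alignment blocks
blastp -query transcript_orfs.fa -db ensembl_pep_db -outfmt 6 -out hits.tsv

# Default pairwise output: larger files, but contains the full alignments the parser uses
blastp -query transcript_orfs.fa -db ensembl_pep_db -outfmt 0 -out hits.txt
```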

As for the new version, I am not sure how long it will take to produce as I am working on some other things right now. However, if you tell me the features you need right now, I might be able to put something together for your needs and then refine it later.

Thank you, Richard

GenomeRIK commented 4 years ago

Hi Botond,

I am going to close this issue thread now, but if you have more questions I can reopen it later.

Thank you, Richard