Open ifiddes opened 6 years ago
One important question here is that there are really two kinds of external transcripts: those that are known orthologs, and those that are predictions. The first kind poses a challenge to how CAT works, as they would need to be integrated outside of consensus finding, but should probably still undergo parental gene assignment. The second kind should be super easy: just add a section to the config file and have them follow the same code paths as the CGP/PB outputs.
Both would be good I guess, but in the present case, I am thinking about the former primarily (curated info which may include known orthologs captured by CAT). Do you have a way to give different weights to different pieces of evidence? I mean, during consensus finding, how is one prediction favored over another one in case of conflicts?
Right now, the weights are defined relative to the reference annotation set. For projection transcripts (transMap, TM(R)), this means fidelity to the reference and RNA-seq/IsoSeq support. Once those are settled, predictions are incorporated if they provide novel information that is also supported.
What do you consider a conflict? Right now the only conflict considered by CAT is gene family collapse, i.e. ortholog-resolved projections that end up in the same locus. In that case, the highest scoring gene is kept, the others are removed, and their names are recorded in the winner's tags.
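To make the collapse rule above concrete, here is a minimal sketch. It is not CAT's actual code: the dict layout, the `score` field, and the `collapsed_genes` tag name are all assumptions made for illustration.

```python
def collapse_locus(genes):
    """Keep the highest scoring gene in a locus and record the others in its tags.

    genes: list of dicts with 'name' and 'score' keys (hypothetical layout),
    all assumed to have been projected into the same locus.
    """
    winner = max(genes, key=lambda g: g["score"])
    losers = sorted(g["name"] for g in genes if g is not winner)
    result = dict(winner)
    # 'collapsed_genes' is a made-up tag name, not necessarily what CAT emits.
    result["tags"] = dict(winner.get("tags", {}), collapsed_genes=",".join(losers))
    return result
```

On a tie, `max` keeps the first gene in the list, so the input order would matter in that case.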
If you are interested in the former case, I think it is still not too hard. What I would do is add a step after consensus finding that takes an external annotation set and performs the same logic as parental gene assignment, which looks for transcripts with overlapping CDS on the same strand. Those would then be considered the same gene, and can be checked for duplicate CDS intervals and for novel coding splices relative to the projection. I can then propagate all the original information and tag the projection set accordingly (X transcripts were added from gene Y in annotation set Z). If an external transcript doesn't overlap a CAT gene, it just gets added and given a new unique identifier, while keeping whatever information it had.
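Roughly, the overlap step described above could look like the sketch below. This is only an illustration under stated assumptions: the `Transcript` record, the half-open CDS intervals, and the `ext_gene_N` naming are hypothetical, and the real parental gene assignment code may work differently.

```python
from collections import namedtuple

# Hypothetical minimal record; cds is a list of half-open (start, end) intervals.
Transcript = namedtuple("Transcript", "name gene chrom strand cds")


def cds_overlap(a, b):
    """True if two transcripts share at least one coding base on the same strand."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return False
    return any(s1 < e2 and s2 < e1 for s1, e1 in a.cds for s2, e2 in b.cds)


def assign_external(cat_txs, external_txs):
    """Map each external transcript name to the CAT gene ids it overlaps.

    External transcripts with no same-strand CDS overlap get a new,
    made-up identifier of the form ext_gene_N.
    """
    assignments = {}
    novel = 0
    for ext in external_txs:
        genes = sorted({tx.gene for tx in cat_txs if cds_overlap(tx, ext)})
        if not genes:
            novel += 1
            genes = ["ext_gene_{}".format(novel)]
        assignments[ext.name] = genes
    return assignments
```

The all-against-all comparison is quadratic; a real implementation would presumably bucket transcripts by chromosome or use an interval index first.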
I should probably also just propagate the external set entirely to the assemblyHub as its own track.
What I mean by conflict: for some gene families, I have noticed that Augustus does not do a good job at predicting the correct gene structure (trying to re-train did not help). Augustus-PPX (which uses protein-profiles) is doing a better job, yet not perfect, so I ended up doing manual curation using RNA-Seq evidence. So, in those cases, I would like to give more weight to those.
Here is an example:
I see. Augustus does have a problem with falsely splitting or fusing genes (this leads to inflated protein gene estimates in MAKER annotations).
I think maybe the approach here is to provide users with two ways to input annotation sets: one is 'high confidence', and will always be included, and the other is 'predictions', which will be treated the same as CGP/PB. For 'high confidence', in a case like the one above, CAT would identify that there are overlapping genes, combine them with the high confidence gene, then throw out any that are not supported by RNA-seq. It would also add some useful tags saying that this happened.
That would be fantastic. We have the same issue in plants, especially with genes in head-to-head orientation, sharing the same promoter and with an intergenic distance <1kb. We maintain a set of 13k manually checked genes from 65 strains and they could directly be used there.
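As a rough sketch of the 'high confidence' path proposed above (nothing here exists in CAT yet; the dict fields, the `rnaseq_supported` flag, and the tag names are assumptions for illustration only):

```python
def merge_high_confidence(hc_gene, overlapping_txs):
    """Fold overlapping CAT transcripts into a high confidence gene.

    hc_gene: dict with 'name' and 'transcripts' (list of transcript names).
    overlapping_txs: list of dicts with 'name', 'gene' and a boolean
    'rnaseq_supported' (all hypothetical field names).
    The high confidence transcripts are always kept; overlapping CAT
    transcripts are kept only if they have RNA-seq support.
    """
    kept = [tx["name"] for tx in overlapping_txs if tx["rnaseq_supported"]]
    dropped = [tx["name"] for tx in overlapping_txs if not tx["rnaseq_supported"]]
    return {
        "name": hc_gene["name"],
        "transcripts": list(hc_gene["transcripts"]) + kept,
        # Made-up tag names recording what was merged and what was discarded.
        "tags": {
            "merged_genes": sorted({tx["gene"] for tx in overlapping_txs}),
            "dropped_transcripts": dropped,
        },
    }
```

In the split-gene case from the example, the two Augustus fragments would both show up in `merged_genes`, and whichever of their transcripts lack RNA-seq support would end up in `dropped_transcripts`.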
It would be very useful to have something like parent_gene_assignment.py take aligned transcripts and incorporate them into an existing annotation as you suggested.
Would this require the new transcript to have a defined CDS, or would an aligned IsoSeq transcript (with UTRs and potentially multiple ORFs) be fine?
For issue #54.