Closed eboileau closed 1 year ago
Actually, part of the problem stems from the way rpbp handles a de novo annotation, by merging/concatenating the files. Within each of the annotated and de novo files, we shouldn't find any duplicate entries, but transcript ids can be the same in both files even though the transcript structure differs. We do find ~85 of these. We also need to revise the numbering of ORFs in this case.
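To illustrate the kind of check described above, here is a minimal sketch that lists transcript ids shared by both annotations but with a different exon structure. This is not rpbp code: the function name `conflicting_transcript_ids`, the toy data, and the representation of a transcript as a list of `(start, end)` exon intervals are all assumptions made for the example.

```python
# Hypothetical helper, not part of rpbp: each annotation is assumed to be a
# dict mapping a transcript id to its list of (start, end) exon intervals.
def conflicting_transcript_ids(annotated, de_novo):
    """Return ids present in both annotations whose exon structures differ."""
    shared = annotated.keys() & de_novo.keys()
    return sorted(
        tid for tid in shared
        if sorted(annotated[tid]) != sorted(de_novo[tid])
    )


# Toy data (made up): T1 has the same id in both files but a different structure.
annotated = {"T1": [(100, 200), (300, 400)], "T2": [(50, 80)]}
de_novo = {"T1": [(100, 250)], "T2": [(50, 80)], "T3": [(10, 20)]}

conflicts = conflicting_transcript_ids(annotated, de_novo)
print(conflicts)
```

A check like this could be run on the de novo annotation before merging, so that the conflicting transcripts can be renamed or removed.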
Ok, for the duplicate ORF ids, we added an exception. The responsibility lies with the user to clean their de novo annotation, and make sure "problematic" transcripts are removed.
For the biotype and related info that is added to the app, we don't want to do complicated wrangling. This is a problem only when adding de novo annotations whose assigned transcript ids are identical to the annotated ones. In this case, we favour the annotated one (duplicates are simply removed in default order). It is important to note that this only affects visualisation (ORF table).
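As a rough sketch of "favour the annotated one", the idea is to put the annotated records first and keep only the first occurrence of each transcript id. The function name `merge_favouring_annotated`, the `transcript_id` and `biotype` field names, and the toy records are assumptions for illustration, not the actual implementation in the app.

```python
# Hypothetical helper, not part of rpbp: records are assumed to be dicts
# with a "transcript_id" key and a "biotype" key.
def merge_favouring_annotated(annotated_rows, de_novo_rows, key="transcript_id"):
    """Concatenate both lists and keep the first record seen for each id.

    Annotated records come first, so they win whenever an id is duplicated.
    """
    seen = set()
    merged = []
    for row in [*annotated_rows, *de_novo_rows]:
        if row[key] in seen:
            continue  # id already taken by an earlier (annotated) record
        seen.add(row[key])
        merged.append(row)
    return merged


# Toy records (made up): T1 appears in both annotations.
annotated_rows = [{"transcript_id": "T1", "biotype": "protein_coding"}]
de_novo_rows = [
    {"transcript_id": "T1", "biotype": "de_novo"},
    {"transcript_id": "T9", "biotype": "de_novo"},
]

merged = merge_favouring_annotated(annotated_rows, de_novo_rows)
print(merged)
```

With tabular data, the same effect would typically be obtained by concatenating the annotated table first and dropping duplicated ids while keeping the first occurrence.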
Description
While preparing a large compendium of human ORFs with a de novo assembly, we noted that summarize_rpbp_predictions.py does not handle the information correctly.