Data munging for the app does not handle de novo ORFs as it should

eboileau commented 1 year ago

Description

While preparing a large compendium of human ORFs with a de novo assembly, we noted that summarize_rpbp_predictions.py does not handle the de novo information correctly:

Transcript information (biotype, etc.) is added for both annotated and de novo transcripts, and merged with the ORFs on the transcript id. This creates duplicate entries when a de novo transcript has the same id as an annotated one but a different structure (which we cannot use for merging).
We also noted that some ORFs have an identical id (same transcript, same start-end) but are different, because one comes from the annotated file and the other from the de novo file.
Also, when preparing the IGV input file, this can have consequences such as an ORF being labelled as Novel altCDS, but with the structure of its matching annotated ORF.
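
To illustrate the first point, here is a minimal sketch of how a join on the transcript id alone duplicates an ORF. It is not the code from summarize_rpbp_predictions.py: the records, field names and id format are all hypothetical, and a plain Python join stands in for whatever merge the script actually does.

```python
# Hypothetical, simplified records; field names and id format are illustrative only.
transcripts = [
    {"transcript_id": "T1", "source": "annotated", "biotype": "protein_coding"},
    {"transcript_id": "T1", "source": "de_novo", "biotype": "novel"},
    {"transcript_id": "T2", "source": "annotated", "biotype": "lncRNA"},
]
orfs = [
    {"orf_id": "T1_chr1:100-200:+", "transcript_id": "T1"},
    {"orf_id": "T2_chr1:500-650:-", "transcript_id": "T2"},
]


def merge_on_transcript_id(orfs, transcripts):
    """Inner join of ORFs and transcript information on transcript_id."""
    merged = []
    for orf in orfs:
        for tx in transcripts:
            if tx["transcript_id"] == orf["transcript_id"]:
                merged.append({**orf, **tx})
    return merged


merged = merge_on_transcript_id(orfs, transcripts)

# Two ORFs went in, three rows come out: the T1 ORF is matched once with the
# annotated transcript and once with the de novo transcript sharing its id.
for row in merged:
    print(row["orf_id"], row["source"], row["biotype"])
```

Since the id is the only join key, nothing in the merged rows tells us which of the two structures the ORF actually belongs to.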

eboileau commented 1 year ago

Actually, part of the problem stems from the way rpbp handles a de novo annotation, namely by merging/concatenating the annotated and de novo files. Within each of the annotated and de novo files, we shouldn't find any duplicate entries, but it is possible that e.g. transcript ids are shared between the two files despite the structures being different. We do find ~85 of these. We also need to revise the numbering of ORFs in this case.
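
A rough way to list such transcripts before running the pipeline could look like the sketch below. It assumes each annotation has already been parsed into a mapping from transcript id to a list of exon (start, end) tuples; that representation, the function name and the example values are assumptions for illustration, not part of rpbp.

```python
def conflicting_transcript_ids(annotated, de_novo):
    """Return ids present in both annotations whose exon structures differ."""
    shared = annotated.keys() & de_novo.keys()
    return sorted(tid for tid in shared if annotated[tid] != de_novo[tid])


# Hypothetical example: T1 is in both files with a different second exon,
# T2 is in both files with the same structure, T3 is de novo only.
annotated = {
    "T1": [(100, 200), (300, 400)],
    "T2": [(500, 650)],
}
de_novo = {
    "T1": [(100, 200), (350, 400)],
    "T2": [(500, 650)],
    "T3": [(900, 1000)],
}

print(conflicting_transcript_ids(annotated, de_novo))
```

Here only T1 is reported, since T2 has the same structure in both files and T3 exists in one file only.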

eboileau commented 1 year ago

Ok, for the duplicate ORF ids, we added an exception. The responsibility lies with the user to clean their de novo annotation and make sure "problematic" transcripts are removed.
For the biotype, etc. information that is added to the app, we don't want to do complicated wrangling. This is a problem only when adding de novo annotations where the assigned transcript ids are identical to the annotated ones. In this case, we favour the annotated one (duplicates are simply removed in default order). It is important to note that this only affects the visualisation (ORF table).
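
For reference, the two behaviours described above can be sketched as follows. This is a simplified illustration and not the code that was committed: the function names, the record layout and the assumption that annotated records come before de novo ones are hypothetical.

```python
def check_unique_orf_ids(orf_ids):
    """Raise if any ORF id occurs more than once."""
    seen = set()
    duplicated = set()
    for orf_id in orf_ids:
        if orf_id in seen:
            duplicated.add(orf_id)
        seen.add(orf_id)
    if duplicated:
        raise ValueError(f"Duplicate ORF ids found: {sorted(duplicated)}")


def drop_duplicate_transcripts(records):
    """Keep the first record per transcript id, in input order.

    Assumption: annotated records are listed before de novo ones,
    so the annotated record is the one that is kept.
    """
    kept = {}
    for record in records:
        kept.setdefault(record["transcript_id"], record)
    return list(kept.values())


# Hypothetical transcript records, annotated first.
records = [
    {"transcript_id": "T1", "source": "annotated", "biotype": "protein_coding"},
    {"transcript_id": "T2", "source": "annotated", "biotype": "lncRNA"},
    {"transcript_id": "T1", "source": "de_novo", "biotype": "novel"},
]
deduplicated = drop_duplicate_transcripts(records)
```

With this ordering, the T1 entry that survives is the annotated one. The de novo record with the same id is silently dropped, which is acceptable here only because the result is used for display.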

dieterich-lab / rp-bp

Data munging for the app does not handle de novo ORFs as it should #154