frattalab / PAPA

PAPA (Pipeline-Alternative Polyadenylation) - Snakemake pipeline for analysis of APA from short-read RNA-seq data
GNU General Public License v3.0

Remove/collapse duplicate transcripts from combined quantification GTF #34

Open SamBryce-Smith opened 1 year ago

SamBryce-Smith commented 1 year ago

When subsetting to individual exons, many entries may be entirely identical at the sequence/region level if they are shared between different full-length transcripts. This is wasted output in the GTF (inflating its size) and also triggers a warning when generating the salmon index.

[2022-03-19 15:44:58.606] [puff::index::jointLog] [warning] Removed 10618 transcripts that were sequence duplicates of indexed transcripts.
[2022-03-19 15:44:58.606] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the `--keepDuplicates` flag

It would be good to check for sequence duplicates before outputting the GTF. In these cases we could assign a 'combined tx ID' (e.g. the transcript IDs joined with a string separator).
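
A minimal sketch of the 'combined tx ID' idea, assuming the sequences have already been extracted into a dict of transcript ID to sequence. The function name `collapse_duplicate_sequences`, the `|` separator and the example IDs are hypothetical and not part of PAPA; a region key such as `(chrom, start, end, strand)` could be swapped in for the sequence if deduplicating at the coordinate level instead.

```python
from collections import defaultdict


def collapse_duplicate_sequences(tx_seqs, sep="|"):
    """Collapse transcripts with identical sequences into one entry.

    tx_seqs: dict mapping transcript ID -> sequence string.
    Returns a dict mapping a combined ID (the sorted IDs joined by `sep`)
    to the shared sequence. `sep` is an assumed separator; pick one that
    cannot occur in the real transcript IDs.
    """
    seq_to_ids = defaultdict(list)
    for tx_id, seq in tx_seqs.items():
        # Compare case-insensitively so soft-masked bases still match
        seq_to_ids[seq.upper()].append(tx_id)

    return {sep.join(sorted(ids)): seq for seq, ids in seq_to_ids.items()}


# Hypothetical example: two exon-level entries share a sequence
example = {"tx1_exon": "ACGTAC", "tx2_exon": "ACGTAC", "tx3_exon": "GGGTTT"}
collapsed = collapse_duplicate_sequences(example)
```

Sorting the IDs before joining keeps the combined ID stable between runs regardless of input order.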

As salmon index removes duplicates, this shouldn't cause any downstream problems, save for tx IDs potentially disappearing between the quant GTF and the salmon quantification output.
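
To see which IDs disappear, one option is to compare the transcript IDs in the quant GTF against the `Name` column of salmon's `quant.sf`. This is only a sketch: `missing_from_quant` is a hypothetical helper, the GTF IDs are assumed to have been collected into a set already, and the in-memory `quant.sf` text below is made-up example data.

```python
import csv
import io


def missing_from_quant(gtf_tx_ids, quant_sf_handle):
    """Return GTF transcript IDs absent from a salmon quant.sf file.

    gtf_tx_ids: iterable of transcript IDs taken from the quant GTF.
    quant_sf_handle: open text handle to a tab-separated quant.sf,
    whose first column header is `Name`.
    """
    reader = csv.DictReader(quant_sf_handle, delimiter="\t")
    quantified = {row["Name"] for row in reader}
    return sorted(set(gtf_tx_ids) - quantified)


# Made-up example: tx2_exon was dropped as a sequence duplicate at indexing
quant_text = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1_exon\t6\t6.0\t10.0\t5.0\n"
)
missing = missing_from_quant({"tx1_exon", "tx2_exon"}, io.StringIO(quant_text))
```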