Errors in the augmented annotation GTF file

Hi,

We (@zzare-umd, @NPSDC, and I) are trying to do some analysis of different transcript identification and quantification tools using the benchmarking data you're providing through this project. As some methods require alignment to the genome, while others use alignment to the transcriptome, we have been following the directions provided in this repository to obtain the appropriate reference and annotation files.

However, after some initial analysis, we have noticed a substantial issue with the augmented GTF file (hg38_sequins_SIRV_ERCCs_longSIRVs_v5_reformatted.gtf), which ostensibly contains the merged annotations for the reference transcripts and the synthetic (sequin, SIRV, and ERCC) transcripts. It is not clear exactly how this file was created. All of the source values (column 2) are listed as Bambu, which, in and of itself, is not a problem. However, the annotations themselves appear to be corrupted: each transcript feature is listed multiple times (once per exon), and the start and end positions of each repeated transcript record are those of one of its exons rather than of the whole transcript. This may be more clearly explained with a specific example:

Here, the original transcript ENST00000456328 is a single transcript with 3 constituent exons. However, in this modified file, the transcript record itself is repeated 3 times, each time with coordinates matching one of the 3 exons (the exon features themselves are also recorded). If we compare this to the source Ensembl GTF (matched version), we see that it does not have this artifact:

Here, we see what we expect: one transcript record with 3 exon child features.

This is simply one example, but most, if not all, of the transcripts seem to be corrupted in this manner. This means that the annotation is incorrect for all but single-exon transcripts, which leads to unpredictable problems in subsequent quantification for methods that rely on genome alignments with a provided annotation.
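For anyone who wants to check a copy of the file for this artifact, here is a minimal sketch of one way to do it. It assumes a standard 9-column, tab-separated GTF with a `transcript_id "..."` attribute in column 9; the function names and the toy rows (`chrT`, `G1`, `TX1`, `TX2`) are hypothetical and are not taken from the real file.

```python
import re
from collections import Counter

# Assumption: attributes follow the usual GTF form, e.g. transcript_id "ENST...";
_TX_ID = re.compile(r'transcript_id "([^"]+)"')


def count_transcript_records(lines):
    """Count how many 'transcript' feature rows each transcript_id has."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only look at well-formed transcript rows
        match = _TX_ID.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def duplicated_transcripts(lines):
    """Return {transcript_id: n} for ids with more than one transcript row."""
    return {tx: n for tx, n in count_transcript_records(lines).items() if n > 1}


if __name__ == "__main__":
    # Toy rows with made-up coordinates that mimic the artifact for TX1.
    toy = [
        'chrT\tBambu\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";\n',
        'chrT\tBambu\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";\n',
        'chrT\tBambu\ttranscript\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";\n',
        'chrT\tBambu\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";\n',
        'chrT\tBambu\ttranscript\t500\t600\t.\t+\t.\tgene_id "G2"; transcript_id "TX2";\n',
        'chrT\tBambu\texon\t500\t600\t.\t+\t.\tgene_id "G2"; transcript_id "TX2";\n',
    ]
    print(duplicated_transcripts(toy))
```

To run it on a real file, pass an open file handle instead of the toy list, e.g. `with open(path) as fh: duplicated_transcripts(fh)`. In a well-formed annotation the result should be empty.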

Interestingly, the transcriptome sequences provided do not exhibit this problem, and, at least for the Ensembl reference transcripts, the transcript sequences match the source (Ensembl) annotations.

Could you please describe the process that was used to create the aggregated annotation file hg38_sequins_SIRV_ERCCs_longSIRVs_v5_reformatted.gtf, which might shed some light on how these artifacts have been introduced? At the same time, it would be very useful to have access to the separate reference sequences and reference annotations for the Sequin + SIRV and ERCC transcripts, so that a complete, matching, uncorrupted annotation file can be created.
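In the meantime, one possible stopgap (not necessarily how the file should ultimately be regenerated) is to discard the existing transcript rows and rebuild a single transcript record per `transcript_id` from its exon rows. The sketch below assumes that the exon rows themselves are correct, that column 9 carries `gene_id "..."` and `transcript_id "..."` attributes, and that only transcript and exon features need to be kept; the function name `rebuild_transcript_rows` is hypothetical.

```python
import re

# Assumption: attributes use the usual GTF form key "value";
_TX_ID = re.compile(r'transcript_id "([^"]+)"')
_GENE_ID = re.compile(r'gene_id "([^"]+)"')


def rebuild_transcript_rows(lines):
    """Return GTF rows with one rebuilt 'transcript' row per transcript_id.

    Existing transcript rows are ignored; each rebuilt row spans from the
    smallest exon start to the largest exon end of that transcript.
    Features other than exons are dropped in this sketch.
    """
    exons = {}  # transcript_id -> list of exon rows (as field lists), file order
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = _TX_ID.search(fields[8])
        if match is None:
            continue
        exons.setdefault(match.group(1), []).append(fields)

    out = []
    for tx, rows in exons.items():
        start = min(int(r[3]) for r in rows)
        end = max(int(r[4]) for r in rows)
        first = rows[0]
        gene = _GENE_ID.search(first[8])
        attrs = f'transcript_id "{tx}";'
        if gene:
            attrs = f'gene_id "{gene.group(1)}"; ' + attrs
        # Chromosome, source, and strand are copied from the first exon row.
        out.append("\t".join(
            [first[0], first[1], "transcript", str(start), str(end),
             ".", first[6], ".", attrs]
        ))
        out.extend("\t".join(r) for r in rows)
    return out
```

This only papers over the symptom described above. If the exon coordinates or the synthetic-transcript annotations were also altered during merging, rebuilding from the original, separate annotation files would still be the safer route.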

Please let me know if you have any questions about the above analysis, or if there is any way we can help you to fix these annotation files.

Thanks! Rob

GoekeLab / sg-nex-data

Errors in the augmented annotation GTF file #65