GoekeLab / sg-nex-data

Nanopore RNA-Seq data from the Singapore Nanopore-Expression Project

Errors in the augmented annotation GTF file #65

Open rob-p opened 2 weeks ago

rob-p commented 2 weeks ago

Hi,

We (@zzare-umd, @NPSDC, and I) are trying to do some analysis of different transcript identification and quantification tools using the benchmarking data you're providing through this project. As some methods require alignment to the genome, while others use alignment to the transcriptome, we have been following the directions provided in this repository to obtain the appropriate reference and annotation files.

However, after some initial analysis, we have noticed a substantial issue with the augmented GTF file that contains the reference annotations as well as those of the synthetic spike-in transcripts (sequins, SIRVs, ERCCs) (hg38_sequins_SIRV_ERCCs_longSIRVs_v5_reformatted.gtf). Ostensibly, this file contains the merged annotations for the reference transcripts and the synthetic transcripts, but it is not clear exactly how it was created. All of the source values (column 2) are listed as Bambu, which, in and of itself, is not a problem. However, the annotations themselves appear to be corrupted: each transcript feature is listed multiple times (once per exon), and the start and end positions of each transcript record are those of a single exon rather than of the whole transcript. This may be more clearly explained with a specific example:

[Screenshot: records for ENST00000456328 in the augmented GTF file]

Here, the original transcript ENST00000456328 is a single transcript with 3 constituent exons. However, in this modified file, the transcript record itself is repeated 3 times, each time with coordinates matching one of the 3 exons (the exon features are also recorded). If we compare this to the source Ensembl GTF (matched version), we see that it does not have this artifact:

[Screenshot: records for ENST00000456328 in the source Ensembl GTF file]

Here, we see what we expect; one transcript record, with 3 exon child features.

This is simply one example, but most, if not all, of the transcripts seem to be corrupted in this manner. This means that the annotation is incorrect for all but single-exon transcripts, which leads to unpredictable problems in subsequent quantification for methods that rely on genome alignments with a provided annotation.
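For anyone who wants to check a GTF file for this artifact, here is a minimal sketch in plain Python. It assumes a standard 9-column, tab-separated GTF with a `transcript_id "..."` attribute; the function name `find_repeated_transcripts`, the helper `gtf_line`, and the toy records (IDs `T1`, `T2`, `G1` and their coordinates) are made up for illustration and are not taken from the actual file.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def find_repeated_transcripts(gtf_lines):
    """Map each transcript_id that has more than one 'transcript' record
    to the list of (start, end) coordinates of those records."""
    records = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript-level features are of interest
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            records[match.group(1)].append((int(fields[3]), int(fields[4])))
    return {tid: spans for tid, spans in records.items() if len(spans) > 1}


def gtf_line(feature, start, end, tid):
    """Hypothetical helper that builds one toy GTF record."""
    attrs = f'gene_id "G1"; transcript_id "{tid}";'
    return "\t".join(["chr1", "Bambu", feature, str(start), str(end), ".", "+", ".", attrs])


# Toy input: T1 shows the artifact (one transcript record per exon),
# while the single-exon transcript T2 is unaffected.
example = [
    gtf_line("transcript", 100, 200, "T1"), gtf_line("exon", 100, 200, "T1"),
    gtf_line("transcript", 300, 400, "T1"), gtf_line("exon", 300, 400, "T1"),
    gtf_line("transcript", 500, 900, "T1"), gtf_line("exon", 500, 900, "T1"),
    gtf_line("transcript", 50, 80, "T2"), gtf_line("exon", 50, 80, "T2"),
]

repeated = find_repeated_transcripts(example)
print(repeated)
```

To run this on a real file, pass an open file handle in place of `example`, for instance `with open(path) as fh: find_repeated_transcripts(fh)`. A well-formed annotation should yield an empty dictionary.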

Interestingly, the transcriptome sequences provided do not exhibit this problem, and, at least for the Ensembl reference transcripts, the transcript sequences match the source (Ensembl) annotations.

Could you please describe the process that was used to create the aggregated annotation file hg38_sequins_SIRV_ERCCs_longSIRVs_v5_reformatted.gtf, which might shed some light on how these artifacts have been introduced? At the same time, it would be very useful to have access to the separate reference sequences and reference annotations for the Sequin + SIRV and ERCC transcripts, so that a complete, matching, uncorrupted annotation file can be created.

Please let me know if you have any questions about the above analysis, or if there is any way we can help you to fix these annotation files.

Thanks! Rob

cying111 commented 2 weeks ago

Hi @rob-p ,

Thanks for letting us know about the issue. The corruption might be related to my use of talon_reformat_gtf to reformat the GTF file. I have now provided a corrected version here. Let us know if it works for you. Alternatively, you can use the functions from bambu to obtain the corrected version. See the code below:

library(bambu)
gtf <- "hg38_sequins_SIRV_ERCCs_longSIRVs_v5_reformatted.gtf"
anno <- prepareAnnotations(gtf)
writeToGTF(anno, file = "hg38_sequins_SIRV_ERCCs_longSIRVs_corrected.gtf")
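For readers who cannot run bambu, the shape of a repaired file can be illustrated in plain Python. This is only a rough sketch and not the procedure used to produce the corrected file: it assumes that the exon records are correct, keeps the first transcript record for each `transcript_id`, and sets its coordinates to span all of that transcript's exons. The function name `collapse_transcript_records`, the helper `rec`, and the toy records are hypothetical.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def collapse_transcript_records(gtf_lines):
    """Keep one 'transcript' record per transcript_id, spanning all of its exons."""
    exon_span = {}  # transcript_id -> (min exon start, max exon end)
    parsed = []
    for line in gtf_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if not line.strip() or line.startswith("#") or len(fields) < 9:
            parsed.append((line, None, None))  # passed through unchanged
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        tid = match.group(1) if match else None
        parsed.append((line, fields, tid))
        if tid and fields[2] == "exon":
            start, end = int(fields[3]), int(fields[4])
            lo, hi = exon_span.get(tid, (start, end))
            exon_span[tid] = (min(lo, start), max(hi, end))

    emitted = set()
    out = []
    for line, fields, tid in parsed:
        if fields is not None and tid and fields[2] == "transcript":
            if tid in emitted:
                continue  # drop the repeated transcript records
            emitted.add(tid)
            if tid in exon_span:
                lo, hi = exon_span[tid]
                fields = fields[:3] + [str(lo), str(hi)] + fields[5:]
            out.append("\t".join(fields))
        else:
            out.append(line)
    return out


def rec(feature, start, end):
    """Hypothetical helper that builds one toy GTF record for transcript T1."""
    attrs = 'gene_id "G1"; transcript_id "T1";'
    return "\t".join(["chr1", "Bambu", feature, str(start), str(end), ".", "+", ".", attrs])


# Toy input with one transcript record per exon, as in the reported artifact.
corrupted = [
    rec("transcript", 100, 200), rec("exon", 100, 200),
    rec("transcript", 300, 400), rec("exon", 300, 400),
    rec("transcript", 500, 900), rec("exon", 500, 900),
]
fixed = collapse_transcript_records(corrupted)
print("\n".join(fixed))
```

The sketch leaves gene-level records and all attributes untouched, so it is no substitute for regenerating the annotation with a tool that understands the full feature hierarchy.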

As for the original reference files, I might need some time to locate them; I will upload them once I find them.

Thank you again and let us know if you have further questions. Warm regards, Ying