LIONS baseline comparisons

While not explicitly stated, LIONS has a built-in measurement of its ability to detect TE-initiated transcripts relative to the baseline method of ab initio transcript assembly, since one component of the pipeline constructs contigs with Cufflinks to build a starting transcriptome. Because assembly software requires explicit splice-site detection, and reads spanning those splice junctions, to join exons into a transcript [reference], its sensitivity for complete assembly to the 5’ end of a transcript (where coverage also decreases) is limited.

For example, in Supplementary Figure 1C, take the “Alternative Isoform” as ground truth. If the assembled contig is complete to the 5’ end and includes Exon1B-Exon2-Exon3, the intersection between Exon1B and the LTR falls under the “UpEdge” classification. If the assembled contig is incomplete at the 5’ end and only includes (ground truth) Exon2-Exon3, the LTR lies upstream of Exon2, the first assembled exon, and the case falls under the “Up” classification. This is confounded when multiple isoforms exist: one isoform (such as the Exon1A transcript) may be detected while alternative isoforms are missed. In either case, the number of “UpEdge” and “Exon Inside” calls overlapping the first exon of an assembled contig, compared with the number of “Up” calls, is a direct measurement of the baseline detection level.

In the 100M and 200M read-depth H1esc and K562 transcriptome simulation datasets, 69.2% (n = 764) of TE-initiated transcripts are classified “Up”, compared to 224 “UpEdge” and 166 “Exon Inside”. In the 21 libraries of the Hodgkin Lymphoma dataset (12 cell lines and 9 B-cell controls), the corresponding counts are 10,306 “Up” to 2,899 “UpEdge” and 1,081 “Exon Inside”. By this measure, LIONS substantially increases the sensitivity of detecting TE-initiated transcripts over baseline methods.
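The comparison above comes down to a simple proportion. The sketch below shows one way to compute it from the three class counts, on the assumption that “Up” calls are the ones a purely assembly-based baseline would miss and that “UpEdge” plus “Exon Inside” calls are the ones it would recover. The function name `up_fraction` is hypothetical and is not part of LIONS.

```python
def up_fraction(up: int, up_edge: int, exon_inside: int) -> float:
    """Fraction of TE-initiated calls classified "Up".

    Assumption: "Up" calls are those where the assembled contig did not
    reach the TE, so a baseline relying on assembly alone would miss them.
    """
    total = up + up_edge + exon_inside
    if total == 0:
        raise ValueError("no classified TE-initiated transcripts")
    return up / total


# Counts quoted above for the 21 Hodgkin Lymphoma libraries.
hodgkin_up = up_fraction(up=10306, up_edge=2899, exon_inside=1081)
print(f"Up fraction: {hodgkin_up:.1%}")
```

Reporting the proportion alongside the raw counts makes it easier to compare datasets of very different size.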

Pending

TEtranscripts and the software compared in their manuscript are designed for differential expression analysis of TEs. This is a distinct computational problem, the main challenge being mapping/assigning reads to a family of TEs under different conditions, whereas LIONS is concerned with the intersection of individual TE loci with a particular (often non-TE) transcript. As the objectives of the methods are quite distinct, their underlying assumptions and models of how reads should be considered are also distinct and not fairly comparable. We agree that LIONS should be benchmarked against existing methods and have included a section on the performance of LIONS in comparison to baseline technologies (see above).

This question has arisen multiple times in communication with researchers and we have added a statement explaining this to the FAQ of the LIONS manual.

While you may run LIONS with pre-computed alignments, the standard parameters in the LIONS pipeline for TopHat2 alignment include the parameter --report-secondary-alignments. This means that when a read has multiple equally scoring alignments at different loci in the genome, one of these loci is randomly selected as the “primary alignment” and the others are flagged (bitwise flag 256) as “secondary alignments”.
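To make the flag concrete, here is a minimal Python sketch that separates primary from secondary records by testing bit 256 (0x100) of the SAM FLAG field. It only illustrates the flag and is not code from the LIONS pipeline. The function names and the toy records are made up for the example.

```python
SECONDARY = 0x100  # SAM FLAG bit 256: secondary alignment


def is_secondary(flag: int) -> bool:
    """True if the SAM FLAG has the secondary-alignment bit set."""
    return bool(flag & SECONDARY)


def split_alignments(sam_lines):
    """Return (primary, secondary) lists of (read name, chromosome)."""
    primary, secondary = [], []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        target = secondary if is_secondary(flag) else primary
        target.append((qname, rname))
    return primary, secondary


# Toy records (hypothetical): readB aligns equally well to two loci.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "readA\t99\tchr1\t1000\t50\t50M\t=\t1200\t250\t*\t*",
    "readB\t0\tchr1\t5000\t1\t50M\t*\t0\t0\t*\t*",
    "readB\t256\tchr7\t9000\t1\t50M\t*\t0\t0\t*\t*",
]
primary, secondary = split_alignments(toy_sam)
print(primary)    # records kept by a "-F 256" style filter
print(secondary)  # records such a filter would drop
```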

The reasoning behind this is that the objective of LIONS is the accurate detection of the 5’ ends of transcripts which intersect a TE, not the quantification of TE expression. Allowing secondary alignments means that, for highly similar repeats, a cluster of reads will be mapped to multiple loci and each of these loci will be analyzed as a potential TE-initiation site.

The detection algorithm of LIONS includes a search for “chimeric fragments”: those in which the paired reads map to i) a unique-sequence exon and ii) a TE (which may or may not be uniquely mapped). In addition, only read-pairs which map to the same chromosome and are within 250 kb of one another are considered. In this way, a group of non-unique TE reads must be “connected” to some uniquely mapping reads for inclusion.
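The chimeric-fragment criteria can be written as a small predicate. The sketch below is illustrative only and is not the LIONS implementation. The `Mate` record, its pre-computed `in_exon` / `in_te` / `unique` annotations, and the choice to measure distance between alignment start positions are all assumptions made for the example.

```python
from dataclasses import dataclass

MAX_PAIR_DISTANCE = 250_000  # 250 kb limit described above


@dataclass
class Mate:
    """One read of a pair (hypothetical, pre-annotated record)."""
    chrom: str
    pos: int        # alignment start position
    in_exon: bool   # overlaps an exon of an assembled contig
    in_te: bool     # overlaps an annotated TE
    unique: bool    # read is uniquely mapped


def is_chimeric_fragment(a: Mate, b: Mate) -> bool:
    """True if one mate is in a uniquely mapped exon and the other in a TE."""
    if a.chrom != b.chrom:
        return False
    if abs(a.pos - b.pos) > MAX_PAIR_DISTANCE:
        return False
    # The TE mate may be multi-mapped; the exon mate must be unique.
    return (a.in_exon and a.unique and b.in_te) or (
        b.in_exon and b.unique and a.in_te
    )


exon_mate = Mate("chr1", 100_000, in_exon=True, in_te=False, unique=True)
te_mate = Mate("chr1", 90_000, in_exon=False, in_te=True, unique=False)
far_te_mate = Mate("chr1", 900_000, in_exon=False, in_te=True, unique=False)

print(is_chimeric_fragment(exon_mate, te_mate))      # True
print(is_chimeric_fragment(exon_mate, far_te_mate))  # False
```

Requiring uniqueness only on the exon side is what lets multi-mapped TE reads contribute while still anchoring each call to unique sequence.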

To avoid over-estimation of read coverage over repetitive sequences, read quantification steps are performed on the subset of primary alignments only (setting “-F 256”). If you are interested in quantifying TE expression, it is best to use software specifically designed for this purpose such as TEtranscripts, or at a minimum to exclude secondary alignments with samtools view -F 256.

As a caveat, this may theoretically give rise to a very particular type of ambiguous TE-initiation call. If upstream of a unique exon there are two identical TE sequences (such as the two LTRs of an ERV) and transcription initiates and splices within one of the LTRs, then it is impossible to know whether the 5’ or 3’ LTR is responsible for TE-initiation without additional experiments. In such a case, both LTRs will be reported as a TE-initiation event with that exon. In the authors’ experience with the human genome, no such case has been observed, although in the mouse genome, in which LTRs are much younger, this is presumably more frequent.

ababaian / LIONS

LIONS baseline comparisons #5
