ababaian / LIONS

LIONS is a bioinformatic analysis pipeline which brings together a few pieces of software and some home-brewed scripts to annotate a paired-end RNAseq library and detect TE-initiated transcripts.
GNU General Public License v3.0

LIONS baseline comparisons #5

Closed ababaian closed 5 years ago

ababaian commented 5 years ago

Comments

  1. My only major concern is that there should be a comparison of the East Lions module to a baseline method. As referenced in the introduction, many studies have identified TE-derived transcripts, through various (perhaps less elegant) methods. It would be valuable for the manuscript to demonstrate how LIONS compares to current methods. For example, for the evaluated dataset, de novo transcript assembly could be performed (e.g., via Cufflinks or Stringtie) to see if any LIONS-discovered chimeric transcripts are recapitulated.
  2. Page 4, line 53: Choice of parameters should be better justified. At minimum, you should show the impact of varying some of these thresholds and show how many calls are made on some ENCODE datasets (or other dataset) for different choices of thresholds.

Comments

  1. The method was not compared with existing methods. Understandably there aren’t many other methods doing exactly the same type of analysis, but the authors should at least compare the validity/appropriate TE mapping to TE-specialized RNA software such as TEtranscripts. Additionally, computational simulations could be done to validate the findings and the choice of thresholds.
    1. Mapping reads to TEs is very challenging due to multi-mapping, and although the manuscript mentions that multi-mapped reads are flagged and conserved, the authors do not mention how they are then properly mapped or used, or how this whole issue is addressed.
ababaian commented 5 years ago

While not explicitly stated, LIONS has a built-in measure of its ability to detect TE-initiated transcripts relative to the baseline method of ab initio transcript assembly, since one component of the pipeline constructs contigs with Cufflinks to build a starting transcriptome. Because assembly software requires explicit splice-site detection, and reads spanning those splice junctions, to join the exons of a transcript [reference], sensitivity for complete assembly to the 5’ end of a transcript (where coverage also decreases) is limited.

For example, in Supplementary Figure 1C, take the “Alternative Isoform” as ground truth. If the assembled contig is complete to the 5’ end and includes Exon1B-Exon2-Exon3, the intersection between Exon1B and the LTR would fall under the “UpEdge” classification. If the assembled contig is incomplete at the 5’ end, missing the first exon and including only (ground truth) Exon2-Exon3, the intersection between Exon2 and the LTR would fall under the “Up” classification. This is confounded by cases where multiple isoforms exist: one isoform (such as the Exon1A transcript) may be detected while alternative isoforms are missed. In both cases, the number of “UpEdge” and “Exon Inside” calls overlapping the first exon of an assembled contig is a direct measurement of baseline detection levels, compared to the “Up” calls.

In the 100M and 200M read-depth H1esc and K562 transcriptome simulation datasets, 69.2% (n = 764) of classified TE-initiated transcripts are “Up”, compared to (224 + 166) “UpEdge” and “Exon Inside” (“Einside”) cases, respectively. In the 21 libraries of the Hodgkin Lymphoma dataset (12 cell lines and 9 B-cell controls), the numbers of “Up” to (“UpEdge” + “Exon Inside”) cases are 10,306 to (2,899 + 1,081), respectively. By these counts, LIONS substantially increases the sensitivity of detecting TE-initiated transcripts over baseline methods.
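The proportions behind these counts can be recomputed with a short sketch. It assumes the denominator is the sum of the three classes (“Up” + “UpEdge” + “Exon Inside”); the function name `up_fraction` is hypothetical, and the result may differ from a percentage quoted above if that figure was computed on a different denominator.

```python
def up_fraction(up, up_edge, e_inside):
    """Fraction of calls classified "Up" among Up + UpEdge + Exon Inside.

    Assumption: the three classes together form the denominator.
    """
    total = up + up_edge + e_inside
    return up / total


# Counts quoted in the response above.
simulation = up_fraction(764, 224, 166)
hodgkin = up_fraction(10306, 2899, 1081)

print(f"simulation datasets: {simulation:.1%} Up")
print(f"Hodgkin Lymphoma dataset: {hodgkin:.1%} Up")
```

Under this reading, the “Up” share is the portion of TE-initiated calls that the assembled contigs alone would not have placed at a first exon.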

  1. Pending

TEtranscripts and the software compared in their manuscript are designed for differential expression analysis of TEs. This is a distinct computational problem, the main challenge being mapping/assigning reads to a family of TEs under different conditions, whereas LIONS is concerned with the intersection of individual TE loci with a particular (often non-TE) transcript. As the objectives of the two methods are quite distinct, their underlying assumptions and models of how reads should be considered are also distinct and not fairly comparable. We agree that LIONS should be benchmarked against existing methods and have included a section on the performance of LIONS in comparison to baseline technologies (see above).

4.

This question has arisen multiple times in communication with researchers and we have added a statement explaining this to the FAQ of the LIONS manual.

While you may run LIONS with pre-computed alignments, the standard parameters in the LIONS pipeline for TopHat2 alignment include the parameter --report-secondary-alignments. This means that when a read has multiple equally scoring alignments at different loci in the genome, one of these loci is randomly selected as the “primary alignment” and the others are flagged (bitwise flag 256) as “secondary alignments”.

The reasoning behind this is that the objective of LIONS is the accurate detection of the 5’ end of transcripts which intersect a TE, not TE-expression quantification. Allowing secondary alignments means that, for highly similar repeats, a cluster of reads will be mapped to multiple loci, and each of these loci will be analyzed as a potential TE-initiation site.

The detection algorithm of LIONS includes a search for “chimeric fragments”, those in which the paired reads map to i) a unique-sequence exon and ii) a TE (which may or may not be uniquely mapped). In addition, only read-pairs which map to the same chromosome and are within 250 kb of one another are considered. In this way, a group of non-unique TE reads must be “connected” to some uniquely mapping reads for inclusion.
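The chimeric-fragment criteria described above can be sketched as a simple predicate. This is an illustration only, not the LIONS implementation: the `Mate` record, its field names, the feature labels, and the choice to treat exactly 250 kb as within range are all assumptions made for the example.

```python
from dataclasses import dataclass

MAX_MATE_DISTANCE = 250_000  # 250 kb limit stated above


@dataclass
class Mate:
    """Hypothetical summary of one read of a pair."""
    chrom: str
    pos: int
    feature: str   # assumed labels: "exon", "TE", or "other"
    unique: bool   # True if the read maps to a single locus


def is_chimeric_fragment(a: Mate, b: Mate) -> bool:
    """True if one mate is a uniquely mapped exon read and the other a TE read."""
    if a.chrom != b.chrom:
        return False
    if abs(a.pos - b.pos) > MAX_MATE_DISTANCE:
        return False
    # Check both orientations: (exon, TE) and (TE, exon).
    for exon_mate, te_mate in ((a, b), (b, a)):
        if (exon_mate.feature == "exon" and exon_mate.unique
                and te_mate.feature == "TE"):
            return True
    return False


exon = Mate("chr1", 100_000, "exon", unique=True)
te_near = Mate("chr1", 150_000, "TE", unique=False)
te_far = Mate("chr1", 900_000, "TE", unique=False)
print(is_chimeric_fragment(exon, te_near), is_chimeric_fragment(exon, te_far))
```

Note that the TE mate is allowed to be multi-mapped; only the exon mate must be unique, which is what “connects” a repetitive read to a specific locus.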

To avoid over-estimation of read-coverage of repetitive sequences, read quantification steps are performed on the subset of primary alignments only (setting “-F 256”). If you are interested in quantifying TE-expression, it is best to use software specifically designed for this purpose such as TEtranscripts, or at a minimum exclude secondary alignments with samtools view -F 256.
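For readers unfamiliar with SAM flags, the effect of excluding flag 256 can be sketched in plain Python on SAM-formatted text lines. The helper names `is_secondary` and `primary_only` and the example records are hypothetical; in practice samtools does this filtering directly on BAM files.

```python
SECONDARY = 0x100  # SAM bitwise flag 256: secondary alignment


def is_secondary(flag: int) -> bool:
    """True if the secondary-alignment bit is set in a SAM FLAG value."""
    return bool(flag & SECONDARY)


def primary_only(sam_lines):
    """Yield header lines and alignment records whose flag lacks bit 256."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are passed through unchanged
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not is_secondary(flag):
            yield line


# Made-up records: the same read reported once as primary (flag 99)
# and once as secondary (flag 355 = 99 + 256).
records = [
    "@SQ\tSN:chr1\tLN:100000",
    "r1\t99\tchr1\t100\t50\t4M\t=\t300\t204\tACGT\tFFFF",
    "r1\t355\tchr1\t5000\t0\t4M\t=\t5200\t204\tACGT\tFFFF",
]
kept = list(primary_only(records))
print(len(kept))
```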

As a caveat, this may theoretically give rise to a very particular type of ambiguous TE-initiation call. If upstream of a unique exon there are two identical TE sequences (such as the two LTRs of an ERV) and transcription initiates and splices within one of the LTRs, then it is impossible to know whether the 5’ or 3’ LTR is responsible for TE-initiation without additional experiments. In such a case, both LTRs will be reported as a TE-initiation event with that exon. In the authors’ experience with the human genome, no such case has been observed, although in the mouse genome, in which LTRs are much younger, this is presumably more frequent.