LIONS is a bioinformatic analysis pipeline that combines several pieces of software with home-brewed scripts to annotate a paired-end RNA-seq library and detect TE-initiated transcripts.
The sentence structure and grammar are distracting and should be revised. The action/verb is often placed at the end of the sentence, and some sentences lack a verb altogether.
In the abstract: The authors should define TE-initiated transcripts.
Page 2 first paragraph: TEs CAN contain promoters…
Page 2 second paragraph: “TE promoters can be exploited to express a protooncogene” The word “exploited” here is awkward and implies something is guiding this process.
Page 2, line 21: define ESTs
Page 2, line 27: you have two “or”s in the sentence; maybe the first one should simply be a “,”
Page 2, line 31: I would replace “Notably” by “For instance”
Page 2, line 37: Needs actual comparisons here
Page 2, line 41: “quantitative” -> “comprehensive and quantitative”
Page 2, line 43: I would remove “quantitatively” since you just used that word
Page 2, line 43-47: Clearly some components of LIONS were especially designed for cancer and for sample comparison. Perhaps some of this should also be included in your abstract?
Page 2, line 55: that sentence seems to be missing a verb, “is outputted” maybe?
Page 4, line 6: A bit repetitive, you just said what the inputs were.
Page 4, line 11: “parameter-optimized”, please explain
Page 4, line 15: “ab initio assembly”, using what? Please explain
Page 4, line 21: weird character at the end of the paragraph
Page 4, line 44: weird character
Page 4, line 45: not sure I understand your “i.e. retained introns and low abundance lincRNAs”, do you mean you exclude those? Maybe rephrase a bit?
Page 5, line 22: For “drivers”: Aren’t reads only mapping to TEs depleted in the previous step?
Page 5, line 28: weird character and also underlined here.
Page 5, line 49: maybe cite Fig 1C here?
Page 6, line 5: not sure I understand why you cite Fig 1B here since it doesn’t talk about recurrence and specificity
Page 6, line 8: I would call this “Evaluation and Conclusion”
Page 6, line 18: weird character
Sup Material: not sure on the format but Sup Figs and Sup Fig captions could probably be included into a single Sup Material document
You should add a reference to the Manual somewhere in the text.
[x] Fix Figures [Artem]
Page 3, Figure 1: Figure labels are a bit confusing as-is since we can’t easily see the 3 modules. Maybe relabel A) Module 1: Input / Initialization, B) Module 2i: Detect and Classify, C) Module 2ii: Intersect…, D) Module 3: Comparison… The “East” and “West Lions” labels just add a confusing layer. Otherwise, maybe you should simply talk about two main modules.
Figure 1: 1.A doesn’t show an alignment step, although the legend mentions it. Alignment vs Assembly unclear at this point. Provide more details for the description of panel “C”
One potential limitation is the reliance on reference sequences and the TE annotations therein. This means that polymorphic TE sequences will not be considered by the software. The authors may wish to mention this, and/or provide an alternative solution for investigators who wish to consider the potential impacts of polymorphic TEs on the generation of isoforms. I’m not sure if the optional ab initio assembly step could be used in this regard.
Baseline / Simulation comparison of LIONS [ Artem ] #5
[ ] Complete
My only major concern is that there should be a comparison of the East Lions module to a baseline method. As referenced in the introduction, many studies have identified TE-derived transcripts, through various (perhaps less elegant) methods. It would be valuable for the manuscript to demonstrate how LIONS compares to current methods. For example, for the evaluated dataset, de novo transcript assembly could be performed (e.g., via Cufflinks or Stringtie) to see if any LIONS-discovered chimeric transcripts are recapitulated.
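One way to frame the comparison the reviewer asks for is to measure what fraction of LIONS calls have a same-strand transcript start from an independent assembly nearby. The following is a minimal sketch under stated assumptions: the `(chrom, strand, position)` tuples, the 50 bp tolerance, and all function names are hypothetical and are not taken from LIONS, Cufflinks, or StringTie.

```python
from bisect import bisect_left
from collections import defaultdict


def build_tss_index(assembled_tss):
    """Group assembled transcript start sites by (chrom, strand), sorted by position."""
    index = defaultdict(list)
    for chrom, strand, pos in assembled_tss:
        index[(chrom, strand)].append(pos)
    for positions in index.values():
        positions.sort()
    return index


def is_recapitulated(call, index, tolerance=50):
    """True if an assembled start site lies within `tolerance` bp of the call, same strand."""
    chrom, strand, pos = call
    positions = index.get((chrom, strand), [])
    # First assembled position that is >= the left edge of the window.
    i = bisect_left(positions, pos - tolerance)
    return i < len(positions) and positions[i] <= pos + tolerance


def recapitulation_rate(calls, assembled_tss, tolerance=50):
    """Fraction of calls supported by at least one assembled start site."""
    if not calls:
        return 0.0
    index = build_tss_index(assembled_tss)
    hits = sum(is_recapitulated(call, index, tolerance) for call in calls)
    return hits / len(calls)
```

Reporting this rate at a few tolerances would show how sensitive the agreement is to the exact start-site position.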
Page 4, line 53: Choice of parameters should be better justified. At minimum, you should show the impact of varying some of these thresholds and show how many calls are made on some ENCODE datasets (or other dataset) for different choices of thresholds.
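The threshold analysis requested here could be tabulated with a simple grid sweep. This is only a sketch: the candidate fields `read_support` and `te_contribution` are hypothetical stand-ins for whichever LIONS parameters are varied, and the grids are illustrative.

```python
def count_calls(candidates, min_reads, min_contribution):
    """Count candidates that pass both (hypothetical) thresholds."""
    return sum(
        1
        for c in candidates
        if c["read_support"] >= min_reads
        and c["te_contribution"] >= min_contribution
    )


def sweep(candidates, read_grid, contribution_grid):
    """Map every (min_reads, min_contribution) pair to the number of calls made."""
    return {
        (r, f): count_calls(candidates, r, f)
        for r in read_grid
        for f in contribution_grid
    }
```

Running `sweep` on each dataset yields one table per library, which can then be compared across ENCODE or other samples.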
a. Written response of differing functions / Mapping method [ Artem ]
The method was not compared with existing methods. Understandably there aren’t many other methods doing exactly the same type of analysis, but the authors should at least compare the validity/appropriate TE mapping to TE-specialized RNA software such as TEtranscripts. Additionally, computational simulations could be done to validate the findings and the choice of thresholds.
Mapping reads to TEs is very challenging due to multi-mapping, and although the manuscript mentions that multi-mapped reads are flagged and conserved, the authors do not mention how they are then properly mapped or used, or how this whole issue is addressed.
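To put a number on the multi-mapping issue in a response, one option is to count uniquely versus multiply aligned records from SAM text. The sketch below assumes the aligner writes the optional `NH:i:` tag (number of reported alignments); not every aligner does, so records without it are counted as "unknown". The function name is hypothetical, and the counts are per alignment record, so the two mates of a pair are counted separately.

```python
def classify_alignments(sam_lines):
    """Count primary mapped records as unique, multi, or unknown using the NH tag."""
    counts = {"unique": 0, "multi": 0, "unknown": 0}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records
        # so that each mapped read end is counted once.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        nh = None
        for tag in fields[11:]:  # optional fields start at column 12
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
                break
        if nh is None:
            counts["unknown"] += 1
        elif nh == 1:
            counts["unique"] += 1
        else:
            counts["multi"] += 1
    return counts
```

The input can be any iterable of SAM lines, for example the text output of `samtools view -h`.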
Prepare LIONS Input Example [ Richard ] #6
[ ] Complete
a. Create a reference-ready ‘Fastq to Output’ container based on the first. The idea being that all reference materials (RM, hg38, etc…) are automatically downloaded and loaded directly into the container.
b. Adjust standard LIONS input.ctrl and parameters.ctrl example files to work with reference pipeline.
c. Write specific responses to each point below with how we address this [ Richard / Mehdi ]
LIONS requires as input an RNA-seq dataset, reference genome, gene annotation and TE annotation files. It would make more sense to provide at least one set of these with the package, e.g. GRCh38 plus matching annotation files.
LIONS requires a fairly long list of dependencies to be installed. This will raise the bar of use substantially, thereby potentially limiting the reach of the tool. One possibility would be to package all the dependencies in system-specific dependency packages which could be easily downloaded by end-users.
The package uses a very old version of SAMTOOLS (0.1.18) – the current version is 1.9. They do not provide binaries for such an old version of SAMTOOLS, so users will need to build it from source, which could be difficult for some users. In general, use of such an old version could cause things to break unexpectedly since this version is no longer being maintained.
Although there are some test data provided in the GitHub repository, no BAM file is provided. The sample input file (controls/input.list) lists a file which is not present in the GitHub repository.
The program throws errors about missing files during runtime – this can be avoided by introducing a pre-run input check. This will greatly benefit the end-user by providing instant feedback about the locations and access permissions of input files – thereby allowing them to launch multiple jobs with confidence.
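A pre-run input check of the kind suggested above could look like the following minimal sketch. It takes a plain list of paths; how those paths are collected from the LIONS input list is left out, and the function name and message wording are hypothetical.

```python
import os


def check_inputs(paths):
    """Return a list of human-readable problems; an empty list means all inputs are usable."""
    problems = []
    for path in paths:
        if not os.path.exists(path):
            problems.append(f"missing: {path}")
        elif not os.path.isfile(path):
            problems.append(f"not a regular file: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"not readable: {path}")
        elif os.path.getsize(path) == 0:
            problems.append(f"empty: {path}")
    return problems
```

Calling this once before any job is launched and aborting on a non-empty result gives the immediate feedback the reviewer describes, rather than a failure partway through a run.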
Show LIONS Output Example #7
[x] Complete
a. Include example output from HL/ENCODE data in the LIONS repository [Artem]
The OUTPUT.md file describes the format of the output, but it would be nice to have an example output file for inspection. It would be helpful to know how the output can be used for differential expression analysis (e.g. normal vs. cancer).
b. Include RT-PCR cross-reference file [Artem]
Page 6, line 9 and Sup Fig 3: The RT-PCR validation was not convincing, or at least not explained properly. It was unclear what transcripts were selected for the PCR validation. Also, it would be good if you also showed in Sup Fig 3 the data and calls made by LIONS in a positive and negative example based on the validation.
Page 6: The evaluation of the method is unconvincing. More details are needed. On which genes was the RT-PCR performed? How did the data compare to RNA-seq? RT-PCR tends to quantify the expression of a specific small subset of genes. This quantification is also relative, and its interpretation can’t be carried over to compare with RNA-seq directly.
c. Create a cut script to include in LIONS which will take the “.lion” file and create a simple output format which can be read/understood by users easily [Artem]
Convert to GTF: The output format is unclear. “.lion” is not a “standard” output file format as is presented in the text. Define the output file format, and perhaps consider using a known extension.
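For the GTF conversion, each call has to be written as one line with nine tab-separated columns (seqname, source, feature, start, end, score, strand, frame, attributes). The sketch below shows only that formatting step. The record keys (`chrom`, `start`, `end`, `strand`, `gene`, `transcript`, `te`) and the `te_name` attribute are hypothetical and do not reflect the actual “.lion” columns; coordinates are assumed to already be 1-based and inclusive, as GTF requires.

```python
def to_gtf_line(record, source="LIONS", feature="exon"):
    """Format one call (a dict with hypothetical keys) as a single GTF line."""
    attributes = (
        f'gene_id "{record["gene"]}"; '
        f'transcript_id "{record["transcript"]}"; '
        f'te_name "{record["te"]}";'
    )
    fields = [
        record["chrom"],
        source,
        feature,
        str(record["start"]),
        str(record["end"]),
        ".",  # score: not available
        record["strand"],
        ".",  # frame: not applicable
        attributes,
    ]
    return "\t".join(fields)
```

Writing one such line per call produces a file that genome browsers and standard annotation tools can read without a custom parser.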
@biscuit13161 When you get a chance, can you take a look at the User Guide? There is text running off the side of the page (page 3), which makes part of it unreadable.
Writing
[x] Online Version of LIONS Manuscript [ Artem ]
[x] Typo / Wording Revisions [Artem]
[x] Polymorphic Sequences – Write Response [ Dixie ] (upload response to manuscript)