Hi @rtraborn
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: TSRchitect
Version: 0.99.0
Date: 2017-01-19
Title: Promoter identification from diverse types of large-scale TSS profiling data.
Description: In recent years, large-scale transcriptional sequence data has
yielded considerable insights into the nature of gene expression and regulation
in eukaryotes. Techniques that identify the 5' end of mRNAs, most notably CAGE,
have mapped the promoter landscape across a number of model organisms.
Due to the variability of TSS distributions and the transcriptional noise present
in datasets, precisely identifying the active promoter(s) for genes from these datasets
is not straightforward. TSRchitect allows the user to efficiently identify the putative
promoter (the transcription start region, or TSR) from a variety of TSS profiling
data types, including both single-end (e.g. CAGE) as well as paired-end (RAMPAGE, PEAT).
Along with the coordinates of identified TSRs, TSRchitect also calculates the width,
abundance and Shape Index, and handles biological replicates for expression profiling.
Finally, TSRchitect imports annotation files, allowing the user to associate identified promoters
with genes and other genomic features. Three detailed examples of TSRchitect's utility are provided
in the vignette, included with this package.
Authors@R: c(person("R. Taylor","Raborn",email="rtraborn@indiana.edu",role=c("aut","cre", "cph")),
person("Volker","Brendel",email="vbrendel@indiana.edu",role=c("aut","cph")))
Author: R. Taylor Raborn [aut, cre, cph]
Volker P. Brendel [aut, cph]
Maintainer: R. Taylor Raborn <rtraborn@indiana.edu>
Depends: R (>= 3.3.0)
Imports: BiocGenerics, BiocParallel, GenomicAlignments, GenomeInfoDb,
GenomicRanges, gtools, IRanges, Rsamtools (>= 1.14.3),
rtracklayer, utils
License: GPL-3
URL: https://github.com/brendelgroup/tsrchitect
BugReports: https://github.com/brendelgroup/tsrchitect/issues
Suggests: ENCODExplorer, knitr, rmarkdown
biocViews: Clustering, FunctionalGenomics, GeneExpression,
GeneRegulation, GenomeAnnotation, Sequencing, Transcription
VignetteBuilder: knitr
RoxygenNote: 5.0.1
NeedsCompilation: no
Packaged: 2017-01-18 15:20:28 UTC; rtraborn
I would recommend separating the implementation of the TSR methodology from the way the data are managed. The user should be able to load data using standard import tools, integrate the datasets into higher-level objects, and feed them to the algorithms without relying on an API based on side-effects. Particularly troubling is the initializeExp() function, which assigns into its parent frame.
Your package has been approved for building. Your package is now submitted to our queue.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/TSRchitect_buildreport_20170120134038.html
Hello @vobencha @bioc-issue-bot:
We have updated our code base to address the issues that triggered the build error(s) above. Please let me know what we need to do to restart the automatic build process from 1/20.
Thanks, Taylor
Hi,
I've done a first review. There is quite a bit of re-factoring that needs to be done related to #3.
(1) Please format code to <=80 characters wide.
(2) parallel evaluation
Please use either BiocParallel or doParallel. Currently you have a mix of the two. Since you are building on the infrastructure in GenomicAlignments I'd recommend using BiocParallel.
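A minimal sketch of the BiocParallel idiom (the input list and worker function below are illustrative, not TSRchitect's API):

library(BiocParallel)
#bplapply() is a drop-in parallel analogue of lapply()
counts <- bplapply(bamFiles, countTagsInFile,
                   BPPARAM = MulticoreParam(workers = 4))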
(3) TSR methods vs data management
Please address Michael's comments in https://github.com/Bioconductor/Contributions/issues/256#issuecomment-274140333
I agree with him and think TSRchitect could focus on developing methods for TSR data, e.g., define bamToTSS() and processTSS() methods for GAlignments, GAlignmentsList, or GAlignmentPairs objects. I think you can re-use existing classes and functions to read in and manage the data, e.g., no need to create an importBam(). If you have specific needs that can't be met by one of the existing classes please explain. Maybe we can find a better solution or you can extend an existing class, etc.
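A sketch of what that dispatch could look like (the generic and its body are illustrative assumptions, not an existing TSRchitect signature):

library(GenomicAlignments)
setGeneric("processTSS", function(x, ...) standardGeneric("processTSS"))
setMethod("processTSS", "GAlignments", function(x, ...) {
    #the strand-aware 5' base of each alignment is the observed TSS
    resize(granges(x), width = 1, fix = "start")
})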
(4) S4 objects
Once the accessors for an S4 class are defined all other code should use the accessors and not '@' for slot access. For example, you should use a getFileNames() instead of '@' in this vignette code chunk:
#obtaining the list of bam files loaded to the tssObject S4 object
Hs_RAMPAGE@fileNames
Please make these changes throughout the code.
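For example, a minimal accessor sketch, assuming a fileNames slot of class character on tssObject:

setGeneric("getFileNames", function(x) standardGeneric("getFileNames"))
setMethod("getFileNames", "tssObject", function(x) x@fileNames)

#the vignette chunk above then becomes:
getFileNames(Hs_RAMPAGE)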
(5) tar ball size
The source package resulting from running R CMD build should occupy less than 4MB on disk. See package guidelines: http://www.bioconductor.org/developers/package-guidelines/#correctness
TSRchitect is 15 MB:
~/software > du -h TSRchitect_0.99.0.tar.gz
15M     TSRchitect_0.99.0.tar.gz
(6) vignette
i) evaluation The vignette is essentially not evaluated. I'm guessing this is due to time constraints. As written, it takes much longer than the <=5 minute guideline. To be accepted into the Bioconductor repo you need an evaluated vignette. You'll need to find smaller data files for the examples; options are to look in the annotation and/or experimental data repos:
http://www.bioconductor.org/packages/release/BiocViews.html#___ExperimentData http://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData
or use the AnnotationHub package. If none of those work, you can make an experimental data package of subsets of the RAMPAGE datasets.
ii) symbolic links This type of file system manipulation is not cross-platform compatible. Please use tempdir() and tempfile() for a place to download and manage files in an R session and remove this section:
#The following are to be executed in your shell terminal:
mv *.bam downloads
cd downloads
cd HsRAMPAGEbam
ln -s ../downloads/ENCFF214GWH.bam H460-rep1.bam
ln -s ../downloads/ENCFF265SGZ.bam H460-rep2.bam
ln -s ../downloads/ENCFF242UWH.bam HT1080-rep1.bam
ln -s ../downloads/ENCFF348EKW.bam HT1080-rep2.bam
cd ..
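A cross-platform alternative in R might look like the following sketch; the ENCODE download URL is an assumption based on the accession and should be verified:

bamDir <- tempdir()
bamFile <- file.path(bamDir, "H460-rep1.bam")
download.file("https://www.encodeproject.org/files/ENCFF214GWH/@@download/ENCFF214GWH.bam",
              destfile = bamFile, mode = "wb")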
iii) annotation The vignette states the RAMPAGE data sets were aligned to GRCh38, but the annotation you use is v19, which I assume corresponds to hg19:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz
AnnotationHub has several GRCh38 files available:
> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2017-01-05
> query(hub, c("gencode", "gff", "human"))
AnnotationHub with 9 records
# snapshotDate(): 2017-01-05
# $dataprovider: Gencode
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH49554"]]'
title
AH49554 | gencode.v23.2wayconspseudos.gff3.gz
AH49555 | gencode.v23.annotation.gff3.gz
AH49556 | gencode.v23.basic.annotation.gff3.gz
AH49557 | gencode.v23.chr_patch_hapl_scaff.annotation.gff3.gz
AH49558 | gencode.v23.chr_patch_hapl_scaff.basic.annotation.gff3.gz
AH49559 | gencode.v23.long_noncoding_RNAs.gff3.gz
AH49560 | gencode.v23.polyAs.gff3.gz
AH49561 | gencode.v23.primary_assembly.annotation.gff3.gz
AH49562 | gencode.v23.tRNAs.gff3.gz
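Following the retrieval hint in the output, one of these records can then be loaded directly as a GRanges, e.g.:

> gencode.gr <- hub[["AH49555"]]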
Let me know if you have questions. Valerie
Hello,
Thank you for your helpful review of our package.
Items 1), 2), 4), and 5) are no problem; we'll be able to address them straightforwardly.
To the points you raised in 3), which overlap in part with those of @lawremi (although we will respond to his comments separately for the sake of clarity), we will share our perspective:
We recognize the utility of the existing S4 objects (e.g. GAlignments and GAlignmentsList); in theory some of our code could be converted to methods for these objects, rather than TSRchitect's tssObject S4 class. We developed the workflow as it currently exists: i) to facilitate the retrieval and storage of TSS profiling datasets prior to clustering, and ii) to avoid pitfalls that may arise from analyzing datasets of class, e.g., GAlignments. To these ends:
i. In our experience, in the majority of use cases the user's TSS profiling datasets are in BAM format, so we developed this version of TSRchitect to rely on data from .bam files. However, since our package is intended for analysis of multiple TSS profiling methods (CAGE, RAMPAGE, PEAT, full-length cDNAs and ESTs), we want TSRchitect to retain the flexibility to import other formats as well in future versions. We therefore envision multiple paths leading to the @tssTagData slot, although not all of them will originate as BAM files; others will be BED and GFF3 formats. A major role of the fileNames and bamData slots is to maintain data provenance: what is the source of the data, and how did one arrive at the TSS positions found on the tssData slot? Given this scope, operating on GAlignments and GAlignmentsList S4 objects would work for some, but not all, potential use cases. There are certain operations in particular that we need to perform on the BAM files upstream of their being imported to our tssObject; these relate to the selection/exclusion of specific flags (see our invocations of scanBamFlag() in importBam()). While CAGE libraries are single-end, RAMPAGE and PEAT are paired-end, and in the latter case we want to extract only the properly-paired reads: only these have evidence connecting the TSS to a downstream transcript body. When designing TSRchitect, we reasoned that making these specifications and encoding them in scanBamFlag() prior to importing the .bam file was the best option, and they lead to attachment of data in a uniform format on the @bamData slot.
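A sketch of the flag selection described above for paired-end data, using standard Rsamtools calls (the final line assumes GenomicAlignments is attached):

library(Rsamtools)
#keep only properly-paired reads; only these connect a TSS to a transcript body
flag <- scanBamFlag(isPaired = TRUE, isProperPair = TRUE)
param <- ScanBamParam(flag = flag)
#reads <- readGAlignmentPairs(bamFile, param = param)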
ii. A concern we have about asking users to compute on GAlignments or related S4 objects, rather than on slots of our tssObject S4 class, is that we want the package to import their experimental data in its entirety at the outset. TSS profiling experiments are typically done in replicate, and they frequently include different conditions, stages, or tissues. Our importBam/setSampleID workflow was envisioned as a means to address that use case by importing all of the data together in a replicate/sample-aware manner, but we are open to suggestions on how to improve this. Please let us know what your thoughts are on the above and what you might suggest we do. I apologize for the length of these remarks, but we wanted to provide some information on our objectives for the package.
Finally, concerning your comments in point 6 about our vignette: if we understand correctly, it is the vignette code that must be evaluated within 5 minutes. If so, we should be able to make this work. Are we permitted to have multiple vignettes? If so, we are envisioning a <=5 minute vignette followed by other vignettes that take longer to evaluate. Would these then be specified in separate documents in /vignettes? The purpose of these multiple vignettes would be to feature a different data type in each. Thank you for the nice suggestion about using the AnnotationHub package; this will make things much easier.
Thank you again for your review- it was very helpful.
Taylor
There's nothing wrong with providing conveniences, as long as it's possible to construct instances of your tssObject class from standard Bioconductor objects, without enforcing, e.g., a particular set of files on the file system.
Understood. Would a modification of initializeExp() to permit direct addition of GAlignment objects to a new slot (or a renamed bamData slot) when constructing the tssObject address this? This would give the option of importing the alignment (i.e. .bam) files, but would add the functionality of attaching GAlignment objects from the workspace also.
Yes, that would work. Of course, initializeExp() has its own issues.
Yes, you can have multiple vignettes. The idea is that key functions are evaluated at least once in the man pages or vignette so the nightly builds can detect a problem with the function or dependencies early on.
The command R CMD check --no-build-vignettes must complete in <=5 minutes. This runs all vignettes in vignettes/. You need at least one 'live' vignette that is evaluated and completes in the required time. If you include others that aren't evaluated, you should add a disclaimer at the top of each and point readers to the one that is evaluated. Users generally try to walk through the vignettes, and if some are static and never tested, the likelihood of them 'not working' is high.
Another option is to make a workflow. These are longer running examples and intended to be closer to a real analysis: http://www.bioconductor.org/help/workflows/
Valerie
Hi @rtraborn , We're swapping reviewers and @lawremi will be completing this review. Please let him know when you've addressed the issues above and you're ready for him to take another look. Thanks. Valerie
Hi @vobencha : Thanks for letting me know, and for all of your help up to this point. During the past two weeks we've made a number of updates on our devel branch to address the concerns that you and @lawremi raised, and have one small change left to make before we are ready for another review. I'll let you know when this is ready. Thanks again, Taylor
Hello @lawremi:
We recently pushed the requested TSRchitect updates to our master branch, so we're ready for it to be given another look when you have the chance.
Thanks again for your help.
Cheers,
Taylor
It looks like you still need to address the side-effect issues. initializeExp() should not make assignments in the user's workspace. The entire API needs substantial refactoring to become idiomatic R. One approach would be to have constructors for individual components and then assign the objects with setters (foo<-) into the umbrella object.
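A minimal sketch of that pattern, assuming a bamData slot on tssObject (the validity check is an extra assumption):

setGeneric("bamData<-", function(x, value) standardGeneric("bamData<-"))
setReplaceMethod("bamData", "tssObject", function(x, value) {
    x@bamData <- value
    validObject(x)   #keep the umbrella object internally consistent
    x
})

tss <- tssObject()       #constructor for the umbrella object
bamData(tss) <- alnList  #'alnList': a GAlignmentsList already in the workspace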
Your concerns with initializeExp() are well taken.
As we understand it, the problem is that experimentName is assigned in the user's workspace. Would this problem be avoided by invoking the function in the following way (e.g.):
foo <- initializeExp(expTitle, isPairedEnd)
where a newly-constructed object, foo, of class tssObject is created?
Alternatively, could we use our recently-added function tssObject() -> foo and add other constructors to populate the slots as needed?
Also, could you suggest a good example of an S4 constructor method from a Bioconductor package? We found DGEList(), from the edgeR package, found here, that seems to be consistent with the idiomatic R practices you mention.
Yes, these would be appropriate directions. You can also look at e.g. IRanges or GenomicRanges for inspiration.
Those examples were helpful- thank you.
We overhauled the API for initializing a tssObject and eliminated the use of side-effects. To this end, initializeExp and two other functions were replaced by a new constructor function, loadTSSobj.
These changes, along with corresponding updates to the documentation, were pushed to our master branch, so it's ready for another look at your convenience.
Thanks again,
Taylor
Ok, great, this is a step in the right direction. The next phase should be improving integration with existing Bioconductor data structures. The high-level workflow object (tssObject) is useful for running end-to-end, starting at the file system, but the user will want more flexibility.
See the BamViews object in Rsamtools. It can hold a list of BAM files with sample metadata. One could also imagine using a BamFileList and its mcols() for the metadata. The name of the bamToTSS() function is confusing. It's effectively loading the reads into GRanges objects? In what situation would the user run that without calling processTSS() (which should probably be called countTSS())?
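A sketch of the BamFileList idea (file names and metadata columns are illustrative):

library(Rsamtools)
bfl <- BamFileList(c("H460-rep1.bam", "H460-rep2.bam",
                     "HT1080-rep1.bam", "HT1080-rep2.bam"))
#attach sample-level metadata alongside the files
mcols(bfl) <- DataFrame(sampleID  = c("H460", "H460", "HT1080", "HT1080"),
                        replicate = c(1L, 2L, 1L, 2L))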
What is the benefit of having so many separate steps in the workflow? If there are decision points, the vignette does not make this clear. You could use the parameter object pattern and have an object that holds all the options for the workflow. The user creates it and passes the object to a high-level function that carries out many of these steps automatically. This would help, for example, with determineTSR() and addTagCountsToTSR(), since they appear to share some parameters.
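A minimal sketch of the parameter object pattern (the class, slot, and driver names here are hypothetical, not TSRchitect's API):

setClass("TSRParam", slots = c(tagCountThreshold = "integer",
                               clustDist         = "integer",
                               writeTable        = "logical"))
params <- new("TSRParam", tagCountThreshold = 25L, clustDist = 20L,
              writeTable = FALSE)
#a high-level driver could then run determineTSR(), addTagCountsToTSR(), etc.:
#tssObj <- runTSRworkflow(tssObj, params)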
Steps like importAnnotationExternal() seem unnecessary. Why not have the script use standard Bioconductor import routines and pass the annotation to addAnnotationToTSR()? If your goal is to track provenance, it's probably most pragmatic to just save the script, not try to capture everything at runtime, at least until we figure out how to capture provenance in the standard data structures.
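For instance, the import itself is one standard rtracklayer call (the file name is the one from the vignette):

library(rtracklayer)
gff <- import("gencode.v19.annotation.gff3.gz", format = "gff3")
#then pass 'gff' to addAnnotationToTSR() rather than a file path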
Btw, I glanced at the code. The use of the right arrow (->) makes the code really hard to read. Would you please convert everything to use <-?
I am getting rather confused about the procedure of getting our code/workflow out to the community. We have spent many hours now re-writing the code in this way and that, all the while checking that it produces exactly the same output as before. At what point can we declare it useful as it is: follow the simple steps explained in our documentation and you will get the results you want (as a biologist)? We have analyzed published data sets (CAGE, PEAT, RAMPAGE), we have analyzed our own new data sets - and all of these give correct and highly interesting results, as intended. Obviously there may be further development to be done on the coding end (including our envisioned input from processed cDNA/EST data), but I am completely convinced that the current code as is serves its purpose for genomics research.
I have had a lot of experience publishing bioinformatics software, and a lot of it, frankly, has been quite useful to the community. My friends in the software engineering community often comment on their plight of ever-changing targets.
Thus, can we please agree on a path to make our code available to the community as soon as possible? We'd be happy to continue working on refinements of the code base, but in the meantime we'd like to get this out to the biologists needing this. I think they'd be more interested in getting results from their data than whether they should import alignments in this way or that.
This is not meant to take away from the useful comments now and in the past - clearly our code has been improved because of that. But I would like to argue in favor of serving the biology user community, which can get very useful results with the current code, with minimal effort on their end providing input to the workflow.
There are plenty of ways of making code available to the community outside of the Bioconductor project. Bioconductor's goal is interoperable software. All submissions are held to similar levels of scrutiny, despite how useful they might be. Thanks for making the improvements so far. Perhaps it would be easier to make these changes if there were automated tests in place.
The vignette shows a workflow starting from files, rather than Bioc data structures, and ends with a call to getTSRData(), the return type of which is essentially undocumented. Besides requiring the user to refer directly to slots in the object (a risky exposure of an implementation detail), the slots themselves are of type "list", so it is really hard to figure out what they are without actually running the code. It looks like they are supposed to be data.frames. Why not return a GRanges, since they are regions? As it is, it's files in, files and data.frames out. Where is the integration? Also, the vignette should describe the structure of the results and give some examples of interpreting them.
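If those list elements are data.frames of regions, the conversion is a single call; the column names below are assumptions about TSRchitect's output, not its documented format:

library(GenomicRanges)
tsrGR <- makeGRangesFromDataFrame(tsrDF, keep.extra.columns = TRUE,
                                  seqnames.field = "seq",
                                  start.field    = "start",
                                  end.field      = "end",
                                  strand.field   = "strand")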
Michael: We are all for adherence to standards. The problem I see is one of communication. We don't understand precisely what changes you consider necessary. We use a lot of Bioconductor packages, and in our code development we have sought to follow similar protocols as in packages we use. For example, the input function loadTSSobj() is modeled closely after methRead() in backbone.R of the methylKit package - same idea: you collect input from files, provide some annotation, and the function returns an object to be worked on. What's wrong with that?
Concerning the examples, we have made the adjustments to be able to run the vignette example in less than 5 minutes. The example does not end with a call to getTSRData(), but with the interpretation of the results. I would agree that this looks boring, but our understanding was that we should only have a minimal example in the vignette to show the mechanics of the program. To show our real examples (which we had moved from our original vignette), we use a trick from the edgeR package and provide the function TSRchitectUsersGuide(), which brings up realistic usage examples. Again, the examples end with interpretation. Furthermore, the output of TSRchitect can feed into programs like edgeR (which again has prominent user examples of reading input from files precisely as TSRchitect puts out) for differential expression analysis.
So could we agree on a plan forward? Shall we add more detail to the TSRchitectUsersGuide to show the connection to edgeR, for example? And/or discuss more precisely how TSRchitect detects promoters, their association with known genes, shape index, alternative promoter usage, etc.? All of that is of course in our manuscript to be submitted. Our hope is to get the code available in Bioconductor to make it easy for readers/users to install and use the code, and to connect it to other Bioconductor packages they may be familiar with.
Thanks, Volker
Necessary changes:
(1) Use <- instead of ->, which really obfuscates the code.
(2) tssObject() should not tell the user that it created an object in the workspace, since it no longer does that.
(3) Use import(BiocGenerics) and import(methods) rather than worrying about importing individual symbols from those two packages.
(4) Do not wrap the body of a generic definition in { }, because the methods package will take that as a non-standard generic.
(5) Use match.arg(slotType); the other arguments are redundant.
(6) Use vignette() to implement TSRchitectUsersGuide(); not sure if that function is even necessary, compared to a direct vignette() call.
(7) addAnnotationToTSR(): compute idvec with mcols(regionOfInterest)[[featureColumnID]], not with eval().
(8) Use message() instead of cat(), except in show() methods.
(9) Use makeGRangesFromDataFrame() where appropriate for clarity.
(10) bamToTSS(): use lapply() instead of for() there; why not just do granges() on the GAlignments to get the GRanges?
(11) tssChr(): use split() to create the list of starts split by strand. In fact, acquireTSS() should just split() by the combination of sequence and strand.
(12) countsToVector(): this entire function could be implemented by one call to rep().
(13) writeTSR() should use rtracklayer for writing the BED file.
(14) mergeSampleData() should use lapply() and do.call(rbind, .) instead of for().
(15) prcTSS() should just pass writeTable to the writeDF arg of tagCountTSS().
(16) tsrToDF(): replace the outer for() with lapply() and do.call(rbind, .); the inner for() calls are probably vectorizable, with some help from vapply(). They should not put the data into a matrix (and thus require later coercion). Instead, compute the columns in vectorized fashion and pass them to data.frame().
(17) processTSS() creates a MulticoreParam but does not pass it to bplapply(). Why condition on n.cores there? bplapply() should do the right thing with one core. The same holds true for determineTSR(). It should compare writeTable to TRUE, not "TRUE".
(18) importAnnotationExternal() does not need to condition on the fileType; just pass it as the format argument to rtracklayer::import(). This begs the question, though: why not just have the user do the importing (or come up with their own GRanges, not directly from a file) and give the user a setter for the annotation slot? Basing everything on files completely misses the point of Bioconductor.
Dear Michael: Thank you for your careful review of our code, and for the detailed suggestions that you provided, which have already improved TSRchitect in many ways. We have completed a thorough update of TSRchitect to address the items you raised during your most recent code review. Briefly, the changes are as follows:
- Calls to cat() were replaced by message() (excepting show() methods).
- lapply() is now used in many instances.
- All instances of -> were changed to <-.
- do.call(rbind, ...) was inserted into our code where appropriate.
- writeTSR() now uses the function export.bed from rtracklayer.
- A new function, createSummarizedExperiment(), was added, which creates a SummarizedExperiment object from selected TSS count data.
Thanks again for your help, and please let me know if you have any questions about our most recent changes.
Best regards,
Taylor
Thanks for addressing my concerns. Please set up the hook so that the issue bot can perform the tests. If everything builds and checks out, I'll mark this as accepted.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/TSRchitect_buildreport_20170418161416.html
Hi Michael: Great- thanks so much! The hook has been set up to facilitate automatic builds. Best regards, Taylor
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/TSRchitect_buildreport_20170419131348.html
Hi Michael:
The latest build appears to have been successful; the only warning (triggered by BiocCheck: "WARNING: Update R version dependency from 3.3.0 to 3.4.") is puzzling, because the latest version of R is 3.3.3 (I suspect this is referring to the version of Bioconductor).
I could remove this line entirely from the DESCRIPTION, but will leave it as is for the time being.
Thanks,
Taylor
It is indeed referring to the version of R. I think the rationale is that it's only been tested on R 3.4, so to be safe, the dependency should be there. I'm not a big fan of that one.
Thanks for clarifying. So if I add Depends: R (>= 3.4) to the DESCRIPTION, what would that mean for users? I'm not currently using 3.4 (I'm using 3.3.3), so I'm not quite sure how to proceed if I'm releasing something that depends on a version that I haven't tested.
Well ideally (1) you would be running R devel, as that is an expectation of maintaining a package in the current Bioc devel, and (2) you would have automated tests that would be running on the Bioc build machines (running 3.4 in devel). Since those two things are not true, you are at risk, and BiocCheck is pointing that out.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/TSRchitect_buildreport_20170419155539.html
Hi Michael: Your points are well taken. I went ahead and updated the package with R 3.4 and TSRchitect built on all three systems without issue. Best regards, Taylor
Thanks, I've marked this as accepted.
Excellent. Thank you so much for your help during this review.
Dear Bioconductor Team:
We submit our package, TSRchitect, for inclusion in the Bioconductor package repository.
Best regards,
Taylor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
I am familiar with the essential aspects of Bioconductor software management, including:
[x] The 'devel' branch for new packages and features.
[x] The stable 'release' branch, made available every six months, for bug fixes.
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.