Closed rujinwang closed 4 years ago
Hi @rujinwang
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: SCOPE
Type: Package
Title: A normalization and copy number estimation method for single-cell DNA sequencing
Version: 0.99.0
Author: Rujin Wang, Danyu Lin, Yuchaojiang
Maintainer: Rujin Wang <rujin@email.unc.edu>
Description: Whole genome single-cell DNA sequencing (scDNA-seq) enables characterization of copy number profiles at the cellular level. This circumvents the averaging effects associated with bulk-tissue sequencing and has increased resolution yet decreased ambiguity in deconvolving cancer subclones and elucidating cancer evolutionary history. ScDNA-seq data is, however, sparse, noisy, and highly variable even within a homogeneous cell population, due to the biases and artifacts that are introduced during the library preparation and sequencing procedure. Here, we propose SCOPE, a normalization and copy number estimation method for scDNA-seq data. The distinguishing features of SCOPE include: (i) utilization of cell-specific Gini coefficients for quality controls and for identification of normal/diploid cells, which are further used as negative control samples in a Poisson latent factor model for normalization; (ii) modeling of GC content bias using an expectation-maximization algorithm embedded in the Poisson generalized linear models, which accounts for the different copy number states along the genome; (iii) a cross-sample iterative segmentation procedure to identify breakpoints that are shared across cells from the same genetic background. We evaluate performance of SCOPE on real scDNA-seq data sets from cancer genomic studies. Compared to existing methods, SCOPE more accurately estimates subclonal copy number aberrations and is shown to have higher correlation with array-based copy number profiles of purified bulk samples from the same patient. We further demonstrate SCOPE on three recently released data sets using the 10X Genomics single-cell CNV pipeline and show that it can reliably recover 1% of the cancer cells from a background of normal.
Depends: R (>= 3.6.0), GenomicRanges, IRanges, Rsamtools, BSgenome.Hsapiens.UCSC.hg19, GenomeInfoDb
Imports: stats, grDevices, graphics, utils, DescTools, RColorBrewer, gplots, foreach, parallel, doParallel, DNAcopy, BSgenome, Biostrings, BiocGenerics
Suggests:
knitr,
rmarkdown,
WGSmapp,
testthat (>= 2.1.0)
VignetteBuilder: knitr
biocViews: SingleCell,
Normalization,
CopyNumberVariation,
Sequencing, WholeGenome,
Coverage,
Alignment,
QualityControl,
DataImport
License: GPL-2
LazyData: true
RoxygenNote: 6.1.1
Encoding: UTF-8
AdditionalPackage: https://github.com/rujinwang/WGSmapp
Can't build unless issue is open and '2. review in progress' label is present, or issue is closed and 'TESTING' label is present.
A reviewer has been assigned to your package. Learn what to expect during the review process.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
AdditionalPackage: https://github.com/rujinwang/WGSmapp
Dear @rujinwang ,
You (or someone) has already posted that repository to our tracker.
See https://github.com/Bioconductor/Contributions/issues/1239
You cannot post the same repository more than once.
If you would like this repository to be linked to issue number: 1242, Please contact a Bioconductor Core Member.
Dear Bioconductor Core Member @hpages , I would like this repository to be linked to issue number #1239. The software package SCOPE has the companion experiment data package WGSmapp used for illustrative purposes in the vignette. Otherwise the build of SCOPE will get ERRORs. Could you please help me on this? Thanks.
Rujin
Hi Rujin,
Thanks for your submission.
You don't need to open a new issue for companion data packages so please close issue #1239.
All you need to do is add the companion package as an additional package to the issue you already opened for the software package (i.e. this issue). Which is what you did 2 days ago (see https://github.com/Bioconductor/Contributions/issues/1242#issuecomment-529566331).
Looks like our Single Package Builder (SPB) failed to install WGSmapp before trying to run R CMD build
on SCOPE. Not sure why. Maybe @lshep can chime in. But before that, please close issue #1239 and bump SCOPE version (to 0.99.1) to trigger a new build by the SPB and we'll see what happens.
Thanks, H.
Received a valid push; starting a build. Commits are:
b129572 bump to version 0.99.2
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Hi Herve @hpages , Thanks so much for your reply. I just closed issue #1239 and triggered a new build. It seems the companion data package WGSmapp fails to be built before running SCOPE. In the previoulsy opened issue #1239 , WGSmapp can be built without any errors. Here I directly use library(WGSmapp) in the vignette of SCOPE. Do I need to use devtools::install_github("rujinwang/WGSmapp") instead?
best, rujin
Do I need to use devtools::install_github("rujinwang/WGSmapp") instead?
No because it's considered bad practice to install packages in a vignette. And having a vignette that installs stuff from GitHub is even worse.
I guess it's time to ask help from @lshep . Lori?
Thanks!
Lori (@lshep) contacted me privately to let me know that she'll take care of this tomorrow. Thanks for your patience.
Sounds good. Thank you for your help!
I've kicked off a manual build of this package and I believe corrected it in our database to be associated with this issue number. Now as long as the webhook is set up on BOTH repositories - either one should trigger a build on version bump and report the status here. Note: it will only build for the package that the version bump occurred not both package. This should also allow both packages to find each other. If there are any further issues with this please let me know - otherwise I'll leave it back to @hpages to do a formal review of your packages when the ERRORS are cleared.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
ddf5a61 bump to version 0.99.3
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Yes, ERRORs are all cleared now! Thank you @Ishep and @hpages .
Received a valid push; starting a build. Commits are:
96d59b4 add .bed file for hg38
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
43af6bd add seg regions for hg38
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
aa7dec6 incorporate hg38gaps.txt
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
1afef1f Enable user-define bin length and offer SoSplot
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Hi @rujinwang ,
I see you've been actively working on SCOPE and the package now passes R CMD check
and R CMD BiocCheck
without errors or warnings. Thanks for that. Please let me know if the package is now ready for review.
Best, H.
Hi @hpages , Yes, please go ahead to review the package. Thanks!
Best, Rujin
Hi Rujin,
I've taken a first look at the package and it has usability and reliability issues that are too serious for inclusion in Bioconductor. I bumped into these issues at the very beginning of your workflow, only after using the first two functions used in the workflow (getbambed_scope
and getmapp
). I could keep going e.g. getgc_scope
has issues similar to getmapp
and so on... (and why would one function be suffixed with _scope
and not the other one?) but I'll stop here because my time is limited. If you want to pursue this submission, a lot of work will need to be done on the package to make it more usable and more reliable.
Regards, H.
The "1.2 Bioinformatic pre-processing" section of the vignette shows the use of command-line tools java -jar picard.jar
and split_script.py
. Are these tools available somewhere? Are they documented? What do they do exactly? If the user doesn't have access to them, what is the point of this section?
Your function naming style is very inconsistent, even for the user-facing functions: you use camel caps (getGini
), snake case (getgc_scope
), dot-separated (getcoverage.scDNA
), other unidentified styles (e.g. choiceofT
). Please choose one style and stick to it. Also, a name like getbambed_scope
is poor style. It's considered good practice to separate all the words, not just some of them. So a better name would be something like get_bam_bed_scope
(snake case) or getBamBedScope
(camel caps).
The getbambed_scope
function is poorly implemented. If you want to generate bins on a given genome, use tileGenome()
from the GenomicRanges package:
bins <- tileGenome(seqinfo(BSgenome.Hsapiens.UCSC.hg19), tilewidth=500000, cut.last.tile.in.chrom=TRUE)
bins
# GRanges object with 6343 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 1-500000 *
# [2] chr1 500001-1000000 *
# [3] chr1 1000001-1500000 *
# [4] chr1 1500001-2000000 *
# [5] chr1 2000001-2500000 *
# ... ... ... ...
# [6339] chrUn_gl000245 1-36651 *
# [6340] chrUn_gl000246 1-38154 *
# [6341] chrUn_gl000247 1-36422 *
# [6342] chrUn_gl000248 1-39786 *
# [6343] chrUn_gl000249 1-38502 *
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
If tileGenome()
doesn't cover you needs, you are welcome to provide your own function (could be a wrapper around tileGenome()
) but please note that: (1) using a precomputed .bed
file for that isn't necessary and is a bad idea, and (2) creating one bin at a time in a while
-loop is also a bad idea because it is very inefficient.
Having functions defined in your package referencing variables defined in the global environment is bad software design:
getmapp(ref_raw, genome=BSgenome.Hsapiens.UCSC.hg38)
# Error in getmapp(ref_raw, genome = BSgenome.Hsapiens.UCSC.hg38) :
# object 'mapp_hg38' not found
The design of the getmapp()
function makes no sense at all. The user is required to pass a huge BSgenome object to it only to allow the function to know which mappability track to pick up. Why not just pass the mappability track instead? Then you wouldn't have to pick up the mappability track from the global environment. Another advantage to this design is that getmapp()
then would work for an arbitrary genome, not just hg19 or hg38, as long as the user has access to the mappability track for the genome.
Do you realize that the BAM files you use in the "2.1 Pre-preparation" section of the vignette contain alignments against genome assembly hg38 so when you do
bambedObj <- getbambed_scope(bamdir = bamdir, sampname = sampname_raw)
without specifying hgref="hg38"
you get the wrong binning? Also later when you do
data("mapp_hg19")
mapp <- getmapp(ref_raw, genome = BSgenome.Hsapiens.UCSC.hg19)
you get the wrong mappability?
This result is incorrect:
getmapp(GRanges("chr1:10022-10031"), BSgenome.Hsapiens.UCSC.hg19)
# Getting mappability for chr1
# [1] 0.65625
Should be 0.75 (half of the range has a mappability score of 0.5 and the other half a mappability score of 1).
Add DNASeq to the biocViews field.
Dataset mapp_hg19
contains the chromosome lengths but not dataset mapp_hg38
:
seqlengths(mapp_hg19)
# chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16
# 249250621 135534747 135006516 133851895 115169878 107349540 102531392 90354753
# chr17 chr18 chr19 chr2 chr20 chr21 chr22 chr3
# 81195210 78077248 59128983 243199373 63025520 48129895 51304566 198022430
# chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX
# 191154276 180915260 171115067 159138663 146364022 141213431 16571 155270560
# chrY
# 59373566
seqlengths(mapp_hg38)
# chr1 chr20 chr21 chr15 chr8 chr10 chr9 chrX chr3 chr12 chr2 chr11 chr7
# NA NA NA NA NA NA NA NA NA NA NA NA NA
# chr13 chr18 chr5 chr22 chr14 chr4 chr17 chr16 chr6 chr19 chrM chrY
# NA NA NA NA NA NA NA NA NA NA NA NA
Any reason for that?
Coding style: you sometimes use <-
and sometimes =
for assigment. Please always use <-
for assignment (with a space before and after the left arrow).
Received a valid push; starting a build. Commits are:
91f1834 fix NAs in seqlengths(mapp_hg38)
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
35905ff update R version Dependency to 3.6
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
bdcbc90 version bump 0.99.6
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Hi @hpages , Thanks a lot for your review and constructive suggestions. I've made changes accordingly and bumped the package to a new version. Regarding some of your concerns, please see my responses below.
Some examples of usability issues
The "1.2 Bioinformatic pre-processing" section of the vignette shows the use of command-line tools java -jar picard.jar and split_script.py. Are these tools available somewhere? Are they documented? What do they do exactly? If the user doesn't have access to them, what is the point of this section?
The picard.jar is a public available tool for manipulating high-throughput sequencing data, released by Broad Institute. The only self-developed script is split_script.py and now it is pre-stored in the package (inst/docs folder). The aim of this section is to show how the pre-processing bioinformatic pipeline works for users to get preparation of the input for SCOPE (demultiplexed .bam files).
Functions have been renamed using snake case. Now the naming style is consistent.
For the design of function get_mapp()
, it aims to compute mappability for each bin. mapp_hg38
or mapp_hg19
are large-size GRanges Objects pre-stored in the dependent dataset package WGSmapp
, rather than variables defined in the global environment. We developed the dataset package WGSmapp
in particular to offer mappability tracks of whole-genome sequencing.
library(WGSmapp)
data("mapp_hg38")
getmapp(ref_raw, hgref = "hg38")
These pre-stored mappability tracks are downloaded from the ENCODE Project thus can not be directly passed to a function. I'm now passing a simple character indicator, instead of a huge BSgenome object, to allow the function to know which mappability track to pick up. Do you have any comments or suggestions on how to do it in a more efficient way?
Some examples of reliability issues
Do you realize that the BAM files you use in the "2.1 Pre-preparation" section of the vignette contain alignments against genome assembly hg38
How did you realize these BAM files are against genome assembly hg38? I downloaded them from 10XGenomic website, which is public available. Given the information provided here, it's against assembly GRCh37/hg19.
This result is incorrect:
getmapp(GRanges("chr1:10022-10031"), BSgenome.Hsapiens.UCSC.hg19) # Getting mappability for chr1 # [1] 0.65625
Should be 0.75 (half of the range has a mappability score of 0.5 and the other half a mappability score of 1).
The mappability is computed as weighted average of the mappability scores if multiple ENCODE regions overlap with the query genome region. It's not weighted by the width of the query genome region, but rather by the width of mappability track bins. In other word, ranges that only hit regions with mappability scores of 0.5 and 1 will have the same mappability scores, whatever the width of each sub-region is. For example, in your scenario, mappability = (0.511+15)/(11+5) = 0.65625.
Some other minor issues
mapp_hg19
and mapp_hg38
contain chromosome lengths nowYes, we are pursuing the submission. Please feel free to let me know if further modifications or improvements are needed. Thank you for your help!
Best, Rujin
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS, skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Please don't be alarmed about all the build report. And sorry for the noise. We are updating the builders to R 4.0 and Bioc 3.11 - Your package has some special cases that are triggering ERRORs on our end in windows - Unfortunately we don't have a good test case to run separately besides kicking off new builds of your package - Again we apologize for the noise and hope to have it remedied soon.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Hi Rujin,
Thanks for the improvements to the package.
mapp_hg38
ormapp_hg19
are large-size GRanges Objects pre-stored in the dependent dataset package WGSmapp, rather than variables defined in the global environment.
But when you do data("mapp_hg38")
you load the mapp_hg38
dataset in the global environment. So now you have a variable mapp_hg38
defined in the global environment. And your get_mapp()
function expects this variable to be there. If the variable is not there, get_mapp()
won't work (it will fail).
This is fragile. This is why having a function defined in your package referencing variables defined in the global environment is considered bad software design. In other words, get_mapp()
should always be able to find the mapp_hg38
dataset (whether the user previously did data("mapp_hg38")
or not).
These pre-stored mappability tracks are downloaded from the ENCODE Project thus can not be directly passed to a function.
Why can't they? You've saved them as GRanges objects. Why couldn't you pass a GRanges object to a function? E.g.:
data("mapp_hg38")
get_mapp(ref, mapp_hg38)
As easy as that!
How did you realize these BAM files are against genome assembly hg38?
I looked at the chromosome lengths:
> seqinfo(BamFile(bambedObj$bamdir[[1]]))
Seqinfo object with 196 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
chr1 248956422 <NA> <NA>
chr2 242193529 <NA> <NA>
chr3 198295559 <NA> <NA>
chr4 190214555 <NA> <NA>
chr5 181538259 <NA> <NA>
... ... ... ...
chrUn_KI270742v1 186739 <NA> <NA>
chrUn_GL000216v2 176608 <NA> <NA>
chrUn_GL000218v1 161147 <NA> <NA>
chrEBV 171823 <NA> <NA>
hs38d1 10560522 <NA> <NA>
Those are the chromosome lengths of hg38 (see seqinfo(getBSgenome("hg38"))
), not hg19.
It's not weighted by the width of the query genome region, but rather by the width of mappability track bins.
So if you have a range of 1 million bases where only the 1st position overlaps with a bin of mappability 0.5 and the 999999 other positions overlap with a bin of mappability 1, and the two bins have the same size, you're considering that the mappability of your range is 0.75? Even though 99.9999% of the range has a mappability of 1? How does that make sense?
You're not making good use of argument default values. For example why set the default value for the hgref
argument to NULL
(in get_bam_bed()
and get_mapp()
) when the effective default is "hg19"
? Or why set the default value for the resolution
argument (in get_bam_bed()
) to NULL
when the effective default is 500?
Don't put an exclamation mark (!
) at the end of your error messages.
Best, H.
Received a valid push; starting a build. Commits are:
b18823b use lazy-loading
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
I am familiar with the essential aspects of Bioconductor software management, including:
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.