Closed Liubuntu closed 7 years ago
Hi @Liubuntu
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: SeqSQC
Title: A bioconductor package for sample quality check with next generation sequencing data
Version: 0.99.0
Authors@R: person("Qian", "Liu", email = "qliu7@buffalo.edu", role = c("aut", "cre"))
Description: The SeqSQC is designed to identify problematic samples in NGS data,
including samples with gender mismatch, contamination, cryptic relatedness, and
population outlier.
biocViews: Experiment Data, Homo sapiens Data, Sequencing Data, Project1000genomes
Depends:
R (>= 3.3.0)
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
VignetteBuilder: knitr
Imports:
e1071,
GenomicRanges,
GGally,
gdsfmt,
ggplot2,
IRanges,
methods,
rbokeh,
RColorBrewer,
reshape2,
rmarkdown,
S4Vectors,
SNPRelate
Suggests:
BiocStyle,
knitr
Your package has been approved for building. Your package is now submitted to our queue.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20170828161653.html
Please work on correcting the ERRORs, Warnings, and Notes from the build report. Also please ensure that a webhook is initialized so that on subsequent version bumps a new build of the package will be generated.
Received a valid push; starting a build. Commits are:
9f5a740 R CMD check/BiocCheck errors and warnings.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20170829112019.html
Received a valid push; starting a build. Commits are:
5d17303 subscribe to bioc-devel mailing list
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20170829114042.html
Hi Lori,
I have just build the "SeqSQC_0.99.2" and have passed check and BiocCheck with no error or warning. Please help me with the next step. Thank you!
Qian
Hi Qian, Some comments/questions below:
BUILD REPORT:
?globalVariables
in R for helpVIGNETTE:
source("https://bioconductor.org/biocLite.R")
biocLite("SeqSQC")
GENERAL QUESTION: Could you do a quick comparison to genotypeeval? They may not be similar at all but also does some sort of QC with VCF.
MAN:
R Code:
Please address the above issues and when you are ready for a second review please comment back here with what has been updated or why things haven't and/or some clarification on above concerns.
Cheers Lori
Received a valid push; starting a build. Commits are:
e180459 added zzz.R
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20170915232453.html
Please clean up the R directory. It looks like you uploaded some temp files into the directory (#LoadVfile.R#, .#LoadVfile.R , zzz.R~ ) .
Received a valid push; starting a build. Commits are:
9ce2d4c add test units; modified NEWS.
HI Lori,
I just pushed a new version of 0.99.4. Here are answers to 2 of your review questions:
<1>. COMPARISON with "genotypeeval". The genotypeeval could evaluate sample quality with the VCF files (in batch). By loading some reference “gold” standards such as 1000 Genomes or dbSNP, it evaluates the samples for ti/tv ratio in and out of the coding regions, and the total # of calls (or specific # of homozygous/heterozygous alternative calls), and ancestry estimation, and etc.. However the package of "genotypeeval" is different from "SeqSQC" in these 2 major ways: 1. "genotypeeval" takes vcf files as input, and processes directly with the vcf files. Given that size vcf files from NGS (whole-exome / whole-genome) are in Gbs for human genome, it would be not efficient, especially when processing with batch vcf files. SeqSQC take the vcf files as input, merge it with the benchmark data we have assembled from 1000 Genomes Project, and save the merged dataset as "SeqSQCclass" for following analysis. The "SeqSQCclass" uses GDS (Genomic Data Structure) to save and access the genotype and phenotype data in a more efficient way (3~5 times comparing to vcf), and could also incorporate all QC results in the same data for a more convenient processing and management. 2. Sample quality issues are very general and could come from different reasons, for example, self-identified gender problem, sample contamination, unported sample relatedness, and self-identified ancestry issue. "genotypeeval" evaluates sample quality mostly from the general aspect of # variants comparing to the reference data to indicate sequencing error (sample quality issue from sequencing end, not sample itself). it makes ancestry estimation for samples and gives the value of proportion to each ancestry, which is not directly understandable to users. However, SeqSQC could identify problematic samples from all possible sources, including inbreeding outlier, population outliers and samples with gender mismatch, cryptic relatedness. The plots of study samples together with benchmark samples from the same population group give an direct and interactive idea of how these samples are identified as problematic. <2> R CODE: Q: Can your SeqSQCclass extend SummarizeExperiment? A: No. The SummarizeExperiment stores count data from RNAseq dataset. The SeqSQC stores and processes genotype data from whole-genome sequencing or whole-exome sequencing, which are extremely high-dimensional and requires large R space for processing. I used GDS (Genomic Data Structure) within the SeqSQCclass to achieve efficient genotype data storage and access. The other updates are included in the "NEWS" file. Please check. Thanks. Best, QianDear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171004123535.html
No, SummarizedExperiment is not just for storing count data from RNA-seq. It's designed exactly for this use case. It would probably benefit from a GDS backend to DelayedArray, if one does not yet exist.
Sorry for the uncareful phrasing... To re-answer the question (Can your SeqSQCclass extend SummarizeExperiment?), I have some core functions within my package that are based on the GDS format, but the users are not required to know the operation of GDS files in R. So I have defined a new class SeqSQCclass to store the file path of the main gds file where we store the genotype information from sequencing data, and the sample QC results, like sex check, inbreeding and IBD coefficients, so that uses could easily read out the sample QC results from the SeqSQCclass object.
Because the GDS data stores the high-dimensional genotype data from sequencing experiments, and it allows users to perform operations on it without loading the whole data matrix in memory. I was not able to figure out if the SummarizedExperiment could support the GDS data structure. If yes, I would love to update in the future version of my package extending to SummarizedExperiment.
Received a valid push; starting a build. Commits are:
354a9fe debugged R CMD BiocCheck;
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171004183449.html
It should be possible to implement DelayedArray on top of GDS; then you just stick that matrix into the SummarizedObject object. I also wonder whether you couldn't use the objects from the SeqArray package, which already wrap GDS objects, although not as a SummarizedExperiment.
Received a valid push; starting a build. Commits are:
5992bf7 Installed BiocGenerics locally.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171005111346.html
Hi Michael,
Thank you for your insightful suggestions and comments. I have looked at SeqArray during my development of SeqSQC and cited it in my paper (yet to submit) for SeqSQC. It is good at dealing with high-dimensional sequencing data, and could at the same time save the annotation info for both variants and samples. It is developed on top of GDS (same developer).
While the main purpose for my package SeqSQC is to implement sample quality check with the genotype information from sequencing data. Functions are written to take the vcf or plink (next version) as input and save them in an efficient data structure, and then use the new data structure as input for the later QC functions. The SeqSQCclass contains 2 slots, 1 is the filepath to the gds file where the genotype data is stored, and the other slot is a list containing sample annotation (population, gender, relationship to other samples), and QC results like sex check and etc.. This structure is easy to deal with, and the users are not supposed to do any operation on the gds file itself. All QC steps will read in the SeqSQCclass object, extract information from the gds file in the backend for QC purpose, output QC result and append it to the second slot in SeqSQCclass object.
The other reason for defining the SeqSQCclass is that, we need a strict format for the input files. For example, we need the input of sample population to be strictly within AFR, EUR, ASN, EAS, SAS. Anything beyond these would return an error. So by defining the new class, we could do the validity test for these information.
As a summary, the SeqSQCclass object serves 3 purposes, one is to protect the gds file to be intact (by only including the filepath); the other is to save the QC results in a easily readable way; last, it could do the validity test for input information.
Best, Qian
It seems like the first concern may be satisfied by the data structures from SeqArray. Perhaps SeqSQC could have a slot for one of those, and another slot for the QC information. The population constraint seems to have more to do with the input rather than an object containing the computed QC information. Are you sure the two should be coupled? Also, it seems redundant to have a class with "class" in its name.
Hi Michael,
Thank you for your suggestion for removing "class" in the name of "SeqSQCclass". I will correct it and push another version.
For the question of coupling of sample annotation and QC results in SeqSQCclass object:
SeqSQC has these plot functions corresponding to each QC step (plotMissingRate, plotSexCheck...). It take SeqSQCclass objects as input, and read both of the sample annotation and the QC result for plotting. By including the sample annotation file in the SeqSQCclass object, it could make the checking process easier for the plot functions, by only checking if the input is a valid SeqSQCclass object.
Best, Qian
Some further comments:
vignette code:
> seqfile <- sampleQC(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot, format.data = "NGS", format.file = "vcf", QCreport = TRUE, out.report="report.html", interactive = TRUE)
Error in sampleQC(vfile = infile, output = outfile, capture.region = cr, :
object 'seqfile' not found
> seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'QCresult' for signature '"SeqSQCclass"'
MAN:
R CODE:
Note: I may make comments for one R script but should be generally applied to all R scripts:
subsetGDS
sampleQC
problemList
general about plotting functions:
mergeGDS
general
[ ] if you are constantly reading in and utilizing different nodes of the gdsn file have a wrapper function that stores the output and access from that
function(gfile){
obj <- list(read.gdsn(index.gdsn(gfile, "sample.id")),
read.gdsn(index.gdsn(gfile, "sample.annot"))
obj
}
you do this a lot in the merge and I'm seeing again in sexcheck - and im sure it lots of other places of code Like this section is repeated in several files can it be condensed into its own
gfile <- SeqOpen(seqfile, readonly=TRUE)
samples <- read.gdsn(index.gdsn(gfile, "sample.id"))
sampleanno <- read.gdsn(index.gdsn(gfile, "sample.annot"))
snp.chr <- read.gdsn(index.gdsn(gfile, "snp.chromosome"))
snp.pos <- read.gdsn(index.gdsn(gfile, "snp.position"))
snps <- read.gdsn(index.gdsn(gfile, "snp.id"))
## sample filters. (remove prespecified "remove.samples", and only keep samples within the study population)
study.pop <- unique(sampleanno[sampleanno$group == "study", "population"])
if(length(study.pop) > 1) stop("Study samples should be single population, please prepare input file accordingly.")
if(study.pop == "ASN") study.pop <- c("EAS", "SAS", "ASN")
LoadVfile
IBDRemove and IBDRemoveAll
Please make updates or address these issues in some way and comment back here with changes and when ready for another review.
Thank you ~L
Hi Lori,
Thank you for the second review. I will address these issues and hopefully get back to you by tomorrow.
Best, Qian
Received a valid push; starting a build. Commits are:
cffacae 2nd review
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171015235112.html
HI Lori,
I have pushed a new version of SeqSQC, with the following revisement:
To answer some of the other questions:
A: The sampleQC is written to do the one-step sample QC to include data preparation, all QC steps and QC result summary, and to make a complete and reliable sample QC for the NGS data, which is essential for cleaning samples before any downstream analysis. There are also some separate functions for each QC step and plotting if the user want to skip or do any specific ones. There are also already 13 arguments in the "sampleQC" function, I can add some additional arguments to let the user define which one to skip, but not quite sure if they are really needed.
A: Yes.. Because the result data from sex check is written to have the column as "female/male" rather than other format.
A: There are 4 population from our benchmark data (assembled from 1000 Genomes Project), which are EUR, AFR, EAS (east Asian) and SAS (south Asian). So the Asian population were represented by 2 detailed EAS and SAS separately. If user have samples from Asian, we will use both of the EAS and SAS benchmark samples to assist the QC steps.
Please let me know if you have any questions or concerns. Thanks a lot!
Best, Qian
I think we are close ... A few very minor vignette and man pages updates - and some optional suggestions for future implementations:
Vignette:
[ ] When I run the vignette code - all the output is placed in the current directory as it uses the dirname(outfile) and outfile ="testWrapUp" - we recommend using a tempdir() for vignette output
outdir = tempdir()
outfile = file.path(outdir, "testWrapUp")
also it seems like the argument output is the name of the file for an output object and the dirname of that is used as output directory for the rest of the files - this should be made more clear in the argument description or the output directory should be the directory location - and have a default name for the object like you do for the other txt and pdf output files
[ ] Optional future implementation - I still think there should be a flag for whether the write tables and pdfs get generated since it seems like these could be generated at any time in the future with the object - there doesn't have to be one for each step just a generic argument like plotting=TRUE, results=TRUE
knowing that all the steps will follow the same argument.
[ ] GDS Class "Other information from the VCF file, including the chromosome, position, snp rs id, reference allele, alternative allele, quality score and filter can also be passed into the gds file." How? Is this stored in your object? or is just an option based on using the gds.class?
[ ] Standard workflow - I would mention that running SampleQC runs all of these steps as a convenient all inclusive method but that these are the steps and recommended ordering if run individually and not using SampleQC ?
Man: Thanks for adding the package documentation - add a seealso and/or mention the main functions of your package sampleQC
R Code Plotting functions - I like the one exposed plotting function this is good!
If you fix the vignette and man pages I think this is ready for acceptance and first implementation but again consider the code improvements suggested by Michael and myself for further improvements.
When ready for a look over - please do the version bump and comment back here with updates and answers to questions.
Received a valid push; starting a build. Commits are:
9d25b5e tempdir() for vignette output; updated argument 'o...
Hi Lori,
Thanks a lot for your very detailed review and practical suggestions. I have addressed most of the issues including:
dirname(output)
would be used as the directory to save the other QC results in .txt
and plots in .pdf
. The default value is added as file.path(tempdir(), "sampleqc")
.plotting=TRUE, results=TRUE
added and documented, giving the option for users to whether or not output the QC results and plots. The problem list will always in output for the simplest sample QC result and summary.@seealso
for links of main functions of SeqSQC. To answer some of your other questions:
Answer: The meta info from vcf files are read into the gds file by \code{snpgdsVCF2GDS} within \code{LoadVfile}. They are saved in the gds file, with the filename kept as slotName \code{gdsfile} in the SeqSQC object. The reading is not optional. We used these snp information to merge the study cohort and benchmark data.
Answer: - the underlining plotting functions are written for different data sets generated from different QC steps. The data structure are slightly different, (except for IBD and PCA results which have more complex structure and plotting options). I will try to generalize the data structure of QC outputs and also the plotting functions in the next version, when I would possibly adopt the SeqArray or delayArray for NGS data storage and processing.
Please review the package and let me know for any further question. Thanks again!
Best, Qian
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171016123535.html
The default value is added as file.path(tempdir(), "sampleqc"). - It probably should default to the users working directory when setting the default within the function but you would want to use the tempdir() in the vignette. The tempdir creates a temporary directory that disappears after the R session is terminated - this is appropriate for the examples run in the vignette but probably not as a default for your main functions if the user forgets to specify (most assume the working directory) - so the default can just be "sampleqc"
The other changes look good.
Received a valid push; starting a build. Commits are:
ba50e9b argument of 'output' in 'sampleQC' and 'LoadVfile'...
Don't forget to update the man files too - its causing the mismatch in documentation/code
Hi Lori,
Thanks for your suggestion. I have just updated the default value for output
to be in the working directory. It makes more sense right now.
output
argument in function of sampleQC
and LoadVfile
, by using the default value of \code{sampleqc} in working directory. In vignette and the example code of these 2 functions, temporary directory is used for the directory to save the QC results.Thanks. Qian
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171016143913.html
Received a valid push; starting a build. Commits are:
27aef94 documentation updated for updated argument 'output...
Sorry for the code/documentation mismatches... updated the documentation. Thanks.
No problem. I'll wait for the build report but as long as its clean now I'll accept the package.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171016145525.html
Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be sent to the maintainer email address in the next several days.
Thank you for contributing to Bioconductor!
Hi Lori & Michael,
Thank you for your help and insightful comments and suggestions about SeqSQC. It works with NGS data for sample quality check quite efficiently right now. but still I am looking to update the package in the newer version to:
Thanks again and your comments and suggestions are greatly appreciated.
Best, Qian
The master branch of your GitHub repository has been added to Bioconductor's git repository.
To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/Liubuntu.keys is not empty), then no further steps are required. Otherwise, do the following:
See further instructions at
https://bioconductor.org/developers/how-to/git/
for working with this repository. See especially
https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/
to keep your GitHub and Bioconductor repositories in sync.
Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at
https://bioconductor.org/checkResults/
(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using biocLite(\"SeqSQC\")
. The package 'landing page' will be created at
https://bioconductor.org/packages/SeqSQC
If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
I am familiar with the essential aspects of Bioconductor software management, including:
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.
SeqSQC is an Experiment Data package.