Bioconductor / Contributions

Contribute Packages to Bioconductor
134 stars 33 forks source link

SeqSQC formal review #460

Closed Liubuntu closed 7 years ago

Liubuntu commented 7 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

SeqSQC is an Experiment Data package.

bioc-issue-bot commented 7 years ago

Hi @Liubuntu

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: SeqSQC
Title: A bioconductor package for sample quality check with next generation sequencing data
Version: 0.99.0
Authors@R: person("Qian", "Liu", email = "qliu7@buffalo.edu", role = c("aut", "cre"))
Description: The SeqSQC is designed to identify problematic samples in NGS data,
    including samples with gender mismatch, contamination, cryptic relatedness, and
    population outlier.
biocViews: Experiment Data, Homo sapiens Data, Sequencing Data, Project1000genomes
Depends:
    R (>= 3.3.0)
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
VignetteBuilder: knitr
Imports:
    e1071,
    GenomicRanges,
    GGally,
    gdsfmt,
    ggplot2,
    IRanges,
    methods,
    rbokeh,
    RColorBrewer,
    reshape2,
    rmarkdown,
    S4Vectors,
    SNPRelate
Suggests:
    BiocStyle,
    knitr
bioc-issue-bot commented 7 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20170828161653.html

lshep commented 7 years ago

Please work on correcting the ERRORs, Warnings, and Notes from the build report. Also please ensure that a webhook is initialized so that on subsequent version bumps a new build of the package will be generated.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

9f5a740 R CMD check/BiocCheck errors and warnings.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20170829112019.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

5d17303 subscribe to bioc-devel mailing list

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20170829114042.html

Liubuntu commented 7 years ago

Hi Lori,

I have just build the "SeqSQC_0.99.2" and have passed check and BiocCheck with no error or warning. Please help me with the next step. Thank you!

Qian

lshep commented 7 years ago

Hi Qian, Some comments/questions below:

BUILD REPORT:

VIGNETTE:

GENERAL QUESTION: Could you do a quick comparison to genotypeeval? They may not be similar at all but also does some sort of QC with VCF.

MAN:

R Code:

Please address the above issues and when you are ready for a second review please comment back here with what has been updated or why things haven't and/or some clarification on above concerns.

Cheers Lori

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

e180459 added zzz.R

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20170915232453.html

lshep commented 7 years ago

Please clean up the R directory. It looks like you uploaded some temp files into the directory (#LoadVfile.R#, .#LoadVfile.R , zzz.R~ ) .

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

9ce2d4c add test units; modified NEWS.

Liubuntu commented 7 years ago

HI Lori,

I just pushed a new version of 0.99.4. Here are answers to 2 of your review questions:

<1>. COMPARISON with "genotypeeval". The genotypeeval could evaluate sample quality with the VCF files (in batch). By loading some reference “gold” standards such as 1000 Genomes or dbSNP, it evaluates the samples for ti/tv ratio in and out of the coding regions, and the total # of calls (or specific # of homozygous/heterozygous alternative calls), and ancestry estimation, and etc.. However the package of "genotypeeval" is different from "SeqSQC" in these 2 major ways: 1. "genotypeeval" takes vcf files as input, and processes directly with the vcf files. Given that size vcf files from NGS (whole-exome / whole-genome) are in Gbs for human genome, it would be not efficient, especially when processing with batch vcf files. SeqSQC take the vcf files as input, merge it with the benchmark data we have assembled from 1000 Genomes Project, and save the merged dataset as "SeqSQCclass" for following analysis. The "SeqSQCclass" uses GDS (Genomic Data Structure) to save and access the genotype and phenotype data in a more efficient way (3~5 times comparing to vcf), and could also incorporate all QC results in the same data for a more convenient processing and management. 2. Sample quality issues are very general and could come from different reasons, for example, self-identified gender problem, sample contamination, unported sample relatedness, and self-identified ancestry issue. "genotypeeval" evaluates sample quality mostly from the general aspect of # variants comparing to the reference data to indicate sequencing error (sample quality issue from sequencing end, not sample itself). it makes ancestry estimation for samples and gives the value of proportion to each ancestry, which is not directly understandable to users. However, SeqSQC could identify problematic samples from all possible sources, including inbreeding outlier, population outliers and samples with gender mismatch, cryptic relatedness. The plots of study samples together with benchmark samples from the same population group give an direct and interactive idea of how these samples are identified as problematic. <2> R CODE: Q: Can your SeqSQCclass extend SummarizeExperiment? A: No. The SummarizeExperiment stores count data from RNAseq dataset. The SeqSQC stores and processes genotype data from whole-genome sequencing or whole-exome sequencing, which are extremely high-dimensional and requires large R space for processing. I used GDS (Genomic Data Structure) within the SeqSQCclass to achieve efficient genotype data storage and access. The other updates are included in the "NEWS" file. Please check. Thanks. Best, Qian
bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171004123535.html

lawremi commented 7 years ago

No, SummarizedExperiment is not just for storing count data from RNA-seq. It's designed exactly for this use case. It would probably benefit from a GDS backend to DelayedArray, if one does not yet exist.

Liubuntu commented 7 years ago

Sorry for the uncareful phrasing... To re-answer the question (Can your SeqSQCclass extend SummarizeExperiment?), I have some core functions within my package that are based on the GDS format, but the users are not required to know the operation of GDS files in R. So I have defined a new class SeqSQCclass to store the file path of the main gds file where we store the genotype information from sequencing data, and the sample QC results, like sex check, inbreeding and IBD coefficients, so that uses could easily read out the sample QC results from the SeqSQCclass object.

Because the GDS data stores the high-dimensional genotype data from sequencing experiments, and it allows users to perform operations on it without loading the whole data matrix in memory. I was not able to figure out if the SummarizedExperiment could support the GDS data structure. If yes, I would love to update in the future version of my package extending to SummarizedExperiment.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

354a9fe debugged R CMD BiocCheck;

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171004183449.html

lawremi commented 7 years ago

It should be possible to implement DelayedArray on top of GDS; then you just stick that matrix into the SummarizedObject object. I also wonder whether you couldn't use the objects from the SeqArray package, which already wrap GDS objects, although not as a SummarizedExperiment.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

5992bf7 Installed BiocGenerics locally.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171005111346.html

Liubuntu commented 7 years ago

Hi Michael,

Thank you for your insightful suggestions and comments. I have looked at SeqArray during my development of SeqSQC and cited it in my paper (yet to submit) for SeqSQC. It is good at dealing with high-dimensional sequencing data, and could at the same time save the annotation info for both variants and samples. It is developed on top of GDS (same developer).

While the main purpose for my package SeqSQC is to implement sample quality check with the genotype information from sequencing data. Functions are written to take the vcf or plink (next version) as input and save them in an efficient data structure, and then use the new data structure as input for the later QC functions. The SeqSQCclass contains 2 slots, 1 is the filepath to the gds file where the genotype data is stored, and the other slot is a list containing sample annotation (population, gender, relationship to other samples), and QC results like sex check and etc.. This structure is easy to deal with, and the users are not supposed to do any operation on the gds file itself. All QC steps will read in the SeqSQCclass object, extract information from the gds file in the backend for QC purpose, output QC result and append it to the second slot in SeqSQCclass object.

The other reason for defining the SeqSQCclass is that, we need a strict format for the input files. For example, we need the input of sample population to be strictly within AFR, EUR, ASN, EAS, SAS. Anything beyond these would return an error. So by defining the new class, we could do the validity test for these information.

As a summary, the SeqSQCclass object serves 3 purposes, one is to protect the gds file to be intact (by only including the filepath); the other is to save the QC results in a easily readable way; last, it could do the validity test for input information.

Best, Qian

lawremi commented 7 years ago

It seems like the first concern may be satisfied by the data structures from SeqArray. Perhaps SeqSQC could have a slot for one of those, and another slot for the QC information. The population constraint seems to have more to do with the input rather than an object containing the computed QC information. Are you sure the two should be coupled? Also, it seems redundant to have a class with "class" in its name.

Liubuntu commented 7 years ago

Hi Michael,

Thank you for your suggestion for removing "class" in the name of "SeqSQCclass". I will correct it and push another version.

For the question of coupling of sample annotation and QC results in SeqSQCclass object:

SeqSQC has these plot functions corresponding to each QC step (plotMissingRate, plotSexCheck...). It take SeqSQCclass objects as input, and read both of the sample annotation and the QC result for plotting. By including the sample annotation file in the SeqSQCclass object, it could make the checking process easier for the plot functions, by only checking if the input is a valid SeqSQCclass object.

Best, Qian

lshep commented 7 years ago

Some further comments:

vignette code:

> seqfile <- sampleQC(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot, format.data = "NGS", format.file = "vcf", QCreport = TRUE, out.report="report.html", interactive = TRUE)
Error in sampleQC(vfile = infile, output = outfile, capture.region = cr,  : 
  object 'seqfile' not found

MAN:

R CODE:

Note: I may make comments for one R script but should be generally applied to all R scripts:

subsetGDS

sampleQC

problemList

general about plotting functions:

mergeGDS

general

LoadVfile

IBDRemove and IBDRemoveAll

Please make updates or address these issues in some way and comment back here with changes and when ready for another review.

Thank you ~L

Liubuntu commented 7 years ago

Hi Lori,

Thank you for the second review. I will address these issues and hopefully get back to you by tomorrow.

Best, Qian

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

cffacae 2nd review

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171015235112.html

Liubuntu commented 7 years ago

HI Lori,

I have pushed a new version of SeqSQC, with the following revisement:

To answer some of the other questions:

  1. sampleQC: You might have arguments to the function to make plotting and writing tables optional; would there ever be a case where someone would want to skip one of the QC checks and is this possible? it looks like this a wrapper around the individual that the user could run on their own so this would be a convenient way to run all which is why you expose all the checks to the user? problemList

A: The sampleQC is written to do the one-step sample QC to include data preparation, all QC steps and QC result summary, and to make a complete and reliable sample QC for the NGS data, which is essential for cleaning samples before any downstream analysis. There are also some separate functions for each QC step and plotting if the user want to skip or do any specific ones. There are also already 13 arguments in the "sampleQC" function, I can add some additional arguments to let the user define which one to skip, but not quite sure if they are really needed.

  1. In SampleQC.R, are you sure for sexCheck that it will always be "female/male" not "Female/Male" or F/M etc. ...

A: Yes.. Because the result data from sex check is written to have the column as "female/male" rather than other format.

  1. if(study.pop == "ASN") study.pop <- c("EAS", "SAS", "ASN") what is the reasoning of doing three pop if the specified is asian - are populations limited to EAS SAS and ASN only or are there more?

A: There are 4 population from our benchmark data (assembled from 1000 Genomes Project), which are EUR, AFR, EAS (east Asian) and SAS (south Asian). So the Asian population were represented by 2 detailed EAS and SAS separately. If user have samples from Asian, we will use both of the EAS and SAS benchmark samples to assist the QC steps.

Please let me know if you have any questions or concerns. Thanks a lot!

Best, Qian

lshep commented 7 years ago

I think we are close ... A few very minor vignette and man pages updates - and some optional suggestions for future implementations:

Vignette:

Man: Thanks for adding the package documentation - add a seealso and/or mention the main functions of your package sampleQC

R Code Plotting functions - I like the one exposed plotting function this is good!

If you fix the vignette and man pages I think this is ready for acceptance and first implementation but again consider the code improvements suggested by Michael and myself for further improvements.

When ready for a look over - please do the version bump and comment back here with updates and answers to questions.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

9d25b5e tempdir() for vignette output; updated argument 'o...

Liubuntu commented 7 years ago

Hi Lori,

Thanks a lot for your very detailed review and practical suggestions. I have addressed most of the issues including:

To answer some of your other questions:

Answer: The meta info from vcf files are read into the gds file by \code{snpgdsVCF2GDS} within \code{LoadVfile}. They are saved in the gds file, with the filename kept as slotName \code{gdsfile} in the SeqSQC object. The reading is not optional. We used these snp information to merge the study cohort and benchmark data.

Answer: - the underlining plotting functions are written for different data sets generated from different QC steps. The data structure are slightly different, (except for IBD and PCA results which have more complex structure and plotting options). I will try to generalize the data structure of QC outputs and also the plotting functions in the next version, when I would possibly adopt the SeqArray or delayArray for NGS data storage and processing.

Please review the package and let me know for any further question. Thanks again!

Best, Qian

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171016123535.html

lshep commented 7 years ago

The default value is added as file.path(tempdir(), "sampleqc"). - It probably should default to the users working directory when setting the default within the function but you would want to use the tempdir() in the vignette. The tempdir creates a temporary directory that disappears after the R session is terminated - this is appropriate for the examples run in the vignette but probably not as a default for your main functions if the user forgets to specify (most assume the working directory) - so the default can just be "sampleqc"

The other changes look good.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

ba50e9b argument of 'output' in 'sampleQC' and 'LoadVfile'...

lshep commented 7 years ago

Don't forget to update the man files too - its causing the mismatch in documentation/code

Liubuntu commented 7 years ago

Hi Lori,

Thanks for your suggestion. I have just updated the default value for output to be in the working directory. It makes more sense right now.

Thanks. Qian

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171016143913.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

27aef94 documentation updated for updated argument 'output...

Liubuntu commented 7 years ago

Sorry for the code/documentation mismatches... updated the documentation. Thanks.

lshep commented 7 years ago

No problem. I'll wait for the build report but as long as its clean now I'll accept the package.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/SeqSQC_buildreport_20171016145525.html

bioc-issue-bot commented 7 years ago

Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be sent to the maintainer email address in the next several days.

Thank you for contributing to Bioconductor!

Liubuntu commented 7 years ago

Hi Lori & Michael,

Thank you for your help and insightful comments and suggestions about SeqSQC. It works with NGS data for sample quality check quite efficiently right now. but still I am looking to update the package in the newer version to:

Thanks again and your comments and suggestions are greatly appreciated.

Best, Qian

mtmorgan commented 7 years ago

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/Liubuntu.keys is not empty), then no further steps are required. Otherwise, do the following:

  1. Add an SSH key to your github account
  2. Submit your SSH key to Bioconductor

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using biocLite(\"SeqSQC\"). The package 'landing page' will be created at

https://bioconductor.org/packages/SeqSQC

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.