ISLET - Githubissues

haoharryfeng commented 2 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Repository: https://github.com/haoharryfeng/ISLET

Confirm the following by editing each check box to '[x]'

[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand Bioconductor Package Naming Policy and acknowledge Bioconductor may retain use of package name.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
[x] I am familiar with the Bioconductor code of conduct and agree to abide by it.

I am familiar with the essential aspects of Bioconductor software management, including:

[x] The 'devel' branch for new packages and features.
[x] The stable 'release' branch, made available every six months, for bug fixes.
[x] Bioconductor version control using Git (optionally via GitHub).

For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.

bioc-issue-bot commented 2 years ago

Hi @haoharryfeng

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: ISLET
Type: Package
Title: Individual-Specific ceLl typE referencing Tool 
Version: 0.99.1
Date: 2022-06-13
Authors@R: c(person("Hao", "Feng", role = c("aut", "cre"),
           email = "hxf155@case.edu"))
Description: ISLET is a method to conduct signal deconvolution for general -omics data. It can estimate the individual-specific and cell-type-specific reference panels, when there are multiple samples observed from each subject. It takes the input of the observed mixture data (feature by sample matrix), and the cell type mixture proportions (sample by cell type matrix), and the sample-to-subject information. It can solve for the reference panel on the individual-basis. It can also conduct test to identify cell-type-specific differential expression (csDE) genes.  
License: GPL-2
Depends: R(>= 4.2.0), Matrix, BiocParallel, parallel
Imports: stats
Suggests: BiocStyle, knitr, rmarkdown, htmltools
Collate: utils.R dataprep.R islet.est.R islet.solve.R islet.test.R
biocViews: Software, RNASeq, Transcriptomics, DataRepresentation, Transcription, Sequencing, GeneExpression, DataImport, DifferentialExpression, DifferentialMethylation
Encoding: UTF-8
LazyData: false
VignetteBuilder: knitr

vjcitn commented 2 years ago

In vignette we have

case_obs is a gene expression value data frame of 1,000 genes by 300 cancer sample, for the cancer group. ctrl_obs is a gene expression value data frame of 1,000 genes by 300 control sample, for the control group.

case_prop is a cell proportion matrix of 300 cancer samples by 6 cell types, for the cancer group. ctrl_prop is a cell proportion matrix of 300 control samples by 6 cell types, for the control group.

case_subject is a two-column data frame showing the relationship between the 300 cancer samples and their 100 subjects, for the cancer group. The first column is the sample_id. The second column is the subject_id. Each subject has 3 repeated/temporal measurements. ctrl_subject is a two-column data frame showing the relationship between the 300 control samples and their 100 subjects, for the control group. The format is the same with case_subject.

We generally advocate use of SummarizedExperiment for genomic data, so that sample-level data are tightly bound to assay data. Please use it as opposed to separate data frames.

haoharryfeng commented 2 years ago

Hi @vjcitn , thanks for the suggestion. I've made changes to use SummarizedExperiment as the standard input. Let me know if you have other questions.

bioc-issue-bot commented 2 years ago

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.

Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account

bioc-issue-bot commented 2 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/ISLET to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

lshep commented 2 years ago

@DarioS Please take the lead of this review. Thank you

bioc-issue-bot commented 2 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 7cd4d1a8f405e2c0be84d715a375c36eb7912022

bioc-issue-bot commented 2 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/ISLET to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

DarioS commented 2 years ago

I have reviewed the package and the issues which I have identified are:

It is strange that the input format required is one table for controls and one table for cases. This needs to be changed to one SummarizedExperiment with phenotype data stored in colData(dataset). The package has not been thoroughly converted from its original dependence on plain matrices in response to Vincent's request https://github.com/Bioconductor/Contributions/issues/2695#issuecomment-1157512375.
Why is the cell type proprtion (i.e. colData(case.dat)) required as input? This is almost always unknown for real data. The vignette explains that "ISLET can deconvolute mixture samples ..." so why is proportion input rather than output?
There is inconsistent use of = and <- for variable assignment. Please ensure that all assignments are <- and See Code Syntax and Efficiency section of Developer's Guide.
Use camelCase format for variable names and function names. See Variable Names and Function Names sections.
Use informative variable names, not single letter ones, such as, X = X, A = A, K = K, NS = NS, NU = NU.
Use one space after commas, according to Use of Space. For example, [,1] should be [, 1].
Avoid copying and pasting code within a function. For example, in dataprep.R, dataprep.R line 26: NU = length(unique(colData(case_dat_se)[,1])) + length(unique(colData(ctrl_dat_se)[,1])) line 57: case_num = length(unique(colData(case_dat_se)[,1])), ctrl_num = length(unique(colData(ctrl_dat_se)[,1]))
Lack of checking of input data. Function defined as dataprep<-function(case_dat_se, ctrl_dat_se) but no class check. Should either have is(inputData, "SummarizedExperiment") at the beginning or even better would be to use S4 methods and specify "SummarizedExperiment" as the method signature.
```
setMethod("inputPreprocess", "SummarizedExperiment", function(variable1, variable2)
{
# Do the preprocessing.
})
```
Don't use 1:G for index iteration but seq_along(G). See Code Syntax and Efficiency.

Excessively long lines of code that are hard to read. See Indentation section. For example:

case.indv = lapply(seq_len(K),function(k){rel[[k]][seq_len(datuse$case_num),] + matrix(rep(case.m[k,], each = datuse$case_num),nrow=datuse$case_num)})
ctrl.indv = lapply(seq_len(K),function(k){rel[[k]][-seq_len(datuse$case_num),] + matrix(rep(ctrl.m[k,], each = datuse$ctrl_num),nrow=datuse$ctrl_num)})

150 characters is quite long.

Dont use for loops. For example, for(iTest in seq_len(input$K)). See Code Syntax and Efficiency.
There are no unit tests. Such tests should check each function with a small data set with known outcome. See Unit Tests.
Inappropriate choices of biocViews. It should not have DataRepresentation, DataImport, Coverage. These are for low-level packages like GenomicRanges and BiocFileCache. Although your package uses a data representation called SummarizedExperiment, it does not create any new ones, so it's not appropriate to have it in that view.
No need to depend on both parallel and BiocParallel. Just depend on BiocParallel.

BiocParallel works on all operating systems, but the code only limits it to Linux. Please remove this limitation from your code.

if(.Platform$OS.type == "unix") {
## do some parallel computation under Unix
  G = nrow(input$exp_case)
  res = bplapply(seq_len(G), islet.est.bp, datuse = input)
}

The parallel computation doesn't allow the user to control how many CPUs will be used. Please allow the user to pass in an BiocParallelParam object that will customise how many CPUs will be used.

The vignette is not accesssible from R's command line.

> browseVignettes("ISLET")
No vignettes found by browseVignettes("ISLET")
> browseVignettes("GenomicRanges") # Works correctly assuming GenomicRanges is installed.

Please see Vignettes and particuarly VignetteIndexEntry.

Vignette should contain installation instructions near the top. Please see Installation.

The functions create lists of large data tables. If I do study123input or res.sol, it fills my R console with hundreds of lines of text. Please don't use plain lists as a return type but create an S4 object with a show method to concisely display results to screen. Consider DataFrame as a simple example about how to do it well:

> data <- DataFrame(name = LETTERS[1:26], score = 1:26)
> data # Even if DataFrame has 1 million rows, displaying it won't flood the R console.
DataFrame with 26 rows and 2 columns
       name     score
<character> <integer>
1             A         1
2             B         2
3             C         3
4             D         4
5             E         5
...         ...       ...
22            V        22
23            W        23
24            X        24
25            Y        25
26            Z        26

Note how the display of the variable's contents is concise. ISLET needs to improve in this regard.

Vignette needs more and clearer explanation. What do you mean by differential expression exactly? This can be done in many different ways. Are you using a negative binomial GLM for counts or something else? Please be specific. I see in islet.test that there is a likelihood ratio test, but nothing about the expected data distrubtion. Can data be RNA-seq count data or does it need to be homoscedastic data like microarray log values? How do you account for sample A having 30 million reads and sample B having 50 million reads in total in your deconvolution? Is there any library size adjustment? "The result res.test is a matrix of p-values" but what are the names of the hypothesis tests which can be done? Also, the vignette seems to use randomly-generated data for its demonstration dataset. Please use a real experimental dataset to demonstrate how the R package solves a real-world problem, rather than just an artificial one. You might be able to find one in Bioconductor's ExperimentHub database of experimental data or perhaps include your own dataset or include a public dataset in the data directory. In real datasets which I have analysed, the cell type proportions of samples are not known, but are a key requirement of the input data for ISLET. Also please define "Slope Testing" in one or two sentences. The vignette should be a comprehensive resource for new users of the software.
The package seems to be limited to two classes of samples. Why not three or more (e.g. complete responder, stable disease, progressing disease) classes?.
The package seems to only allow one covariate and it must be in the second column of the column data. This seems rather inflexible and rigid. Please change this and make it customisable by the user.

This time/age covariate must be stored in the second column in colData, to successfully execute this testing. How does the package determine which columns are cell type proportions and which are covariates? I see nothing explaining this in dataprep.Rd. If fact, it says "In the colData slot, the subject IDs take the first column, and the cell type proportions take the remaining columns." So it seems like covariates aren't allowed in the column data if I read this, which seems to contradict the vignette's advice.
Is calculating p-values after deconvolution statistically valid? Valid Post-clustering Differential Analysis for Single-cell RNA-Seq.

We observe that because clustering forces separation, reusing the same dataset generates artificially low p-values and hence false discoveries.

It would be interesting to see some evaluation if the method is anti-conservative.

Consider providing a publication-quality visualisation function. Currently, the package only produces text results.

bioc-issue-bot commented 2 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: dc55ace758f09861753b100b6f3084da79bcdd78

bioc-issue-bot commented 2 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/ISLET to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 2 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 045d3ab109415773c83dde86013f60a292a55300

bioc-issue-bot commented 2 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/ISLET to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

haoharryfeng commented 2 years ago

Hi @DarioS , we have addressed all your questions in our updated version. Please see below for a point-by-point response to your questions:

I have reviewed the package and the issues which I have identified are:

It is strange that the input format required is one table for controls and one table for cases. This needs to be changed to one SummarizedExperiment with phenotype data stored in colData(dataset). The package has not been thoroughly converted from its original dependence on plain matrices in response to Vincent's request ISLET #2695 (comment).

Response: Thanks for pointing this out. We have revised the required input format. Now ISLET takes only one SummarizedExperiment object, with phenotype data (case/control status) stored in the first column in colData(). Now ISLET should fully compliant with Vincent’s suggestion.

Why is the cell type proprtion (i.e. colData(case.dat)) required as input? This is almost always unknown for real data. The vignette explains that "ISLET can deconvolute mixture samples ..." so why is proportion input rather than output?

Response: We have revised our vignette to make the aim clear. ISLET indeed needs cell type proportions as input. And solving for cell type proportions is not the goal of ISLET. There has been quite a few methods available that can help solve for cell type proportions, using either reference-free approach and reference-base approach. Here, ISLET can directly take the solved proportions, and use them as our input. One of the merits of ISLET is that it can accurately solve for subject-specific reference panels, especially when there are multiple replicates per subject.

There is inconsistent use of = and <- for variable assignment. Please ensure that all assignments are <- and See Code Syntax and Efficiency section of Developer's Guide.

Response: We have revised all of our variable assignment code. Now all assignments are <-.

Use camelCase format for variable names and function names. See Variable Names and Function Names sections.

Response: Thanks for the suggestions on better practice on names. We have revised ISLET so that all names visible to users are now in camelCase.

Use informative variable names, not single letter ones, such as, X = X, A = A, K = K, NS = NS, NU = NU.

Response: Thanks for the suggestion. All variable names available to users are informative in this version. Typically we would use informative names for all variables. Here in this case, these variable names are designed to align with the mathematical notations in our methodology manuscript (in progress). We are making some exceptions here for better alignment reason. We have also provided comments in the code to annotate the variables.

Use one space after commas, according to Use of Space. For example, [,1] should be [, 1].

Response: Thanks for the suggestion. We have thoroughly reviewed and revised the code. Now ISLET is in compliant with the one space after commas requirement. We have also reviewed the use of space requirements and improved ISLET’s coding style.

Avoid copying and pasting code within a function. For example, in dataprep.R, dataprep.R line 26: NU = length(unique(colData(case_dat_se)[,1])) + length(unique(colData(ctrl_dat_se)[,1])) line 57: case_num = length(unique(colData(case_dat_se)[,1])), ctrl_num = length(unique(colData(ctrl_dat_se)[,1]))

Response: Thanks for the suggestion. We have revised the code to reduce the copying and pasting, include the example you mentioned in dataprep.R

Lack of checking of input data. Function defined as dataprep<-function(case_dat_se, ctrl_dat_se) but no class check. Should either have is(inputData, "SummarizedExperiment") at the beginning or even better would be to use S4 methods and specify "SummarizedExperiment" as the method signature.
setMethod("inputPreprocess", "SummarizedExperiment", function(variable1, variable2)
{
  # Do the preprocessing.
})

Response: Thanks for the suggestion. We have added input data checking step. This step checks the input data type, as well as additional aspects on group categories.

Don't use 1:G for index iteration but seq_along(G). See Code Syntax and Efficiency.

Response: Thanks for the suggestion. We have replaced 1:G style coding with seq_len(G) throughout the package.

Excessively long lines of code that are hard to read. See Indentation section. For example:

case.indv = lapply(seq_len(K),function(k){rel[[k]][seq_len(datuse$case_num),] + matrix(rep(case.m[k,], each = datuse$case_num),nrow=datuse$case_num)})
ctrl.indv = lapply(seq_len(K),function(k){rel[[k]][-seq_len(datuse$case_num),] + matrix(rep(ctrl.m[k,], each = datuse$ctrl_num),nrow=datuse$ctrl_num)})

150 characters is quite long.

Response: Thanks for the suggestion. We have revised the code to considerably reduced the length of code per line.

Dont use for loops. For example, for(iTest in seq_len(input$K)). See Code Syntax and Efficiency.

Response: Thanks for the suggestion. We have revised the code to replace the for loops with more computational efficient functions.

There are no unit tests. Such tests should check each function with a small data set with known outcome. See Unit Tests.

Response: Thanks for the suggestion. We have added unit test using the RUnit package as outlined in the Bioconductor readings.

Inappropriate choices of biocViews. It should not have DataRepresentation, DataImport, Coverage. These are for low-level packages like GenomicRanges and BiocFileCache. Although your package uses a data representation called SummarizedExperiment, it does not create any new ones, so it's not appropriate to have it in that view.

Response: Thanks pointing this out. We have revised the DESCRIPTION to remove these biocViews.

No need to depend on both parallel and BiocParallel. Just depend on BiocParallel.

Response: We’re addressing the 3 issues related to parallel computing together in the 3rd item. See our explanation below.

BiocParallel works on all operating systems, but the code only limits it to Linux. Please remove this limitation from your code.
    if(.Platform$OS.type == "unix") {
    ## do some parallel computation under Unix
      G = nrow(input$exp_case)
      res = bplapply(seq_len(G), islet.est.bp, datuse = input)
  }

Response: We’re addressing the 3 issues related to parallel computing together under the 3rd item. See our explanation below.

The parallel computation doesn't allow the user to control how many CPUs will be used. Please allow the user to pass in an BiocParallelParam object that will customise how many CPUs will be used.

Response: This is our response to the 3 issues related to parallel computing. First, we’re aware that BiocParallel works on all OS. During our experiments, we noticed that the SerialParam in Windows is unfavorably slow (much slower than in Unix). This brings unpleasant experience to users who use ISLET in Windows. Therefore, it motivates us to implement different parallel computing approaches for Unix and Windows. After some exploration, we found the parLapply in Windows performs the best, after segmenting the data in block-wise fashion. So we adopted different parallel computing approaches for Unix and Windows, to bring the best possible computing speed to users, regardless of their OS. Therefore, ISLET depends on parallel for Windows OS. We kindly ask you to allow the different parallel computing approach depending on the OS. With that said, we have added additional variables in both isletSolve and isletTest to allow users to control how many cores to use. The user-specified core numbers is later passed to an BiocParallelParam object to bplapply.

The vignette is not accesssible from R's command line.
> browseVignettes("ISLET")
No vignettes found by browseVignettes("ISLET")
> browseVignettes("GenomicRanges") # Works correctly assuming GenomicRanges is installed.
Please see Vignettes and particuarly VignetteIndexEntry.

Response: We have reviewed the vignettes section and made changes. We successfully accessed the vignette from R’s command line (after R CMD build and R CMD check, installed from tarball), and the error is no longer reproducible. With that said, we’re unsure what caused the problem previously. Please do let us know if there is still problems with accessing the vignette. Thank you.

Vignette should contain installation instructions near the top. Please see Installation.

Response: We have added installation instructions near the top of vignette. Additional, we have added a subsection to encourage users who encounter a problem to post an issue in ISLET’s GitHub page.

The functions create lists of large data tables. If I do study123input or res.sol, it fills my R console with hundreds of lines of text. Please don't use plain lists as a return type but create an S4 object with a show method to concisely display results to screen. Consider DataFrame as a simple example about how to do it well:
> data <- DataFrame(name = LETTERS[1:26], score = 1:26)
> data # Even if DataFrame has 1 million rows, displaying it won't flood the R console.
DataFrame with 26 rows and 2 columns
           name     score
    <character> <integer>
1             A         1
2             B         2
3             C         3
4             D         4
5             E         5
...         ...       ...
22            V        22
23            W        23
24            X        24
25            Y        25
26            Z        26
Note how the display of the variable's contents is concise. ISLET needs to improve in this regard.

Response: Thanks pointing this out. We have designed new S4 class for the input data and the inferred reference panel output, and added associated show methods to concisely visualize them. Now they won’t flood the screen.

Vignette needs more and clearer explanation. What do you mean by differential expression exactly? This can be done in many different ways. Are you using a negative binomial GLM for counts or something else? Please be specific. I see in islet.test that there is a likelihood ratio test, but nothing about the expected data distrubtion. Can data be RNA-seq count data or does it need to be homoscedastic data like microarray log values? How do you account for sample A having 30 million reads and sample B having 50 million reads in total in your deconvolution? Is there any library size adjustment? "The result res.test is a matrix of p-values" but what are the names of the hypothesis tests which can be done? Also, the vignette seems to use randomly-generated data for its demonstration dataset. Please use a real experimental dataset to demonstrate how the R package solves a real-world problem, rather than just an artificial one. You might be able to find one in Bioconductor's ExperimentHub database of experimental data or perhaps include your own dataset or include a public dataset in the data directory. In real datasets which I have analysed, the cell type proportions of samples are not known, but are a key requirement of the input data for ISLET. Also please define "Slope Testing" in one or two sentences. The vignette should be a comprehensive resource for new users of the software.

Response: Thank you for the questions and advice about vignette. We moved the installation instructions to the top and added descriptions about the statistical model used in ISLET (see Introduction). We also added additional descriptions about sequencing library size, covariates, and additional groups in Introduction. The slope test was explained and expanded in the section of ‘Test csDE in change-rate (slope)’.

The package seems to be limited to two classes of samples. Why not three or more (e.g. complete responder, stable disease, progressing disease) classes?.

Response: Thanks for the suggestion. Our modeling for ISLET currently only allows for a two-group comparison design, and estimate the individual-specific reference panels. Our method is the first work in its kind. However, we will continue our investigation in this domain and update ISLET accordingly.

The package seems to only allow one covariate and it must be in the second column of the column data. This seems rather inflexible and rigid. Please change this and make it customisable by the user.

This time/age covariate must be stored in the second column in colData, to successfully execute this testing. How does the package determine which columns are cell type proportions and which are covariates? I see nothing explaining this in dataprep.Rd. If fact, it says "In the colData slot, the subject IDs take the first column, and the cell type proportions take the remaining columns." So it seems like covariates aren't allowed in the column data if I read this, which seems to contradict the vignette's advice.

Response: This is indeed a rather rigid issue. One of the reason is based on the pragmatic consideration: the number of cell types, K, can change from study to study. Our modeling is designed to work with a flexible number of cell types. Rather than asking the users to tell us which column is the group and then specify which columns are the cell types, we took a simple approach: ask the user to store the group information in the second column and leave the remaining columns to store the cell types. We think this is reasonable and simple requirement for users. There is no contradiction here. In the regular test (without slope), the remaining columns are the cell type proportions. In the slope test, the third column is an additional covariate for the time variable, and the remaining columns are the cell type proportions. We have designed two different functions, dataPrep and dataPrepSlope, respectively, to handle them differently.

Is calculating p-values after deconvolution statistically valid? Valid Post-clustering Differential Analysis for Single-cell RNA-Seq.

We observe that because clustering forces separation, reusing the same dataset generates artificially low p-values and hence false discoveries.

It would be interesting to see some evaluation if the method is anti-conservative.

Response: Thanks for sharing this article with us and we do believe this is a very important and underexplored questions. Although ISLET is not on single cell analysis, similar situation may also apply to deconvolution problems. I believe especially in reference-free deconvolution, methods will try to segregate cell types using the direction in higher dimension that shows the most strong separation. Here for ISLET, it takes cell type proportions as known values, thus we suspect it may suffer less from this issue. With that said, if the cell type proportions are estimated from other deconvolution, the results may still be contaminated and more liberal test statistics can be expected. This is a research topic itself and I would be very interested to explore.

Consider providing a publication-quality visualisation function. Currently, the package only produces text results.

Response: Thanks for the constructive suggestion. In our present research, we focused on the implementation of ISLET framework, and the comparison with other popular methods/tools in reference panel estimation and cell-type-specific differential expression (csDE) analysis. Despite occasional deviance in group effect estimate, we demonstrated the superior performance of ISLET in the above analyses in an extensive simulation study and confirmed cell-type-specific signals previously reported in a large-cohort study. However, we still need to improve ISLET parameter estimation in future work and then benchmark by profiling pure cell RNA-Seq data from the same longitudinal samples. The visualization of real data at the current stage is limited to mixture-cell RNA-Seq data due to the lack of pure-cell data, which is not capable of validating the deconvolved profiles or csDEG. We kindly ask you to let us save this feature for a future version.

DarioS commented 2 years ago

Thanks for these changes. The package and my understanding of its purpose has improved. However, I note that some of the changes replied to above have not been sufficiently implemented.

@lshep Could you comment on the acceptability of

isletTest<-function(input, ncores = min(detectCores()-1, 15))
{
    ...
    if(.Platform$OS.type == "unix") {
    ## do some parallel computation under Unix
      multicoreParam <- MulticoreParam(workers = ncores)
    mf <- bplapply(X=Yall.list, islet.solve.block, datuse = input, BPPARAM = multicoreParam)
    ...

? Although it makes it easy for the package users to avoid knowing about BiocParallelParam class, it also means that they can't adjust oprtions, such as RNGseed and tasks which might make the results less reproducible.

For the vignette, show each data and results variable. Also, explain in the next whether it is real experimental data or simulated data. If simulated, what are the simulation parameters. For example,

> GE600_se # Show what's inside for the vignette reader.
class: SummarizedExperiment 
dim: 50 600 
metadata(0):
assays(1): counts
rownames(50): gene1 gene2 ... gene49 gene50
rowData names(0):
colnames(600): Case_1_1 Case_1_2 ... Ctr_100_2 Ctr_100_3
colData names(8): group subject_ID ... Celltype5 Celltype6

The vignette contains a bunch of data assignments such as res.test <- isletTest(input=study123input) and age.test <- isletTest(input=study456input). Please show the output of these tests and provide a couple of sentences of interpretation to go along with the results tables immediately below the code. What should the user pay attention to when they look at the table?

Section 4 Data Preparation. Code is shown but there is no explanation about what data preparation does. Add one or two sentences about happens during preparation.
The new unit tests file consists of two unit tests. Both of them are simply for input data conversion into an inputSet object. There is no testing of the core of the package which is to do deconvolution and hypothesis tests. Please test your core package functionality in the unit tests.
The code still has more changes to be made before it would conform to the Bioconductor style.

isletSolve<-function(input, ncores = min(detectCores()-1, 15) ){
    # islet.solve only runs on the model without age effect.
    if(input@type=='slope'){
        stop('Input should be prepared by dataprep()')
    }
    if(ncores>detectCores()){
        stop('Number of cores specified above the limit')
    }

    #make Yall a list
    G <- nrow(input@exp_case) # This has the desired style.

We can see that there is still often no space around binary operators, which is requested by Section 15.2.7.2 of the guide. Also, your conversion into camelCase functions and variables missed some instances, such as dataprep in the error message above.

I agree that package code that is not for use by the end user could use single letter variable names to make it closer in appearance to formulae in a journal article. Also, it is not necessary but only suggested to have plots.

Keep up the good work and I look forward to the next version of your package.

vjcitn commented 2 years ago

@DarioS ,Apropos

isletTest<-function(input, ncores = min(detectCores()-1, 15))
{
    ...
    if(.Platform$OS.type == "unix") {
    ## do some parallel computation under Unix
      multicoreParam <- MulticoreParam(workers = ncores)
    mf <- bplapply(X=Yall.list, islet.solve.block, datuse = input, BPPARAM = multicoreParam)
    ...

I discussed this with @lshep. An appropriate approach is to write the function with BPPARAM=SerialParam() as the user-facing parameter, and pass BPPARAM to bplapply. If you want to provide more guard rails for users, define additional helper functions that can build an instance of a subclass of BiocParallelParam, which is passed as BPPARAM.

Let us know if there are further questions.

bioc-issue-bot commented 2 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 8b8e13732b6262e65cb5cc8902652def91c4038f

bioc-issue-bot commented 2 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/ISLET to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 2 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 8dae33655e7b41acdda4fc7a2046c1034902bce2

bioc-issue-bot commented 2 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/ISLET to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

haoharryfeng commented 2 years ago

Hi @DarioS , we have once again addressed these issues and please see below for our modifications and responses.

Thanks for these changes. The package and my understanding of its purpose has improved. However, I note that some of the changes replied to above have not been sufficiently implemented.

@lshep Could you comment on the acceptability of
isletTest<-function(input, ncores = min(detectCores()-1, 15))
{
    ...
    if(.Platform$OS.type == "unix") {
    ## do some parallel computation under Unix
      multicoreParam <- MulticoreParam(workers = ncores)
    mf <- bplapply(X=Yall.list, islet.solve.block, datuse = input, BPPARAM = multicoreParam)
    ...    
? Although it makes it easy for the package users to avoid knowing about BiocParallelParam class, it also means that they can't adjust oprtions, such as RNGseed and tasks which might make the results less reproducible.

Response: We’ve revised the code, according to your suggestion at the initial review, to allow the input of a BiocParallelParam class into deconvolution function and hypothesis testing function. Now the users should have good amount of flexibility to adjust all necessary parameters.

For the vignette, show each data and results variable. Also, explain in the next whether it is real experimental data or simulated data. If simulated, what are the simulation parameters. For example,
> GE600_se # Show what's inside for the vignette reader.
class: SummarizedExperiment 
dim: 50 600 
metadata(0):
assays(1): counts
rownames(50): gene1 gene2 ... gene49 gene50
rowData names(0):
colnames(600): Case_1_1 Case_1_2 ... Ctr_100_2 Ctr_100_3
colData names(8): group subject_ID ... Celltype5 Celltype6
The vignette contains a bunch of data assignments such as res.test <- isletTest(input=study123input) and age.test <- isletTest(input=study456input). Please show the output of these tests and provide a couple of sentences of interpretation to go along with the results tables immediately below the code. What should the user pay attention to when they look at the table?

Response: Thanks for the suggestions. (1) We swapped our example dataset to include a set of real RNA-seq data from TEDDY project (The Environmental Determinants of Diabetes in the Young, https://teddy.epi.usf.edu/ ). The gene names are temporarily replaced with dummy gene index at this moment, due to the fact it’s unpublished dataset. However, they will be replaced back to real gene names once the manuscript is published. The real dataset include gene expression values from 10 genes, among 83 young type-I diabetes cases and 89 healthy controls, with multiple temporal observation samples for each subject. We have included additional descriptions about this in our vignette. (2) We added code and text to display the data and results variables, and expanded detailed descriptions of the input data and output results.

Section 4 Data Preparation. Code is shown but there is no explanation about what data preparation does. Add one or two sentences about happens during preparation.

Response: We have expanded this section to provide more details about our data preparation step. Additionally, we have provided a snapshot of the S4 output from this step, so the user can have a sense of the available elements inside.

The new unit tests file consists of two unit tests. Both of them are simply for input data conversion into an inputSet object. There is no testing of the core of the package which is to do deconvolution and hypothesis tests. Please test your core package functionality in the unit tests.

Response: Thanks for pointing this out. We’ve now added 3 additional unit test functions, to test the core package in terms of deconvolution and cell-type-specific differential testing.

The code still has more changes to be made before it would conform to the Bioconductor style.
isletSolve<-function(input, ncores = min(detectCores()-1, 15) ){
    # islet.solve only runs on the model without age effect.
    if(input@type=='slope'){
        stop('Input should be prepared by dataprep()')
    }
    if(ncores>detectCores()){
        stop('Number of cores specified above the limit')
    }

    #make Yall a list
    G <- nrow(input@exp_case) # This has the desired style.
We can see that there is still often no space around binary operators, which is requested by Section 15.2.7.2 of the guide. Also, your conversion into camelCase functions and variables missed some instances, such as dataprep in the error message above.

Response: Thanks for pointing this out. We’ve tried to identify all coding conformation problems and changed those. Hopefully we identified all of them. But please let us know if we missed any.

I agree that package code that is not for use by the end user could use single letter variable names to make it closer in appearance to formulae in a journal article. Also, it is not necessary but only suggested to have plots.

Keep up the good work and I look forward to the next version of your package.

DarioS commented 2 years ago

Good work. A couple more issues.

Your latest vignette update requests users use @ but only developers should use it. See Class Design.

Generally @ should only be used in an accessor, all other code should use the accessor.
Your final result appears to produce no output.

bioc-issue-bot commented 2 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: e158ebfe08fc2ba369a0c192a1201155dee17a07

bioc-issue-bot commented 2 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/ISLET to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

haoharryfeng commented 2 years ago

Hi @DarioS , we made changes to fix these issues. See below for our point-by-point responses.

Good work. A couple more issues.

Your latest vignette update requests users use @ but only developers should use it. See Class Design.

Generally @ should only be used in an accessor, all other code should use the accessor.

Response: We eliminated and replaced the @ usage in vignette. Instead, we created functions to access the S4 object slots. Now the vignette shows the examples of using these access functions, to extract ISLET estimated reference panels.

Your final result appears to produce no output.

Response: This is now fixed. This was intentional previously, to avoid running this slope test example, in order to save time in vignette building. Since we are now using a very small example dataset, we have now switched to evaluate almost all codes in our vignette. Now the results would appear properly.

DarioS commented 2 years ago

I am satisfied with the latest changes. I will now leave the final decision to Lori Shepherd.

One minor comment. A change to the vignette one week ago displays the contents of the data sets. I wonder why the measurements are stored as an assay named counts if they are decimal numbers rather than integers.

> assays(GE600_se)$counts[1:5, 1:6]
        6454256   1716203    8125261    6264143    5640428   3764673
gene1 51.796233 50.593053  29.870425  55.052521 193.789977 60.638259
gene2  1.486545  1.995879   2.670510   1.915750   1.456993  1.786843
gene3 34.213228 41.148280  50.366612  15.964298  45.800864 22.525563
gene4  5.788845  4.489374   7.776873   1.094732   1.451257  1.388009
gene5 67.010747 76.308153 106.936595 257.108819  86.269951 67.152534

GE600 data set is described as "A gene expression value data frame of 10 genes by 520 sample." Be more specific than just "gene expression value" which could mean almost anything (but not counts ... because of the decimal places).

bioc-issue-bot commented 2 years ago

Your package has been accepted. It will be added to the Bioconductor nightly builds.

Thank you for contributing to Bioconductor!

Reviewers for Bioconductor packages are volunteers from the Bioconductor community. If you are interested in becoming a Bioconductor package reviewer, please see Reviewers Expectations.

lshep commented 2 years ago

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/haoharryfeng.keys is not empty), then no further steps are required. Otherwise, do the following:

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("ISLET"). The package 'landing page' will be created at

https://bioconductor.org/packages/ISLET

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.

Bioconductor / Contributions

ISLET #2695