Bioconductor / Contributions

Contribute Packages to Bioconductor
131 stars 33 forks source link

smoothclust #3346

Closed lmweber closed 2 months ago

lmweber commented 3 months ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.

bioc-issue-bot commented 3 months ago

Hi @lmweber

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: smoothclust
Version: 0.99.0
Title: smoothclust
Description: Method for segmentation of spatial domains and spatially-aware 
    clustering in spatial transcriptomics data. The method generates spatial 
    domains with smooth boundaries by smoothing gene expression profiles across 
    neighboring spatial locations, followed by unsupervised clustering. Spatial 
    domains consisting of consistent mixtures of cell types may then be further 
    investigated by applying cell type compositional analyses or differential 
    analyses.
Authors@R: c(
    person("Lukas M.", "Weber", 
 email = "lmweberedu@gmail.com", 
 role = c("aut", "cre"), 
 comment = c(ORCID = "0000-0002-3282-1730")))
URL: https://github.com/lmweber/smoothclust
BugReports: https://github.com/lmweber/smoothclust/issues
License: MIT + file LICENSE
Encoding: UTF-8
biocViews: 
    Spatial, 
    SingleCell, 
    Transcriptomics, 
    GeneExpression, 
    Clustering
Depends: 
    R (>= 4.4.0)
Imports: 
    SpatialExperiment, 
    SummarizedExperiment, 
    sparseMatrixStats, 
    spdep, 
    methods, 
    utils
VignetteBuilder: knitr
Suggests: 
    BiocStyle, 
    knitr, 
    STexampleData, 
    scuttle, 
    scran, 
    scater, 
    ggspavis, 
    testthat
RoxygenNote: 7.3.1
bioc-issue-bot commented 3 months ago

Your package has been added to git.bioconductor.org to continue the pre-review process. A build report will be posted shortly. Please fix any ERROR and WARNING in the build report before a reviewer is assigned or provide a justification on why you feel the ERROR or WARNING should be granted an exception.

IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. All changes should be pushed to git.bioconductor.org moving forward. It is required to push a version bump to git.bioconductor.org to trigger a new build report.

Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account

bioc-issue-bot commented 3 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.3 LTS): smoothclust_0.99.0.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/smoothclust to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 3 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: e5f25e5969af1c18755d8df77e5b992f3a8cc200

bioc-issue-bot commented 3 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.3 LTS): smoothclust_0.99.1.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/smoothclust to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot commented 3 months ago

A reviewer has been assigned to your package for an indepth review. Please respond accordingly to any further comments from the reviewer.

HelenaLC commented 3 months ago

Thanks for the submission, Lukas. Since there's not a huge amount of code, I (perhaps over-)focused on implementation of that. That been said, the package is in decent shape already and most of the points below are fairly optional (!), but felt worthy of being pointed out nevertheless. Please respond back with what has (not) been addressed and/or any questions!

docs

code

# smoothness_metric.R lines 87+
neigh_labels <- vapply(
  seq_len(k), \(i) labels[neigh[, i]], 
  vector(mode(labels), length(labels)))
# 'stopifnot()' in line 93 becomes redundant
# because 'vapply()' already asserts this
weights <- exp(-dists/bandwidth_scaled)
keep <- weights >= truncated

# as opposed to

exp_kernel <- function(d) {exp(-d / bandwidth_scaled)}
weights <- lapply(dists, exp_kernel)
keep <- lapply(weights, function(w) {w >= truncate})
lmweber commented 3 months ago

Hi Helena, thanks for the review!

Yes, these are all good points, thanks. I'll work on them and will push an update soon!

bioc-issue-bot commented 3 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 13f687533d1d53a7fb33aeb7428006ba779823fb

bioc-issue-bot commented 3 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.3 LTS): smoothclust_0.99.4.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/smoothclust to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

lmweber commented 3 months ago

Just pushing some partial updates above for now. I'll add some additional updates and a complete reply to the comments soon.

Thanks for this comment especially:

I'd consider to have smoothclust() having separate arguments for input/output assays; just thinking that users can always choose to overwrite, but it's usually a good idea to keep the raw data, e.g., other methods (like normalization, dimension reduction) also add not replace slots (but that's just my take...)

I think you are right -- much better to add the new values in a new assay instead of overwriting the original one. I have changed this now and updated the vignette example to show how to access the new assay.

Also regarding parallelization, I tried this now using BiocParallel, but I don't think it works since it is grabbing a different subset of columns at each iteration, so it needs to load the whole dataset at each iteration instead of slicing it up at the start. So I don't think this is as well suited to parallelization as some of my previous packages. I'll think about this some more to be sure though. Some sort of giant sparse tensor or ragged multi-dimensional array might work in theory too, although this is beyond my programming ability (and might even be better with Python and GPUs etc).

Thanks -- will add some more updates and post a complete response soon.

bioc-issue-bot commented 3 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 48465e596f0ef5f8591db8004835025e0c7fcd51

bioc-issue-bot commented 3 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.3 LTS): smoothclust_0.99.5.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/smoothclust to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

lmweber commented 3 months ago

Hi, thanks again for the review. I have pushed another updated version above, and below are my detailed replies to the comments. Thank you!


docs

in the README, please add Bioc installation instructions using BiocManager::install()

Done

in the vignette, replace remotes by BiocManager installation instructions

Done

fwiw, you can use BiocStyle::Biocpkg() (and analogous functions for non-Bioc packages) to hyperlink external packages

I left these, mostly as a style preference, since I find pasting them in as links makes it easier to copy-paste into the readme file and other external docs / code examples where I haven't loaded BiocStyle.

I'd suggest using grepl("(^mt-)", ..., ignore.case=TRUE) or use the exact pattern matching your data rather than trying to cover different cases (just in case users read'n'copy this).

Done (in both vignette and examples)

I'd be appreciated if you could add some more (descriptive, methodological etc.) details on the different smoothing options available and the effects of corresponding parameters on the result; perhaps somewhere around where smoothclust() is being run.

I added some more details to mention the available methods and arguments in the vignette, and added a reference to the function documentation, which has a lot more detail.

code

commented-out code should be removed for release (e.g., smoothclust.R lines 175+)

Done

if this is intentional, all good, but just wanted to point out stopifnot() needn't be repeated but can contain an arbitrary number of comma-separated conditions

I updated this in some places and left it in others. This is more of a style preference -- in some cases I find it helpful to have them on separate lines so I can easily run them individually while I am testing the code.

perhaps worth considering a couple more argument checks in smoothclust() (e.g., assay_name should be length 1, bandwidth/truncate/k should be numeric scalars, keep_unsmoothed should be a logical scalar)

Done -- added several additional argument checks

also, think it might be worth expanding on spatial_coords validity checks (smoothclust.R lines 122+), i.e., should be a two-column matrix; smoothness_metric() also checks this and I can see a lot of code that'll fail otherwise

Done, thanks. I added additional checks in both smoothclust() and smoothclust_metric().

related to the above, I think line 122 should check that spatial_coords aren't NULL (which they can in priniciple be in an SPE...)

Done, thanks

since smoothclust() seems to iterate over observations, might be worth exposing a BPPARAM argument or similar to enable parallelization when possible

I thought about this, as discussed in my previous message above, and I think parallelization doesn't work (or work easily) in this package because each iteration needs to access a different set of columns from the input data (i.e. the sets of neighbors, which are different for each point). So I'm pretty sure that in this case parallelization doesn't really help, since it needs to go back to the original data each time. I may be wrong and there could be some more complicated way to do it (e.g. a giant sparse tensor product? not sure if that makes sense), but will leave it for now, since the calculations are correct, just somewhat slow.

I'd consider to have smoothclust() having separate arguments for input/output assays; just thinking that users can always choose to overwrite, but it's usually a good idea to keep the raw data, e.g., other methods (like normalization, dimension reduction) also add not replace slots (but that's just my take...)

Thanks for this comment. Yes, I agree, this is better. As discussed in my previous comment above, I have changed it to leave the original counts assay unchanged, and store the new values in a new assay named counts_smooth (or <assay_name>_smooth if a different assay_name argument is provided). I have also updated the example in the vignette to then use logNormCounts(spe, assay.type = "counts_smooth") instead, and then the rest of the workflow is unchanged.

in smoothclust.R lines 172-173, use vapply() instead of sapply()

Done

smoothness_metric.R lines 97-100 can be computed via rowSums(labels != neigh_labels)

Done, thanks

for-loops should be avoided in favor of v/lapply() where there are no side-effects or dependencies on previous iterations (see here); e.g., see the following code chunk, but there are more instances in smoothclust.R where this might be applicable as well:

I left these as for loops, mainly as a style preference. For these types of calculations, I find it easier to visualize what is going on when it is in a for loop instead of inside lapply(). Especially for these main calculations at the core of the methodology, I want to be really sure that I can see how the calculations are working. I agree with the info at the link though on pre-filling vectors / matrices, and have done this by pre-filling the results vectors (vals_smooth) with NAs of the right length and then storing the results. In this case I believe there is no speed difference between loops vs. lapply(). (On the other hand for parallelization I would need to use bplapply/bpmapply instead.)

I find the kernal method lines 162+ rather curiously implemented; if dists were put inside a matrix, all lapply() calls could be handled by base-R / matrix operations; e.g., see below (but I'm obviously not familiar with all the code, so this is just a suggestion/sth that caught my attention...)

I think this is due to the same issue as the parallelization point above -- this doesn't work since the distances are different for each point, so it is a long list of matrices instead of a single matrix. I could be wrong and there may be some more efficient way to do it, but I left this for now, as long as the calculations are correct.

lmweber commented 3 months ago

I also made the following additional updates, not mentioned above:

HelenaLC commented 2 months ago

Looks all well to me now (and appreciate the good-to-begin-with state of things like documentation etc.) - thanks!

*remember to track changes in the NEWS file.

bioc-issue-bot commented 2 months ago

Your package has been accepted. It will be added to the Bioconductor nightly builds.

Thank you for contributing to Bioconductor!

Reviewers for Bioconductor packages are volunteers from the Bioconductor community. If you are interested in becoming a Bioconductor package reviewer, please see Reviewers Expectations.

lmweber commented 2 months ago

Awesome, thanks!

lshep commented 2 months ago

The default branch of your GitHub repository has been added to Bioconductor's git repository as branch devel.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/lmweber.keys is not empty), then no further steps are required. Otherwise, do the following:

  1. Add an SSH key to your github account
  2. Submit your SSH key to Bioconductor

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("smoothclust"). The package 'landing page' will be created at

https://bioconductor.org/packages/smoothclust

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.