Bioconductor / Contributions

Contribute Packages to Bioconductor
134 stars 33 forks source link

CuratedAtlasQueryR #2991

Closed multimeric closed 1 year ago

multimeric commented 1 year ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.

bioc-issue-bot commented 1 year ago

Hi @multimeric

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Type: Package
Package: CuratedAtlasQueryR
Title: Queries the Human Cell Atlas
Version: 0.99.1
Authors@R: c(
    person(
        "Stefano",
        "Mangiola",
        email = "mangiolastefano@gmail.com",
        role = c("aut", "cre", "rev")
    ),
    person(
        "Michael",
        "Milton",
        email = "milton.m@wehi.edu.au",
        role = c("aut", "rev")
    ),
    person(
        "Martin",
        "Morgan",
        email = "Martin.Morgan@RoswellPark.org",
        role = c("ctb", "rev")
    ),
    person(
        "Vincent",
        "Carey",
        email = "stvjc@channing.harvard.edu",
        role = c("ctb", "rev")
    ),
    person(
        "Julie",
        "Iskander",
        email = "iskander.j@wehi.edu.au",
        role = c( "rev")
    ),
    person(
        "Tony",
        "Papenfuss",
        email = "papenfuss@wehi.edu.au",
        role = c( "rev")
    ),
    person(
        "Silicon Valley Foundation",
        "CZF2019-002443",
        role = c( "fnd")
    ),
    person(
        "NIH NHGRI",
        "5U24HG004059-18",
        role = c( "fnd")
    ),
    person(
        "Victoria Cancer Agnency",
        "ECRF21036",
        role = c( "fnd")
    ),
    person(
        "NHMRC",
        "1116955",
        role = c( "fnd")
    ))
Description: Provides access to a copy of the Human Cell Atlas, but with 
    harmonised metadata. This allows for uniform querying across numerous 
    datasets within the Atlas using common fields such as cell type, tissue 
    type, and patient ethnicity. Usage involves first querying the metadata
    table for cells of interest, and then downloading the corresponding cells 
    into a SingleCellExperiment object. 
License: GPL-3
Depends:
    R (>= 4.2.0)
Imports:
    dplyr,
    SummarizedExperiment,
    SingleCellExperiment,
    purrr (>= 1.0.0),
    BiocGenerics,
    glue,
    HDF5Array,
    DBI,
    tools,
    httr,
    cli,
    assertthat,
    SeuratObject,
    Seurat,
    methods,
    rlang,
    stats,
    S4Vectors,
    tibble,
    utils,
    dbplyr (>= 2.3.0),
    duckdb,
    stringr
Suggests:
    zellkonverter,
    rmarkdown,
    knitr,
    testthat,
    basilisk,
    arrow,
    reticulate
Biarch: true
biocViews: 
    AssayDomain,
    Infrastructure,
    RNASeq,
    DifferentialExpression,
    GeneExpression,
    Normalization,
    Clustering,
    QualityControl,
    Sequencing,
    Transcription,
    Transcriptomics
Encoding: UTF-8
RoxygenNote: 7.2.3
LazyDataCompression: xz
URL: https://github.com/stemangiola/CuratedAtlasQueryR
BugReports: https://github.com/stemangiola/CuratedAtlasQueryR/issues
VignetteBuilder: knitr
Roxygen: list(markdown = TRUE)
Collate: 
    'utils.R'
    'counts.R'
    'dev.R'
    'metadata.R'
    'seurat.R'
    'unharmonised.R'
    'zzz.R'
bioc-issue-bot commented 1 year ago

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.

Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account

bioc-issue-bot commented 1 year ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CuratedAtlasQueryR to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

multimeric commented 1 year ago

Is there any way to access the logs of the bioc build? Also, are there any known issues relating to network accesses or filesystem usage on the build machine? Because when I run a build locally it finishes in just under 2 minutes, and therefore shouldn't be hitting the 15 minute timeout:

$ gh repo clone stemangiola/CuratedAtlasQueryR
Cloning into 'CuratedAtlasQueryR'...
remote: Enumerating objects: 2078, done.
remote: Counting objects: 100% (817/817), done.
remote: Compressing objects: 100% (355/355), done.
remote: Total 2078 (delta 524), reused 676 (delta 444), pack-reused 1261
Receiving objects: 100% (2078/2078), 1.86 MiB | 1.30 MiB/s, done.
Resolving deltas: 100% (1256/1256), done.
$ /usr/bin/time R CMD build CuratedAtlasQueryR
* checking for file ‘CuratedAtlasQueryR/DESCRIPTION’ ... OK
* preparing ‘CuratedAtlasQueryR’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Omitted ‘LazyDataCompression’ from DESCRIPTION
* building ‘CuratedAtlasQueryR_0.99.1.tar.gz’

86.92user 13.00system 1:45.13elapsed 95%CPU (0avgtext+0avgdata 1515732maxresident)k
2192328inputs+16920outputs (1010major+3387858minor)pagefaults 0swaps
multimeric commented 1 year ago

R CMD build also only takes 216.70 seconds, ie < 4 minutes on my own Mac, using my home internet connection which isn't even that fast. So I'm not sure what is happening here.

lshep commented 1 year ago

Are you using the latest versions of all packages and R 4.3 Bioc 3.17 to test? BiocManager::valid to make sure that you have the expected versions of packages. I just tried locally on my linux box and it is at 15 minutes and it still has not finished building. Is there large amounts of data being downloaded? Are you using caching mechanisms?

multimeric commented 1 year ago

Yes, it downloads a decent amount of data to run the vignette. Maybe 500mb at a guess? Is that likely to be an issue? Is the time to install dependencies included in the build time? I thought not.

lshep commented 1 year ago

package dependency installation no. but data download and actively building the vignette yet.

vjcitn commented 1 year ago

If the downloaded data are properly cached then once the build system has the resources cached, it will not need to download them again and event times will be reduced. Are we talking about the metadata resource or illustrative SCE?

lshep commented 1 year ago

Also 500 mb seems like a lot of data. Could a smaller subset be used? Many times a smaller example is enough for proof of principle.

multimeric commented 1 year ago

Also 500 mb seems like a lot of data. Could a smaller subset be used? Many times a smaller example is enough for proof of principle.

It would be possible, just annoying. The metadata file is a bit of a monolith and I'd rather not have to maintain a separate "demo metadata" file that would also complicate the vignette with advanced parameters.

If the downloaded data are properly cached then once the build system has the resources cached, it will not need to download them again and event times will be reduced. Are we talking about the metadata resource or illustrative SCE?

Yes, downloaded data is cached, and I'm fairly certain that is working properly. The largest download is probably the metadata file itself, which is ~450 Mb.

My best guess right now about what is happening is that the data is located in Australia and all of my testing has been in Australia, but the BioC build machine is presumably not. The slower download might be caused by that geographic distance.

lshep commented 1 year ago

Annoying maybe but please keep in mind we have thousands of packages we are building and providing support for. And users might not always have access to 500 mb of space of a system. Is the full metadata file required for full functionality of the package? Where are you hosting the data currently?

LiNk-NY commented 1 year ago

The check took quite a bit of time for me too

real    10m7.507s
user    11m3.190s
sys 0m12.881s

FWIW, the installation of duckdb takes quite a bit of time but this is separate from the R CMD check.

LiNk-NY commented 1 year ago

Hi Michael, @multimeric Thank you for your submission. Please see the review below. Best, Marcel


CuratedAtlasQueryR #2991

DESCRIPTION

NAMESPACE

vignettes

R

tests

dev

/

multimeric commented 1 year ago

Hi @LiNk-NY, thanks for the review.

Is it possible to avoid the eval=FALSE in the metadata chunk and others?

Would you mind clarifying which chunk(s) you are referring to?

Is realized data cached?

Also which data to you mean here?

LiNk-NY commented 1 year ago

Would you mind clarifying which chunk(s) you are referring to?

I mean these and a chunk above the linked section:

https://github.com/stemangiola/CuratedAtlasQueryR/blob/d2bf12e09f18ed6e4331214b03e1a11e4d40e997/vignettes/Introduction.Rmd#L171-L293

Is realized data cached?

Also which data to you mean here?

The assay and metadata data. It looks like get_default_cache_dir() is used to cache and it is possible to cache the connection.

LiNk-NY commented 1 year ago

BTW you can knit the vignette into the README.md with code like in: https://github.com/waldronlab/TENxIO/blob/devel/inst/scripts/README.Rmd There are ways to do this automatically via GitHub Actions as well.

LiNk-NY commented 1 year ago

https://github.com/stemangiola/CuratedAtlasQueryR/blob/d2bf12e09f18ed6e4331214b03e1a11e4d40e997/vignettes/Introduction.Rmd#L165-L174

AFAIU, saving an object with an HDF5 backend only serializes the seed links within the SingleCellExperiment assay. This is the reason it is fast. If the path to the H5 files are altered (e.g., shared) the Rds file no longer works. The recommended way of saving an object with an HDF5 backend is to use saveHDF5SummarizedExperiment in HDF5Array. Perhaps @hpages can confirm.

multimeric commented 1 year ago

AFAIU, saving an object with an HDF5 backend only serializes the seed links within the SingleCellExperiment assay. This is the reason it is fast. If the path to the H5 files are altered (e.g., shared) the Rds file no longer works. The recommended way of saving an object with an HDF5 backend is to use saveHDF5SummarizedExperiment in HDF5Array.

Yes, that is the point being made in the vignette. That saving the unrealised DelayedArray is fast because the computation hasn't happened yet. saveHDF5SummarizedExperiment is explained in the next section.

LiNk-NY commented 1 year ago

It wasn't clear in the vignette and it should probably be discouraged because the Rds file is not portable. For example, if you share the Rds file, the collaborator will get this error because they do not have the original h5 file.

> library(SingleCellExperiment)
> test <- readRDS("/home/rstudio/data/single_cell_counts.rds")
> assay(test)
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'type': file '/home/user/.cache/R/CuratedAtlasQueryR/0.2.1/original/bc380dae8b14313a870973697842878b/assays.h5' does not exist
stemangiola commented 1 year ago

Hello @LiNk-NY, I am trying to push to bioc with

git push upstream devel

but I get

FATAL: W any packages/CuratedAtlasQueryR stemangiola DENIED by fallthru
(or you mis-spelled the reponame)
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Could you please advise

lshep commented 1 year ago

Please try again; I believe we have fixed the issue.

stemangiola commented 1 year ago

Thanks @lshep ,

now I get a different error

(base) [mangiola.s@rstudio-2 CuratedAtlasQueryR]$ git push upstream devel
error: src refspec devel does not match any.
error: failed to push some refs to 'git@git.bioconductor.org:packages/CuratedAtlasQueryR.git'
bioc-issue-bot commented 1 year ago

Received a valid push on git.bioconductor.org; starting a build for commit id: e05b0073d54f3f9b86c39e16b69a6bc2c977ad8d

multimeric commented 1 year ago

@stemangiola That's because our main branch isn't called devel.

bioc-issue-bot commented 1 year ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CuratedAtlasQueryR to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

multimeric commented 1 year ago

@LiNk-NY

While we wait for the build report, here is my response to your feedback.

Consider checking for typos (Should "agnency" be "agency"?)

This typo has been fixed. Also I've added a spell checker to our test suite which will detect much such issues: https://github.com/stemangiola/CuratedAtlasQueryR/blob/master/tests/spelling.R. The one you identified was in the name field of the DESCRIPTION so couldn't be automatically detected anyway.

Consider including an ORCiD identifier at least for the maintainer.

Done. Stefano and I have our ORCiDs listed

Consider combining get_*metadata into a single function with an unharmonized = TRUE/FALSE option.

This is something we considered earlier, before submitting. The API is designed this way because the harmonised vs unharmonised metadata are completely different in structure, and we hope the average user will not have to think about get_unharmonised() at all. The API is designed this way to encourage the user to use get_metadata() instead.

Minor: Consider using camelCase vs snake_case for exported functions.

The package has been standardised to use snake_case. This only involved changing one function: getSingleCellExperiment to get_single_cell_experiment.

Provide a short explain with regards to how the data is hosted and from what project it originates.

This is now mentioned in the README.

Is it possible to avoid the eval=FALSE in the metadata chunk and others?

I've created a minimal metadata file that means the entire vignette is now evaluated.

Is realized data cached?

Still not entirely sure what this refers to, but yes we cache the assay data and metadata. We don't cache user-transformed versions of either, and we leave that up to the user.

Include a sessionInfo() section

This is now in the vignette

Consider using a single assert_that call and separate expressions with commas ,

This is done where possible, but it isn't possible in all cases because each assert_that can only have one error message per function call.

Perhaps use force(data) in get_SingleCellExperiment?

Done, this is clearer in intention.

dev Consider moving this folder into the inst directory

This isn't installed in the package (it's in the .Rbuildignore), so hopefully it's location in the repo doesn't matter

BTW you can knit the vignette into the README.md with code like in:

The vignette and the readme have now been unified

It wasn't clear in the vignette and it should probably be discouraged because the Rds file is not portable. For example, if you share the Rds file, the collaborator will get this error because they do not have the original h5 file.

I have added more detail about the tradeoffs of each serialization method.

multimeric commented 1 year ago

I'm a bit perplexed by the build failure though. I added a much smaller metadata file as suggested, and it still seems to be timing out. Any input would be appreciated.

LiNk-NY commented 1 year ago

I have tried building the package locally and it is taking quite a bit of time.

LiNk-NY commented 1 year ago

Hi Michael, @multimeric Thank you for making those changes. A couple of minor points:

Your package has been accepted on the condition that the above will be considered. Best, Marcel

bioc-issue-bot commented 1 year ago

Your package has been accepted. It will be added to the Bioconductor nightly builds.

Thank you for contributing to Bioconductor!

Reviewers for Bioconductor packages are volunteers from the Bioconductor community. If you are interested in becoming a Bioconductor package reviewer, please see Reviewers Expectations.

multimeric commented 1 year ago

Thanks for the feedback, will implement those two points. Can you explain why it was accepted despite the build time issue?

lshep commented 1 year ago

@LiNk-NY I would also like verification on this point? It has not built once on the SPB without a timeout? We don't even have a R CMD check and BiocCheck for it?

LiNk-NY commented 1 year ago

Hi Michael and Lori, @multimeric @lshep

I ran it on my machine and it passed though it took about 10 minutes. It seemed to me to be an SPB issue but please create a commit and have the SPB give it another go.

Best, Marcel

multimeric commented 1 year ago

Does the build machine have a persistent cache? Because I changed the metadata file used in the vignette into a smaller sample one, which made the build much faster. However if the old metadata file were still cached I could see the build taking a while.

lshep commented 1 year ago

@multimeric if you utilized caching mechanisms then yes the cache should be persistent. What would I be looking for to delete out/reset?

multimeric commented 1 year ago

CuratedAtlasQueryR:::get_default_cache_dir() |> unlink(TRUE, TRUE, TRUE) should do it, I think.

lshep commented 1 year ago

Could you bump the version and try again. I reset the cache.

bioc-issue-bot commented 1 year ago

Received a valid push on git.bioconductor.org; starting a build for commit id: ffe7520ef59c2b05b3fa01cdaf4e90bcb5aca892

bioc-issue-bot commented 1 year ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "TIMEOUT, skipped". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CuratedAtlasQueryR to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

stemangiola commented 1 year ago

Hello @LiNk-NY , and All,

thanks a lot for reviewing this package.

Could we please accept the package, as the timeout is due to the long installation of DuckDB, which is not our fault, and there is nothing we can do about it.

thanks a lot.

LiNk-NY commented 1 year ago

Hi Stefano, @stemangiola

Thank you for your patience.

The package should not install any packages and I don't think it installs duckdb while running R CMD build?

Building the package locally takes less than 2 minutes on my end.
I am not sure what the issue is.

Do you have any ideas Lori? @lshep

lshep commented 1 year ago

I also got a build in a few minutes when I manually attempted. I'm going to manually trigger another build to see what happens and see if it has corrected itself.

bioc-issue-bot commented 1 year ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/CuratedAtlasQueryR to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

multimeric commented 1 year ago

Okay the build seems to be making progress now.

ERROR: Maintainer must add package name to Watched Tags on the support site; Edit your Support Site User Profile to add Watched Tags.

I'm not really sure what this means, considering the package isn't yet in BioC, but maybe @stemangiola you can do something here?

stemangiola commented 1 year ago

I tried, but I cannot. I think first we accept it on Bioconductor, and then I will add the Watched Tags

image
vjcitn commented 1 year ago

I guess tag adding and the error message could be clarified. Edit your profile at support.bioconductor.org. Go to "My tags" field and add any string. It becomes a watched tag. @LiNk-NY am I right? Should we provide more doc on this in BiocCheck?

tags

LiNk-NY commented 1 year ago

The package doesn't have to be in Bioconductor to be able to add a Watched Tag. The maintainer email must agree in both the package DESCRIPTION and support site. It looks like @stemangiola has added the tag in the screenshot. It should work with a version bump.

LiNk-NY commented 1 year ago

Hi Stefano, @stemangiola Any updates on this?

@multimeric In the meantime, please resolve this warning for formals consistency:

* checking S3 generic/method consistency ... WARNING
as.sparse:
  function(x, ...)
as.sparse.DelayedMatrix:
  function(x)
See section 'Generic functions and methods' in the 'Writing R
Extensions' manual.

Once ready, please bump the version and push for another build.

-Marcel

stemangiola commented 1 year ago

@multimeric I had a look but I'm not sure about the logics of as.sparse. Would you mind fix and push to bioconductor.

We might be on the edge of acceptance.

Thanks a lot.