@multimeric closed this issue 1 year ago
Hi @multimeric
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Type: Package
Package: CuratedAtlasQueryR
Title: Queries the Human Cell Atlas
Version: 0.99.1
Authors@R: c(
person(
"Stefano",
"Mangiola",
email = "mangiolastefano@gmail.com",
role = c("aut", "cre", "rev")
),
person(
"Michael",
"Milton",
email = "milton.m@wehi.edu.au",
role = c("aut", "rev")
),
person(
"Martin",
"Morgan",
email = "Martin.Morgan@RoswellPark.org",
role = c("ctb", "rev")
),
person(
"Vincent",
"Carey",
email = "stvjc@channing.harvard.edu",
role = c("ctb", "rev")
),
person(
"Julie",
"Iskander",
email = "iskander.j@wehi.edu.au",
role = c( "rev")
),
person(
"Tony",
"Papenfuss",
email = "papenfuss@wehi.edu.au",
role = c( "rev")
),
person(
"Silicon Valley Foundation",
"CZF2019-002443",
role = c( "fnd")
),
person(
"NIH NHGRI",
"5U24HG004059-18",
role = c( "fnd")
),
person(
"Victoria Cancer Agnency",
"ECRF21036",
role = c( "fnd")
),
person(
"NHMRC",
"1116955",
role = c( "fnd")
))
Description: Provides access to a copy of the Human Cell Atlas, but with
harmonised metadata. This allows for uniform querying across numerous
datasets within the Atlas using common fields such as cell type, tissue
type, and patient ethnicity. Usage involves first querying the metadata
table for cells of interest, and then downloading the corresponding cells
into a SingleCellExperiment object.
License: GPL-3
Depends:
R (>= 4.2.0)
Imports:
dplyr,
SummarizedExperiment,
SingleCellExperiment,
purrr (>= 1.0.0),
BiocGenerics,
glue,
HDF5Array,
DBI,
tools,
httr,
cli,
assertthat,
SeuratObject,
Seurat,
methods,
rlang,
stats,
S4Vectors,
tibble,
utils,
dbplyr (>= 2.3.0),
duckdb,
stringr
Suggests:
zellkonverter,
rmarkdown,
knitr,
testthat,
basilisk,
arrow,
reticulate
Biarch: true
biocViews:
AssayDomain,
Infrastructure,
RNASeq,
DifferentialExpression,
GeneExpression,
Normalization,
Clustering,
QualityControl,
Sequencing,
Transcription,
Transcriptomics
Encoding: UTF-8
RoxygenNote: 7.2.3
LazyDataCompression: xz
URL: https://github.com/stemangiola/CuratedAtlasQueryR
BugReports: https://github.com/stemangiola/CuratedAtlasQueryR/issues
VignetteBuilder: knitr
Roxygen: list(markdown = TRUE)
Collate:
'utils.R'
'counts.R'
'dev.R'
'metadata.R'
'seurat.R'
'unharmonised.R'
'zzz.R'
A reviewer has been assigned to your package. Learn what to expect during the review process.
IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.
Bioconductor utilized your GitHub SSH keys for git.bioconductor.org access. To manage keys and future access, you may want to activate your Bioconductor Git Credentials Account.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/CuratedAtlasQueryR
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
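As a sketch of what that remote setup looks like (the repository path is the one given in the message above; the throwaway demo directory is purely illustrative, run the `git remote add` in your real clone):

```shell
# Add the git.bioconductor.org remote described in the message above.
# We use a fresh demo repo here so the commands are safe to run as-is.
cd "$(mktemp -d)"
git init -q CuratedAtlasQueryR && cd CuratedAtlasQueryR
git remote add upstream git@git.bioconductor.org:packages/CuratedAtlasQueryR
git remote -v
# In the real clone, a version-bump commit pushed to upstream then
# triggers a new build:
#   git push upstream <branch>
```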
Is there any way to access the logs of the Bioc build? Also, are there any known issues relating to network access or filesystem usage on the build machine? When I run a build locally it finishes in just under 2 minutes, so it shouldn't be hitting the 15-minute timeout:
$ gh repo clone stemangiola/CuratedAtlasQueryR
Cloning into 'CuratedAtlasQueryR'...
remote: Enumerating objects: 2078, done.
remote: Counting objects: 100% (817/817), done.
remote: Compressing objects: 100% (355/355), done.
remote: Total 2078 (delta 524), reused 676 (delta 444), pack-reused 1261
Receiving objects: 100% (2078/2078), 1.86 MiB | 1.30 MiB/s, done.
Resolving deltas: 100% (1256/1256), done.
$ /usr/bin/time R CMD build CuratedAtlasQueryR
* checking for file ‘CuratedAtlasQueryR/DESCRIPTION’ ... OK
* preparing ‘CuratedAtlasQueryR’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Omitted ‘LazyDataCompression’ from DESCRIPTION
* building ‘CuratedAtlasQueryR_0.99.1.tar.gz’
86.92user 13.00system 1:45.13elapsed 95%CPU (0avgtext+0avgdata 1515732maxresident)k
2192328inputs+16920outputs (1010major+3387858minor)pagefaults 0swaps
R CMD build also only takes 216.70 seconds, i.e. < 4 minutes, on my own Mac, using my home internet connection, which isn't even that fast. So I'm not sure what is happening here.
Are you testing with the latest versions of all packages, i.e. R 4.3 and Bioc 3.17? Run BiocManager::valid() to make sure that you have the expected versions of packages. I just tried locally on my Linux box; it has been 15 minutes and it still has not finished building. Is a large amount of data being downloaded? Are you using caching mechanisms?
Yes, it downloads a decent amount of data to run the vignette. Maybe 500 MB at a guess? Is that likely to be an issue? Is the time to install dependencies included in the build time? I thought not.
Package dependency installation, no; but data download and actively building the vignette, yes.
If the downloaded data are properly cached, then once the build system has the resources cached it will not need to download them again, and build times will be reduced. Are we talking about the metadata resource or the illustrative SCE?
Also, 500 MB seems like a lot of data. Could a smaller subset be used? Many times a smaller example is enough for proof of principle.
> Also, 500 MB seems like a lot of data. Could a smaller subset be used? Many times a smaller example is enough for proof of principle.

It would be possible, just annoying. The metadata file is a bit of a monolith, and I'd rather not have to maintain a separate "demo metadata" file that would also complicate the vignette with advanced parameters.
> If the downloaded data are properly cached, then once the build system has the resources cached it will not need to download them again, and build times will be reduced. Are we talking about the metadata resource or the illustrative SCE?

Yes, downloaded data is cached, and I'm fairly certain that is working properly. The largest download is probably the metadata file itself, which is ~450 MB.
My best guess right now about what is happening is that the data is located in Australia, and all of my testing has been in Australia, but the BioC build machine presumably is not. The slower download might be caused by that geographic distance.
Annoying maybe, but please keep in mind we have thousands of packages we are building and providing support for. And users might not always have 500 MB of space available on a system. Is the full metadata file required for full functionality of the package? Where are you hosting the data currently?
The check took quite a bit of time for me too:
real 10m7.507s
user 11m3.190s
sys 0m12.881s
FWIW, the installation of duckdb takes quite a bit of time, but this is separate from the R CMD check.
Hi Michael, @multimeric Thank you for your submission. Please see the review below. Best, Marcel
- Consider checking for typos. (Should "agnency" be "agency"?)
- Consider including an ORCiD identifier at least for the maintainer.
- Consider combining get_*metadata into a single function with an unharmonized = TRUE/FALSE option.
- Minor: Consider using camelCase vs snake_case for exported functions.
- Provide a short explanation of how the data is hosted and from what project it originates.
- Is it possible to avoid the eval=FALSE in the metadata chunk and others?
- Is realized data cached?
- Include a sessionInfo() section.
- Consider using a single assert_that call and separate expressions with commas (,).
- Perhaps use force(data) in get_SingleCellExperiment?
- dev: Consider moving this folder into the inst directory.
- tissue_label.csv seems out of place; perhaps move it into inst/extdata?

Hi @LiNk-NY, thanks for the review.
> Is it possible to avoid the eval=FALSE in the metadata chunk and others?

Would you mind clarifying which chunk(s) you are referring to?
> Is realized data cached?

Also, which data do you mean here?
> Would you mind clarifying which chunk(s) you are referring to?

I mean these and a chunk above the linked section:
> Is realized data cached?
> Also, which data do you mean here?

The assay and metadata data. It looks like get_default_cache_dir() is used to cache, and it is possible to cache the connection.
BTW, you can knit the vignette into the README.md with code like in:
https://github.com/waldronlab/TENxIO/blob/devel/inst/scripts/README.Rmd
There are ways to do this automatically via GitHub Actions as well.
AFAIU, saving an object with an HDF5 backend only serializes the seed links within the SingleCellExperiment assay. This is the reason it is fast. If the path to the H5 files is altered (e.g., shared), the Rds file no longer works. The recommended way of saving an object with an HDF5 backend is to use saveHDF5SummarizedExperiment in HDF5Array.
Perhaps @hpages can confirm.
> AFAIU, saving an object with an HDF5 backend only serializes the seed links within the SingleCellExperiment assay. This is the reason it is fast. If the path to the H5 files is altered (e.g., shared), the Rds file no longer works. The recommended way of saving an object with an HDF5 backend is to use saveHDF5SummarizedExperiment in HDF5Array.
Yes, that is the point being made in the vignette: saving the unrealised DelayedArray is fast because the computation hasn't happened yet. saveHDF5SummarizedExperiment is explained in the next section.
It wasn't clear in the vignette, and it should probably be discouraged because the Rds file is not portable. For example, if you share the Rds file, the collaborator will get this error because they do not have the original h5 file:
> library(SingleCellExperiment)
> test <- readRDS("/home/rstudio/data/single_cell_counts.rds")
> assay(test)
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'type': file '/home/user/.cache/R/CuratedAtlasQueryR/0.2.1/original/bc380dae8b14313a870973697842878b/assays.h5' does not exist
Hello @LiNk-NY, I am trying to push to Bioconductor with git push upstream devel, but I get:
FATAL: W any packages/CuratedAtlasQueryR stemangiola DENIED by fallthru
(or you mis-spelled the reponame)
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Could you please advise?
Please try again; I believe we have fixed the issue.
Thanks @lshep ,
now I get a different error
(base) [mangiola.s@rstudio-2 CuratedAtlasQueryR]$ git push upstream devel
error: src refspec devel does not match any.
error: failed to push some refs to 'git@git.bioconductor.org:packages/CuratedAtlasQueryR.git'
Received a valid push on git.bioconductor.org; starting a build for commit id: e05b0073d54f3f9b86c39e16b69a6bc2c977ad8d
@stemangiola That's because our main branch isn't called devel.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/CuratedAtlasQueryR
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
@LiNk-NY
While we wait for the build report, here is my response to your feedback.
> Consider checking for typos. (Should "agnency" be "agency"?)

This typo has been fixed. I've also added a spell checker to our test suite, which will detect many such issues: https://github.com/stemangiola/CuratedAtlasQueryR/blob/master/tests/spelling.R. The one you identified was in the name field of the DESCRIPTION, so it couldn't have been detected automatically anyway.
> Consider including an ORCiD identifier at least for the maintainer.

Done. Stefano and I have our ORCiDs listed.
> Consider combining get_*metadata into a single function with an unharmonized = TRUE/FALSE option.

This is something we considered before submitting. The API is designed this way because the harmonised and unharmonised metadata are completely different in structure; we hope the average user will not have to think about get_unharmonised() at all, and will use get_metadata() instead.
> Minor: Consider using camelCase vs snake_case for exported functions.

The package has been standardised to use snake_case. This only involved changing one function: getSingleCellExperiment to get_single_cell_experiment.
> Provide a short explanation of how the data is hosted and from what project it originates.

This is now mentioned in the README.
> Is it possible to avoid the eval=FALSE in the metadata chunk and others?

I've created a minimal metadata file that means the entire vignette is now evaluated.
> Is realized data cached?

Still not entirely sure what this refers to, but yes, we cache the assay data and the metadata. We don't cache user-transformed versions of either; we leave that up to the user.
> Include a sessionInfo() section.

This is now in the vignette.
> Consider using a single assert_that call and separate expressions with commas (,).

This is done where possible, but it isn't possible in all cases, because each assert_that call can only have one error message.
> Perhaps use force(data) in get_SingleCellExperiment?

Done; this makes the intention clearer.
> dev: Consider moving this folder into the inst directory.

This folder isn't installed in the package (it's in the .Rbuildignore), so hopefully its location in the repo doesn't matter.
> BTW, you can knit the vignette into the README.md with code like in:

The vignette and the README have now been unified.
> It wasn't clear in the vignette, and it should probably be discouraged because the Rds file is not portable. For example, if you share the Rds file, the collaborator will get this error because they do not have the original h5 file.

I have added more detail about the tradeoffs of each serialization method.
I'm a bit perplexed by the build failure though. I added a much smaller metadata file as suggested, and it still seems to be timing out. Any input would be appreciated.
I have tried building the package locally and it is taking quite a bit of time.
Hi Michael, @multimeric Thank you for making those changes. A couple of minor points:
- Use BiocStyle to style the vignette.
Your package has been accepted on the condition that the above will be considered. Best, Marcel
Your package has been accepted. It will be added to the Bioconductor nightly builds.
Thank you for contributing to Bioconductor!
Reviewers for Bioconductor packages are volunteers from the Bioconductor community. If you are interested in becoming a Bioconductor package reviewer, please see Reviewers Expectations.
Thanks for the feedback, will implement those two points. Can you explain why it was accepted despite the build time issue?
@LiNk-NY I would also like verification on this point. It has not built once on the SPB without a timeout, and we don't even have R CMD check or BiocCheck results for it.
Hi Michael and Lori, @multimeric @lshep
I ran it on my machine and it passed though it took about 10 minutes. It seemed to me to be an SPB issue but please create a commit and have the SPB give it another go.
Best, Marcel
Does the build machine have a persistent cache? Because I changed the metadata file used in the vignette into a smaller sample one, which made the build much faster. However if the old metadata file were still cached I could see the build taking a while.
@multimeric If you utilized caching mechanisms then yes, the cache should be persistent. What should I be looking for to delete/reset?
CuratedAtlasQueryR:::get_default_cache_dir() |> unlink(TRUE, TRUE, TRUE)
should do it, I think.
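A shell equivalent, for reference. The path assumes the default cache location visible in the error message earlier in this thread (~/.cache/R/CuratedAtlasQueryR); treat it as an assumption and verify with the get_default_cache_dir() call above before deleting anything.

```shell
# Clear the CuratedAtlasQueryR download cache so the next build re-fetches data.
# NOTE: the path is an assumed default; confirm it first with
# CuratedAtlasQueryR:::get_default_cache_dir() in R.
CACHE="$HOME/.cache/R/CuratedAtlasQueryR"
mkdir -p "$CACHE"      # simulate an existing cache for this demo
rm -rf "$CACHE"
[ -d "$CACHE" ] && echo "still present" || echo "cache cleared"
# prints "cache cleared"
```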
Could you bump the version and try again. I reset the cache.
Received a valid push on git.bioconductor.org; starting a build for commit id: ffe7520ef59c2b05b3fa01cdaf4e90bcb5aca892
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "TIMEOUT, skipped". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/CuratedAtlasQueryR
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Hello @LiNk-NY and all,
Thanks a lot for reviewing this package.
Could we please accept the package? The timeout is due to the long installation of DuckDB, which is not our fault, and there is nothing we can do about it.
Thanks a lot.
Hi Stefano, @stemangiola
Thank you for your patience.
The package should not install any packages, and I don't think it installs duckdb while running R CMD build?
Building the package locally takes less than 2 minutes on my end.
I am not sure what the issue is.
Do you have any ideas Lori? @lshep
I also got a build in a few minutes when I manually attempted. I'm going to manually trigger another build to see what happens and see if it has corrected itself.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/CuratedAtlasQueryR
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Okay the build seems to be making progress now.
ERROR: Maintainer must add package name to Watched Tags on the support site; Edit your Support Site User Profile to add Watched Tags.
I'm not really sure what this means, considering the package isn't yet in BioC, but maybe @stemangiola you can do something here?
I tried, but I cannot. I think first we accept it on Bioconductor, and then I will add the Watched Tags
I guess tag adding and the error message could be clarified. Edit your profile at support.bioconductor.org. Go to "My tags" field and add any string. It becomes a watched tag. @LiNk-NY am I right? Should we provide more doc on this in BiocCheck?
The package doesn't have to be in Bioconductor to be able to add a Watched Tag. The maintainer email must agree in both the package DESCRIPTION and the support site. It looks like @stemangiola has added the tag in the screenshot. It should work with a version bump.
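The version bump itself is just a one-line edit to DESCRIPTION followed by a push. A sketch, using a throwaway copy of the file (the version numbers are the ones from this thread; the temp directory and commit message are illustrative):

```shell
# Bump the Version field in DESCRIPTION without sed -i (for portability),
# then commit and push to the Bioconductor remote to trigger a build.
cd "$(mktemp -d)"      # demo directory; run this in your real clone instead
printf 'Package: CuratedAtlasQueryR\nVersion: 0.99.1\n' > DESCRIPTION
sed 's/^Version: .*/Version: 0.99.2/' DESCRIPTION > DESCRIPTION.tmp
mv DESCRIPTION.tmp DESCRIPTION
grep '^Version:' DESCRIPTION
# prints "Version: 0.99.2"
# Then, in the real clone:
#   git commit -am "Version bump" && git push upstream <branch>
```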
Hi Stefano, @stemangiola Any updates on this?
@multimeric In the meantime, please resolve this warning for formals consistency:
* checking S3 generic/method consistency ... WARNING
as.sparse:
function(x, ...)
as.sparse.DelayedMatrix:
function(x)
See section 'Generic functions and methods' in the 'Writing R
Extensions' manual.
Once ready, please bump the version and push for another build.
-Marcel
@multimeric I had a look, but I'm not sure about the logic of as.sparse. Would you mind fixing it and pushing to Bioconductor?
We might be on the edge of acceptance.
Thanks a lot.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand Bioconductor Package Naming Policy and acknowledge Bioconductor may retain use of package name.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
[x] I am familiar with the Bioconductor code of conduct and agree to abide by it.
I am familiar with the essential aspects of Bioconductor software management, including:
For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.