bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

Submit to CRAN or CONDA #51

Open shahrozeabbas opened 8 months ago

shahrozeabbas commented 8 months ago

Hi,

Was just curious if there is a plan to upload to CRAN with the release of Seurat v5? I know these are independent packages, but since Seurat v5 depends on BPCells, it would be nice to install via CRAN or conda.

Thanks 🙂

bnprks commented 8 months ago

Good question -- yes, it is planned to release BPCells through CRAN but it will probably be a little while (I'd estimate 2-3 months). There are a couple technical changes to the compiled code in order to meet CRAN's portability requirements, and first-time CRAN submissions also have pretty stringent documentation requirements such as examples for every public function.

I don't have a plan for conda release right now, though could be convinced to do so by someone with experience releasing R packages in conda.

I'll also mention that BPCells has some pre-built packages for Windows, Mac, and Ubuntu Jammy available through R-universe. I don't actively check that these builds are working, but they should automatically track the github main branch and help skip C++ compilation time during install.

ycli1995 commented 8 months ago

Hi, @bnprks and @shahrozeabbas,

I agree that it would be expected that BPCells can be submitted to CRAN.

I'm currently trying to build an R package to provide some sub-classes of SingleCellExperiment where assays would depend on BPCells matrices to store single-cell data. In my experience, an IterableMatrix backed by on-disk files performs better than DelayedArray on IO and most matrix mathematics. I can see that some day more excellent features would work for BPCells, such as holding project metadata along with it, and interoperation between SingleCellExperiment, SeuratObject, AnnData and so on. Therefore, it would definitely be appreciated if BPCells could become a generally accessible dependency package for the R community.

shahrozeabbas commented 7 months ago

Just circling back to this. As much as having BPCells on CRAN would be nice, I think adding it to Bioconductor would allow for it to be picked up by Bioconda automatically. I'm not 100% sure about this, but just an idea.

Adding to both CRAN and Bioconductor could be nice. Although there is overlap in the process, that's obviously more of an ask.

mainyanghr commented 7 months ago

Very frustrating: I spent several days trying to install this package and it still does not work on my M1 Mac.

bnprks commented 7 months ago

Hi @shahrozeabbas, I hadn't heard of a specific automated connection between bioconda and Bioconductor but it's an interesting thing to consider. Unfortunately the rules of Bioconductor disallow submitting a package that exists on CRAN (and CRAN at least disallows sharing a package name with a Bioconductor package), so I think it has to be one or the other.

There are definitely unique advantages to both CRAN and Bioconductor, though I'm currently leaning towards CRAN as it is the default source when using install.packages and allows a more flexible update schedule. Bioconductor has its merits too, such as more coordinated systems for testing cross-compatibility during version changes, but from where I am right now I think ease of installation might win out.

shahrozeabbas commented 7 months ago

@bnprks Yeah I think you're right, don't believe there is any automated connection for it. I agree though, CRAN may be more useful.

If I am able to submit something to Anaconda in R, I will be sure to reach out.

rschauner commented 5 months ago

> Good question -- yes, it is planned to release BPCells through CRAN but it will probably be a little while (I'd estimate 2-3 months). There are a couple technical changes to the compiled code in order to meet CRAN's portability requirements, and first-time CRAN submissions also have pretty stringent documentation requirements such as examples for every public function.
>
> I don't have a plan for conda release right now, though could be convinced to do so by someone with experience releasing R packages in conda.
>
> I'll also mention that BPCells has some pre-built packages for Windows, Mac, and Ubuntu Jammy available through R-universe. I don't actively check that these builds are working, but they should automatically track the github main branch and help skip C++ compilation time during install.

I was able to get a working conda binary (for Linux only) by running `conda build` on the following set of files, and I uploaded it to Anaconda.

meta.yaml

```yaml
{% set version = 'v0.1.0' %}
{% set posix = 'm2-' if win else '' %}
{% set native = 'm2w64-' if win else '' %}

package:
  name: r-bpcells
  version: {{ version|replace("-", "_") }}

source:
  git_rev: {{ version }}
  git_url: https://github.com/bnprks/BPCells.git

build:
  merge_build_host: True  # [win]
  # If this is a new build for the same version, increment the build number.
  number: 0
  # no skip

  # This is required to make R link correctly on Linux.
  rpaths:
    - lib/R/lib/
    - lib/

requirements:
  build:
    - {{ compiler('c') }}          # [not win]
    - {{ compiler('m2w64_c') }}    # [win]
    - {{ compiler('cxx') }}        # [not win]
    - {{ compiler('m2w64_cxx') }}  # [win]
    - {{ posix }}filesystem        # [win]
    - {{ posix }}make
    - {{ posix }}sed               # [win]
    - {{ posix }}coreutils         # [win]
  host:
    - r-base >=4.3
    - r-rcpp>=1.0.7
    - r-magrittr
    - r-matrix
    - r-rlang
    - r-vctrs
    - r-stringr
    - r-tibble
    - r-dplyr
    - r-tidyr
    - r-ggplot2
    - r-scales
    - r-patchwork
    - r-scattermore
    - r-ggrepel
    - r-rcolorbrewer
    - r-hexbin
    - r-rcppeigen
    - hdf5
  run:
    - r-base >=4.3
    - {{native}}gcc-libs  # [win]
    - r-rcpp>=1.0.7
    - r-magrittr
    - r-matrix
    - r-rlang
    - r-vctrs
    - r-stringr
    - r-tibble
    - r-dplyr
    - r-tidyr
    - r-ggplot2
    - r-scales
    - r-patchwork
    - r-scattermore
    - r-ggrepel
    - r-rcolorbrewer
    - r-hexbin
    - r-rcppeigen

test:
  commands:
    # You can put additional test commands to be run here.
    - $R -e "library('BPCells')"  # [not win]
    - "\"%R%\" -e \"library('BPCells')\""  # [win]

about:
  home: https://bnprks.github.io/BPCells/index.html
  summary: Efficient operations for single cell ATAC-seq fragments and RNA counts matrices. Interoperable with standard file formats, and introduces efficient bit-packed formats that allow large storage savings and increased read speeds.
  license: MIT
  license_family: MIT
  license_file:
    - '{{ environ["PREFIX"] }}/lib/R/share/licenses/MIT'
```

build.sh

```bash
#!/bin/bash

# 'Autobrew' is being used by more and more packages these days
# to grab static libraries from Homebrew bottles. These bottles
# are fetched via Homebrew's --force-bottle option which grabs
# a bottle for the build machine which may not be macOS 10.9.
# Also, we want to use conda packages (and shared libraries) for
# these 'system' dependencies. See:
# https://github.com/jeroen/autobrew/issues/3
export DISABLE_AUTOBREW=1

# R refuses to build packages that mark themselves as Priority: Recommended
mv DESCRIPTION DESCRIPTION.old
grep -va '^Priority: ' DESCRIPTION.old > DESCRIPTION
# shellcheck disable=SC2086
${R} CMD INSTALL --build . ${R_ARGS}
```

build.bat

```bat
"%R%" CMD INSTALL --build . %R_ARGS%
IF %ERRORLEVEL% NEQ 0 exit /B 1
```

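
For readers who want to repeat the "run conda build, then upload" step, here is a minimal sketch. It assumes the three files above sit together in one recipe folder, that `conda-build` and `anaconda-client` are installed, and that you are already logged in to Anaconda. The function name `build_and_upload` and the folder name in the usage note are hypothetical, not part of the original setup.

```bash
#!/bin/bash
# Sketch only: build a conda package from a recipe folder and upload it.
# Assumes conda-build and anaconda-client are installed; the function name
# is hypothetical and chosen for illustration.
build_and_upload() {
  local recipe_dir="$1"

  # Build the package from meta.yaml / build.sh / build.bat in the folder.
  conda build "$recipe_dir" || return 1

  # Ask conda-build where it wrote the package file, then upload that file.
  local pkg
  pkg="$(conda build "$recipe_dir" --output)" || return 1
  anaconda upload "$pkg"
}
```

It could be called as `build_and_upload ./r-bpcells`, where `./r-bpcells` stands in for wherever the recipe files are kept.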
bnprks commented 5 months ago

Hi @rschauner, thanks for setting this up! Hopefully this will make installation faster for conda users.

Just one change I'd suggest to improve portability of the build: could you set the environment variable BPCELLS_DISABLE_MARCH_NATIVE prior to the R installation in your build scripts?

Explanation of -march= flags and new way to disable it

BPCells uses the compile flag `-march=native` by default, which results in a build that utilizes all CPU instructions available on the build machine. This is good for performance, but means that if you try running on a machine that supports fewer instructions than the build machine you'll get an invalid instruction crash. This is fine when users are building on the machine they'll run on, but is problematic for a pre-built package like from conda.

I have a plan to make builds that are both optimized and CPU-agnostic, but that's not merged into the main branch yet. I've just made some modifications to the BPCells install script, so that if the environment variable `BPCELLS_DISABLE_MARCH_NATIVE` is set then the `-march=native` flag won't get set.
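
As a rough sketch of that suggestion, the variable could be exported near the top of `build.sh`, before the install step. The value `1` is an assumption: the description above only says the variable needs to be set, not what it should be set to.

```bash
#!/bin/bash
# Sketch of an addition to build.sh, placed before the R install step.
# Assumption: any value works because the install script only checks that
# the variable is set; "1" is used here for illustration.
export BPCELLS_DISABLE_MARCH_NATIVE=1

# ... the existing `${R} CMD INSTALL --build . ${R_ARGS}` line follows unchanged.
```

Because the variable is exported, it is inherited by the R process that runs the package's install script.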

With that change, I guess just two remaining questions:

  1. Would you be happy with your upload being suggested in the README as an alternative installation option for linux users?
  2. Does your solution auto-update periodically, or is that something you have to do manually?

Thanks again for figuring out this conda setup!

rschauner commented 5 months ago

The way the build is set up, I need to pull in a version from GitHub, so if you can patch that into a v0.1.1, I can build without the flag set. The build has to be run manually, but could probably be done via a GitHub action. I haven't figured out a way to build it on my M1 Mac or use conda build to cross compile it (maybe without the flag it would work).

It's already public so if you would like to suggest it in the README, I'm perfectly fine with that.

Yunuuuu commented 5 months ago

I have created a package that uses BPCells as a backend for DelayedArray objects, available at https://github.com/Yunuuuu/BPCellsArray. Now we can combine BPCells with the Bioconductor workflow.

ycli1995 commented 4 months ago

> Hi @shahrozeabbas, I hadn't heard of a specific automated connection between bioconda and Bioconductor but it's an interesting thing to consider. Unfortunately the rules of Bioconductor disallow submitting a package that exists on CRAN (and CRAN at least disallows sharing a package name with a Bioconductor package), so I think it has to be one or the other.
>
> There are definitely unique advantages to both CRAN and Bioconductor, though I'm currently leaning towards CRAN as it is the default source when using install.packages and allows a more flexible update schedule. Bioconductor has its merits too, such as more coordinated systems for testing cross-compatibility during version changes, but from where I am right now I think ease of installation might win out.

Hi, @bnprks. I agree that CRAN is a better fit than Bioconductor for BPCells, given its more flexible update schedule.

> Good question -- yes, it is planned to release BPCells through CRAN but it will probably be a little while (I'd estimate 2-3 months). There are a couple technical changes to the compiled code in order to meet CRAN's portability requirements, and first-time CRAN submissions also have pretty stringent documentation requirements such as examples for every public function.

CRAN has requirements not only for documentation but also for coding style and API specification (exported and unexported functions). In my opinion, it might be time for BPCells to start preparing for a CRAN release, since the core features such as data preprocessing and matrix manipulation have become more and more stable. Would you mind if I spend some effort on tidying up the R code and documentation of BPCells to pass the CRAN check?

bnprks commented 4 months ago

Hi @ycli1995, thanks for the offer to help. I think the best places to get started would be collecting a clear list of CRAN requirements that are not yet met, and possibly making any small (<5-line) changes that would solve one-off issues. This could include disabling long-running vignettes or tests when building on CRAN, for example.

For any requirements that require larger changes throughout the code, I think we should at least agree on the changes to be made first, and in some cases I might prefer to do things myself or heavily edit something you've written. (The example usage section for function documentation will probably be a case of this)

I am working on some technical changes on the C++ side that I think need to come through prior to CRAN release too (so that we can compile without the -march=native flag and not sacrifice performance).

mihem commented 2 months ago

@bnprks sorry, are there any updates regarding BPCells submission to CRAN? I think this would be super useful because 1) compilation takes some time, 2) many of the github issues are related to compilation, 3) BPCells is updated quite regularly.

I am still amazed by the speed of BPCells and would think that lots of users would appreciate a CRAN submission (which also means that it's precompiled, also for Linux via Posit Public Package Manager).

Thanks

bnprks commented 1 month ago

Hi @mihem, there's not much concrete progress to report, though CRAN submission is still very much on the roadmap. The fix for -march=native is now up in the branch highway-simd, so I think adding examples for every public function and worrying about CRAN's 5-10MB package size limits are probably the remaining two big challenges. That said, there are a couple options that can help address your issues 1-3 in the meantime.

For Mac/Windows users, I have an R universe project set up, which provides pre-compiled builds for Mac/Windows that I believe should come with HDF5 statically linked to avoid the most common installation challenge. Those builds automatically track the github main branch and can be installed like this for example: `install.packages("BPCells", repos = c("https://bnprks.r-universe.dev", "https://cran.r-project.org"))`

For Linux users, it is also possible to speed up compilation by editing your `~/.R/Makevars` file. The most impactful change is probably adding the line `MAKEFLAGS=--jobs=8`, where you can adjust the 8 to match how many cores you want to use in parallel for compilation. Some other changes that require additional software tools would be enabling ccache to speed up recompilation, e.g. with the line `CXX=ccache g++`, or using the faster mold linker by adding the line `LDFLAGS=-fuse-ld=mold`.
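
A small sketch of how those Makevars lines could be put together. To stay safe it writes to a scratch file for review instead of touching `~/.R/Makevars` directly. The variable names `snippet` and `n_jobs`, and the fallback of 4 jobs when `nproc` is unavailable, are illustrative assumptions.

```bash
#!/bin/bash
# Collect suggested Makevars lines in a scratch file for review.
# The variable names and the fallback job count are assumptions.
snippet="$(mktemp)"

# Parallel compile jobs: use the core count if nproc exists, else fall back to 4.
n_jobs="$(nproc 2>/dev/null || echo 4)"
echo "MAKEFLAGS=--jobs=${n_jobs}" > "$snippet"

# Optional lines, suggested only when the extra tools are actually installed.
if command -v ccache >/dev/null 2>&1; then
  echo "CXX=ccache g++" >> "$snippet"
fi
if command -v mold >/dev/null 2>&1; then
  echo "LDFLAGS=-fuse-ld=mold" >> "$snippet"
fi

# Show the result so it can be checked before use.
cat "$snippet"
```

Once the output looks right, the lines can be appended with something like `mkdir -p ~/.R && cat "$snippet" >> ~/.R/Makevars`.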

As mentioned above, CRAN has a manual review process with many requirements for submission that make it much more complicated than, say, uploading a python package to PyPI. It is still very much on the roadmap; I've just had other more urgent areas to work on personally.

mihem commented 1 month ago

Thanks for the update and the tips.

I'm on Linux, and sure, speeding up compilation is great, but installing a binary is still many times faster I guess, and of course there are no compilation errors.

I completely understand. I only wanted to highlight that a CRAN submission would be appreciated (not only by me personally, but also by packages that depend on BPCells, such as Seurat, I think).

Thanks 🙏