bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

writing bpcells matrices to Zarr stores #139

Open Artur-man opened 1 month ago

Artur-man commented 1 month ago

Hi,

I was wondering if you guys are considering adding support for writing matrices to zarr stores? Zarr has a hierarchical structure similar to hdf5, and implementations already exist in several programming languages, including R and C++.

Here are some more zarr/R related links: https://github.com/keller-mark/pizzarr/ https://github.com/BIMSBbioinfo/ZarrArray

al2na commented 1 month ago

Yes, this is a good idea; zarr support is highly needed for single cell analysis.

frenkiboy commented 1 month ago

It would be amazing to have this feature!

alexg9010 commented 1 month ago

Regarding the currently available Zarr implementations for C++, this overview table from the Z5 developer could help in choosing a backend.

bnprks commented 1 month ago

Hi folks, this is a very nice suggestion and happy to include it on our roadmap!

I have plans for an upcoming internal change that would make it possible to create separate R packages that can interoperate efficiently with BPCells at the C/C++ level. Then it would be very sensible to make a companion package that provides read/write support for zarr.

This would allow us to avoid complicating the BPCells build process -- it is already the source of a lot of user difficulty, so if zarr support is a separate package then only users who want zarr support will have to deal with any increased build complexity (e.g. also requiring cmake to be installed).

Does anyone have C++ familiarity that might be interested in getting mentored through adding a BPCells zarr support package in a few months? (I could ping you once the required changes are made in BPCells to make companion packages possible)

Artur-man commented 1 month ago

Dear Ben,

I would like to thank you for the prompt response.

We (specifically @alexg9010 and I) would really like to tackle this and get help from you guys to implement zarr backends. Both @alexg9010 and I have some (and improving) C++ familiarity. We will be looking forward to your response then.

Also, I truly agree with your approach to having companion packages, which would be similar to DelayedArray backends, e.g. HDF5Array.

Note: It appears TensorStore has already been used to provide zarr support for a C++-based image processing toolkit (ITK): https://forum.image.sc/t/c-zarr-library/70159/25.

bnprks commented 1 month ago

Hi @Artur-man and @alexg9010, that sounds great! I think we can get started with some zarr-only prototyping with C++, then once that's set we can integrate with BPCells. (If the prototyping goes quickly, there may be a bit of a gap while I update BPCells to allow proper interoperability)

The initial steps would be starting simple:

  1. Get tensorstore to build with CMake
  2. Read support prototype: given a zarr 1D array, read a contiguous slice of indices to memory
  3. Write support prototype: Write a new zarr 1D array piece-by-piece without having to hold the full array contents in memory (e.g. write the numbers 1 to 1e9 without using more than 100MB of RAM)
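
To make steps 2 and 3 concrete, here is a rough sketch of the idea in plain Python. It does not use tensorstore; it assumes an uncompressed Zarr v2 directory layout (a `.zarray` metadata file plus one file per chunk, named by chunk index) with little-endian int64 values. The function names `write_zarr_1d` and `read_zarr_1d` are hypothetical, made up for illustration only.

```python
import json
import os
import struct
import tempfile

ITEM_SIZE = 8  # bytes per little-endian int64 ("<i8")


def _flush(path, chunk_idx, buf, chunk_len):
    # Zarr v2 stores every chunk at full size, so pad the final partial chunk.
    padded = buf + [0] * (chunk_len - len(buf))
    with open(os.path.join(path, str(chunk_idx)), "wb") as f:
        f.write(struct.pack(f"<{chunk_len}q", *padded))


def write_zarr_1d(path, n, chunk_len, values):
    """Stream n int64 values into a 1D array, holding one chunk in memory."""
    os.makedirs(path, exist_ok=True)
    meta = {
        "zarr_format": 2,
        "shape": [n],
        "chunks": [chunk_len],
        "dtype": "<i8",
        "compressor": None,  # assumption: no compression, to keep the sketch short
        "fill_value": 0,
        "order": "C",
        "filters": None,
    }
    with open(os.path.join(path, ".zarray"), "w") as f:
        json.dump(meta, f)
    buf, chunk_idx = [], 0
    for v in values:
        buf.append(v)
        if len(buf) == chunk_len:
            _flush(path, chunk_idx, buf, chunk_len)
            chunk_idx += 1
            buf = []
    if buf:
        _flush(path, chunk_idx, buf, chunk_len)


def read_zarr_1d(path, start, stop):
    """Read indices [start, stop) by touching only the chunks that overlap."""
    with open(os.path.join(path, ".zarray")) as f:
        meta = json.load(f)
    n, chunk_len = meta["shape"][0], meta["chunks"][0]
    if not 0 <= start <= stop <= n:
        raise IndexError("slice out of bounds")
    out = []
    if start == stop:
        return out
    for c in range(start // chunk_len, (stop - 1) // chunk_len + 1):
        base = c * chunk_len
        lo = max(start, base) - base
        hi = min(stop, base + chunk_len) - base
        # Assumes every chunk file exists (true for arrays written above).
        with open(os.path.join(path, str(c)), "rb") as f:
            f.seek(lo * ITEM_SIZE)
            data = f.read((hi - lo) * ITEM_SIZE)
        out.extend(struct.unpack(f"<{hi - lo}q", data))
    return out


# Usage: write the numbers 1..1000 in chunks of 64, then read a slice
# that straddles a chunk boundary.
store = os.path.join(tempfile.mkdtemp(), "numbers.zarr")
write_zarr_1d(store, 1000, 64, range(1, 1001))
print(read_zarr_1d(store, 60, 70))
```

Because `values` can be any iterator, the writer never needs more than one chunk in memory, which is the property step 3 asks for. A real prototype would swap the hand-written file handling for tensorstore and add compression.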

I think it might make sense for me to start up a BPCells Slack workspace for us to have easier back-and-forth, does that sound good to you?

Artur-man commented 1 month ago

I agree with the Slack approach! We have already started to cook a small repo for building tensorstore, looking forward to it.

bnprks commented 1 month ago

That sounds great! I just sent @Artur-man and @alexg9010 a slack invite link via email -- let me know if you had any trouble receiving it.

For anyone else who is interested in getting involved in some BPCells-related coding, just ask and I can add you as well.