Worth thinking about; one downside is that sbxreader doesn't have conda packaging; I'm generally reluctant to suggest users add anything to their conda environment with pip, because pip can break conda-managed packages (and we'd never want anything like that in a notebook). Still, if our support is way out of date, we need to do something about it.
Maybe a solution will come to mind for this.
Good point - and it sounds like this is a concern whether sbxreader was just recommended in the docs or included as a dependency and used for loading (which would be another option).
I wonder if the maintainer would be open to adding sbxreader to conda-forge. Worth reaching out?
I was going to say "don't worry about pip/conda, it usually is fine, it's only a problem with things like opencv and pyqt" and then I looked into their code and found things like this:
```python
try:
    import cv2
except Exception as err:
    print('cv2 not installed, trying to install dependencies with pip ')
    from subprocess import call
    call('pip install opencv-python pyqtgraph pyqt5',shell = True)
    import cv2
```
This unfortunately has a reasonable potential to wreak havoc with an innocent conda environment. 😬
One thing I'm not sure about @ethanbb: do we only need the `loadmat.py` part of their package? If so, I would suggest we could adapt their code into Caiman (with proper attribution of course), but it looks like their code is GPL-3 and ours is GPL-2, so it may not be permissible (unless ours is GPL-2+; it doesn't seem to be, but this is something to look into more).
Ooh yeah that's a bit aggressive, but I guess an easy fix is to just raise an error in the except block instead (and not require cv2 as a package dependency, as it's not required for the main functionality). This could at least be done in the conda version, if the maintainer is willing to submit it to conda-forge. Or alternatively, we could propose the change in the main repository and then once that's merged, recommend installing sbxreader using pip, if it's unlikely to cause an issue without that.
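For concreteness, a minimal sketch of that kind of fix (my wording, not actual sbxreader code) might look like:

```python
# Sketch of the proposed fix: fail loudly instead of shelling out to pip.
try:
    import cv2
except ImportError as err:
    raise ImportError(
        'The sbxreader viewer requires cv2, pyqtgraph and pyqt5; '
        'please install them with your package manager of choice.'
    ) from err
```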
I'm pretty sure I never want to point a user at pip-only packages, and any software package that would even think about calling out to pip is a strong nope.
I think I see two likely routes forward:
1. We try to modernise our support for sbx; this might not be that hard, and if there are enough users who need it this could make things easy for them.
2. We remove sbx support and provide some instructions or hints for people to use external software (not in the caiman conda env) to convert their stuff first.
Either of these is reasonable; it's probably too high effort to try to work with the sbxreader people to fix their code and get it onto conda-forge (I don't know if we really have a lot of sbx users; I suspect not, if this is the first we've heard of it). They'd at least need to remove any callouts to pip, and that code snippet above suggests the code quality of the package is not high.
Are there other good conversion tools out there to pursue route 2?
@pgunn while I have no particular allegiance to sbxreader (I don't know the author), I'd encourage you to take a look at the code - it's only about 400 lines, half of which is the viewer, which would not be needed for caiman. I believe that excerpt is the only questionable bit, and it is only used for the viewer.
That being said, I understand the reluctance to rely on another unknown developer to do a thing and keep it maintained. If you wanted to pursue route 1 instead, it wouldn't be too much more work for me to edit my caiman feature branch that has my existing implementation to use a custom subclass of `np.memmap`, just like sbxreader, so that files can be read without fully loading into memory. (The custom subclass is needed because the values have to be processed during reading in order to get them into the right range.) It would essentially be copying the idea from their code, which I feel somewhat ambivalent about ethically, but I can at least cite them in a comment.
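For reference, the idea is roughly the following (a hypothetical sketch, not sbxreader's actual class; the class name and constant are mine):

```python
import numpy as np

UINT16_MAX = np.uint16(2**16 - 1)  # sbx stores inverted uint16 values

class InvertedSbxMemmap(np.memmap):
    """Sketch: a memmap whose reads come back inverted into the right range."""
    def __getitem__(self, index):
        raw = super().__getitem__(index)
        # Invert on the way out; note this only intercepts explicit indexing,
        # which is part of why this approach later turns out to be brittle.
        return UINT16_MAX - np.asarray(raw)
```

Instances would be constructed just like `np.memmap` itself (e.g. `InvertedSbxMemmap('file.sbx', dtype=np.uint16, mode='r', shape=...)`), since `np.memmap.__new__` handles subclasses.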
I'll give that codebase a look (although I'm starting out skeptical; every dependency has a maintenance cost).
I wouldn't worry about copying ideas like that; this isn't even slightly rare in software development and it's not that novel of an idea anyhow. We'd need to know if it's something maintainable though.
In the meantime, we should also do at least a cursory exploration of what I called option 2 above; if there are well-known tools to convert sbx files to a more common format, it might be viable to just point people at them. We could even consider writing a minimal converter ourselves; it might have less code complexity to do a one-time conversion upfront, as the access patterns would be very easy to predict.
Ah I didn't register that you were talking about converting and saving as a new file instead of reading the sbx file directly. Agree that could be a good option (although it is costly in disk space). There is an hdf5 converter included in Scanbox that seems to work well with ImageJ, but I don't think it supports multiple planes at present.
(And yeah I was more thinking about the fact that my code would probably look like a copy of theirs, but that doesn't have to literally be true)
That may be the best route; I don't know how common multi-plane images are for our users (although I suspect sbx users are pretty rare in general). We support hdf5 as an input format, so if scanbox itself can save in hdf5 (and multi-plane isn't an issue), then we just need to tell users to do that. In this case we'd probably want to remove the current (dated) sbx support to avoid confusing people.
@ethanbb I was just taking a look at the reader code and it actually doesn't do anything special, it just directly memmaps the file, which seems to be stored as an ordinary `uint16` array on disk.
I don't understand why there are 5 dimensions though: https://github.com/jcouto/sbxreader/blob/main/sbxreader/reader.py#L143
Also, it seems like it's stored in `F` order, so it might be possible to directly use these files for motion correction.
@kushalkolar I was just referring to the fact that it defines this `sbx_memmap` class that overrides the `__getitem__` function to subtract values from `UINTMAX` when reading. It is pretty straightforward.
The 5 dimensions are channels, X, Y, Z, and time (in that order, Fortran order, so channel changes most frequently, then x position, etc.). Then this package reshapes it to (time, Z, channel, Y, X), but that is somewhat arbitrary.
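To illustrate the layout (the dimensions here are made up; the real values come from the sbx metadata):

```python
import numpy as np

n_chan, n_x, n_y, n_z, n_frames = 1, 512, 796, 4, 1000  # hypothetical sizes
raw = np.memmap('example.sbx', dtype=np.uint16, mode='r', order='F',
                shape=(n_chan, n_x, n_y, n_z, n_frames))
# Reorder axes from (channel, X, Y, Z, time) to (time, Z, channel, Y, X),
# which is the shape sbxreader hands back:
movie = raw.transpose(4, 3, 0, 2, 1)
```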
Is the data stored such that the highest value in the file maps to 0 and the lowest value in the file maps to 2^16?
@pgunn I think multi-plane acquisition is pretty common for Scanbox users - it's a prominent feature for the microscope and software. If this is less of a focus for caiman, which I suspected, most users would probably just stick to suite2p, and maybe that's fine although it would be nice to be able to easily compare the different cell detection algorithms offered by different packages.
I could ask Dario from Neurolabware to update the HDF5 conversion function to take volumes into account (I believe putting Z in the 4th dimension is what caiman expects?) It would be a little more painful for users than loading sbx files directly but yeah not a terrible solution.
@kushalkolar yes (rather, 2^16-1). Don't ask me why.
2^16 - 1 actually makes more sense than 2^16, as the former is the halfway point for a 32-bit wide range of values (because the first value is 0).
(edit: this isn't quite correct, but the -1 is there because of 0)
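To spell out the arithmetic: `uint16` covers `[0, 2**16 - 1]`, so the mapping is `stored = 65535 - true`, sending 0 to 65535 and vice versa. That happens to be exactly a bitwise NOT, which a quick numpy check confirms:

```python
import numpy as np

x = np.array([0, 1, 1000, 65535], dtype=np.uint16)
# For uint16, bitwise NOT is the same as (2**16 - 1) - x:
assert np.array_equal(np.invert(x), (2**16 - 1) - x)
print(np.invert(x))  # [65535 65534 64535     0]
```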
Volumetric data is an important use case. Sometimes it falls through the cracks but please keep letting us know when this happens: it may not be fast but we can work with you to get it sorted.
@EricThomson thank you, will do!
Ah ok. Well since motion correction is going to produce its own `float32` memmap anyways, the faster route to go might be to directly convert your sbx files to `float32` memmaps with that subtraction; that's what I would do.
To elaborate, I think all you would need to do is:
- Use `__getitem__` to return the inverted vals for a given slice (similar to theirs); `numpy.invert` will be much faster than subtraction though, because it's just direct bit manipulation (probably fewer clock cycles in general).
- Create a `float32` numpy memmap file with shape `[n_pixels, n_frames]`, I think; I recommend looking into the `save_memmap()` code, but I don't have the bandwidth to dig into this. Write each frame into that memmap from the input movie by flattening it.
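A rough sketch of that recipe (the shapes, filenames, and single-plane layout are assumptions for illustration, not caiman's actual `save_memmap()` code):

```python
import numpy as np

n_frames, n_y, n_x = 1000, 512, 796              # made-up dimensions
raw = np.memmap('example.sbx', dtype=np.uint16, mode='r',
                shape=(n_frames, n_y, n_x))      # simplified sbx layout
out = np.memmap('example.mmap', dtype=np.float32, mode='w+',
                shape=(n_y * n_x, n_frames))     # [n_pixels, n_frames]
for t in range(n_frames):
    # bitwise NOT == (2**16 - 1) - x for uint16; widen to float32 on write
    out[:, t] = np.invert(raw[t]).ravel().astype(np.float32)
out.flush()
```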
@kushalkolar Thanks for the tips. It definitely makes sense to use `numpy.invert`; that's an oversight on their part.
As for saving an intermediate `float32` memmap to feed into motion correction, I'm not sure it's necessary? I am just familiarizing myself with this code, but from what I can tell, the motion correction code loads the movie in pieces, the number of which can be customized to avoid running out of memory. It looks like the actual motion correction step makes a `float64` copy of each piece, no?
```python
logging.debug("Starting tile and correct")
img = img.astype(np.float64).copy()
```
So even if `img` were a `float64` memmap, the data would be copied anyway. Seems simpler to just load the SBX file as a custom memmap with dtype `uint16` (without saving a copy), and each piece will get converted and copied in this step. Very possible I am missing something, though.
@ethanbb the wrinkle is that caiman converts to float because we do lots of floating point operations for motion correction and source extraction. Downside: why the heck is this file so big? Upside: more precision. The cost/benefit here was: hard drive space is cheap, RAM is expensive. Is it a perfect system? 🤷‍♂️ Probably not. 😬 Does it work pretty well? We hope so. If you search through our issue/discussion queue this comes up a lot.
Alright, after trying to implement a container that automatically inverts the data by subclassing np.memmap, I've realized this approach is a lot more limited and brittle than it initially seemed, and that sbxreader also suffers from this in some alarming ways. I would have to go pretty low level to avoid the object being treated as or converted to a normal ndarray without the inversion taking place. And even if it's possible, a lot of operations are probably much slower without direct access to a C array with the actual correct data.
I now agree that conversion is probably worth the extra disk space. I think that instead of copying the data when calling `load` (which would probably also want to check whether it's already been converted and just load that), it would be less surprising to just provide a utility function that converts to the best format for subsequent loading, probably a '.mmap' file, and gives back the new filename. We don't have to do any subclassing - just load the data in chunks, invert, and save to the new file.
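Something along these lines, say (a hypothetical sketch of such a utility, not the eventual PR code; for brevity it assumes a simple frames-first layout and writes a standard `.npy` memmap rather than caiman's '.mmap' layout):

```python
import os
import numpy as np

def convert_sbx(sbx_path, out_path, shape, chunk=500):
    """Chunked invert-and-save; returns the converted file's name."""
    if os.path.exists(out_path):
        return out_path  # already converted: just reuse it
    raw = np.memmap(sbx_path, dtype=np.uint16, mode='r', shape=shape)
    out = np.lib.format.open_memmap(out_path, mode='w+',
                                    dtype=np.float32, shape=shape)
    for start in range(0, shape[0], chunk):
        stop = min(start + chunk, shape[0])
        out[start:stop] = np.invert(raw[start:stop])  # cast on assignment
    out.flush()
    return out_path
```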
In the long term we've talked about doing this; it's on my longer-term todo list to always convert from every format into a predictable one (alongside a major revamp of file and path handling code).
We'll still have to decide, on a format-by-format basis, which input types should have explicit code to do such conversion and which should have external pointers (or pointers to external software that can do such conversions), but we'll cross that bridge later.
I just realized that maybe we don't have to worry about memory-mapping sbx files? Since for motion correction, the `load` function only loads the data in smaller chunks at a time anyway. And for viewing, you can always downsample or just look at a portion at a time. I was confused why many of the formats I could convert to don't get memory-mapped when loaded, and then I realized maybe they don't have to be.
Anyway, no need for further delay - I will submit a pull request with code that loads newer sbx files, along with functions to convert to tiffs since that seems to be the most versatile/well-supported format.
This sounds like a good solution for the near-to-mid term. Looking forward to the PR.
Incidentally, if you have an sbx file that you don't mind either: A) Letting us use for internal testing, or B) Including in the source tree for testing
that'd be really great because we could integrate it into the test suite (if you want to take a shot at this yourself, that'd be great, although I'm happy to do it). More automated testing makes a lot of development easier, and test data for the file formats is super helpful.
(if you can't, that's totally fine too; no pressure)
Should we mark this as closed because of Ethan's PRs?
Marking this as closed (for now) because of Ethan's contributions (which just went out with today's release).
Discussed in https://github.com/flatironinstitute/CaImAn/discussions/1298