Worth thinking about; one downside is that sbxreader doesn't have conda packaging; I'm generally reluctant to suggest users add anything to their conda environment with pip, because pip can break conda-managed packages (and we'd never want anything like that in a notebook). Still, if our support is way out of date, we need to do something about it.
Maybe a solution will come to mind for this.
Good point - and it sounds like this is a concern whether sbxreader was just recommended in the docs or included as a dependency and used for loading (which would be another option).
I wonder if the maintainer would be open to adding sbxreader to conda-forge. Worth reaching out?
I was going to say "don't worry about pip/conda, it usually is fine, it's only a problem with things like opencv and pyqt" and then I looked into their code and found things like this:
```python
try:
    import cv2
except Exception as err:
    print('cv2 not installed, trying to install dependencies with pip ')
    from subprocess import call
    call('pip install opencv-python pyqtgraph pyqt5',shell = True)
    import cv2
```
This unfortunately has a reasonable potential to wreak havoc with an innocent conda environment. 😬
One thing I'm not sure about @ethanbb: do we only need the `loadmat.py` part of their package? If so, I would suggest we could adapt their code into Caiman (with proper attribution of course), but it looks like their code is GPL-3 and ours is GPL-2, so it may not be permissible (unless ours is GPL-2+; it doesn't seem to be, but this is something to look into more).
Ooh yeah that's a bit aggressive, but I guess an easy fix is to just raise an error in the except block instead (and not require cv2 as a package dependency, as it's not required for the main functionality). This could at least be done in the conda version, if the maintainer is willing to submit it to conda-forge. Or alternatively, we could propose the change in the main repository and then once that's merged, recommend installing sbxreader using pip, if it's unlikely to cause an issue without that.
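For concreteness, a minimal sketch of that kind of fix (my wording, not actual sbxreader code) might look like:

```python
# Sketch of the proposed fix: fail loudly instead of shelling out to pip.
try:
    import cv2
except ImportError as err:
    raise ImportError(
        'The sbxreader viewer requires cv2, pyqtgraph and pyqt5; '
        'please install them with your package manager of choice.'
    ) from err
```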
I'm pretty sure I never want to point a user at pip-only packages, and any software package that would even think about calling out to pip is a strong nope.
I think I see two likely routes forward:
1. We try to modernise our support for sbx; this might not be that hard, and if there are enough users who need it this could make things easy for them.
2. We remove sbx support and provide some instructions or hints for people to use external software (not in the caiman conda env) to convert their stuff first.
Either of these is reasonable; it's probably too high effort to try to work with the sbxreader people to fix their code and get it onto conda-forge (I don't know if we really have a lot of sbx users; I suspect not, if this is the first we've heard of it). They'd at least need to remove any callouts to pip, and that code snippet above suggests the code quality of the package is not high.
Are there other good conversion tools out there to pursue route 2?
@pgunn while I have no particular allegiance to sbxreader (I don't know the author), I'd encourage you to take a look at the code - it's only about 400 lines, half of which is the viewer, which would not be needed for caiman. I believe that excerpt is the only questionable bit, and it is only used for the viewer.
That being said, I understand the reluctance to rely on another unknown developer to do a thing and keep it maintained. If you wanted to pursue route 1 instead, it wouldn't be too much more work for me to edit my caiman feature branch that has my existing implementation to use a custom subclass of `np.memmap`, just like sbxreader, so that files can be read without fully loading into memory. (The custom subclass is needed because the values have to be processed during reading in order to get them into the right range.) It would essentially be copying the idea from their code, which I feel somewhat ambivalent about ethically, but I can at least cite them in a comment.
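For reference, the idea is roughly the following (a hypothetical sketch, not sbxreader's actual class; the class name and constant are mine):

```python
import numpy as np

UINT16_MAX = np.uint16(2**16 - 1)  # sbx stores inverted uint16 values

class InvertedSbxMemmap(np.memmap):
    """Sketch: a memmap whose reads come back inverted into the right range."""
    def __getitem__(self, index):
        raw = super().__getitem__(index)
        # Invert on the way out; note this only intercepts explicit indexing,
        # which is part of why this approach later turns out to be brittle.
        return UINT16_MAX - np.asarray(raw)
```

Instances would be constructed just like `np.memmap` itself (e.g. `InvertedSbxMemmap('file.sbx', dtype=np.uint16, mode='r', shape=...)`), since `np.memmap.__new__` handles subclasses.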
I'll give that codebase a look (although I'm starting out skeptical; every dependency has a maintenance cost).
I wouldn't worry about copying ideas like that; this isn't even slightly rare in software development and it's not that novel of an idea anyhow. We'd need to know if it's something maintainable though.
In the meantime, we should also do at least a cursory exploration of what I called option 2 above; if there are well-known tools to convert sbx files to a more common format, it might be viable to just point people at them. We could even consider writing a minimal converter ourselves; it might have less code complexity to do a one-time conversion upfront, as the access patterns would be very easy to predict.
Ah I didn't register that you were talking about converting and saving as a new file instead of reading the sbx file directly. Agree that could be a good option (although it is costly in disk space). There is an hdf5 converter included in Scanbox that seems to work well with ImageJ, but I don't think it supports multiple planes at present.
(And yeah I was more thinking about the fact that my code would probably look like a copy of theirs, but that doesn't have to literally be true)
That may be the best route; I don't know how common multi-plane images are for our users (although I suspect sbx users are pretty rare in general). We support hdf5 as an input format, so if scanbox itself can save in hdf5 (and multi-plane isn't an issue), then we just need to tell users to do that. In this case we'd probably want to remove the current (dated) sbx support to avoid confusing people.
@ethanbb I was just taking a look at the reader code and it actually doesn't do anything special, it just directly memmaps the file, which seems to be stored as an ordinary `uint16` array on disk.
I don't understand why there are 5 dimensions though: https://github.com/jcouto/sbxreader/blob/main/sbxreader/reader.py#L143
Also, it seems like it's stored in `F` order, so it might be possible to directly use these files for motion correction.
@kushalkolar I was just referring to the fact that it defines this `sbx_memmap` class that overrides the `__getitem__` function to subtract values from `UINTMAX` when reading. It is pretty straightforward.
The 5 dimensions are channels, X, Y, Z, and time (in that order, Fortran order, so channel changes most frequently, then x position, etc.). Then this package reshapes it to (time, Z, channel, Y, X), but that is somewhat arbitrary.
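To illustrate the layout (the dimensions here are made up; the real values come from the sbx metadata):

```python
import numpy as np

n_chan, n_x, n_y, n_z, n_frames = 1, 512, 796, 4, 1000  # hypothetical sizes
raw = np.memmap('example.sbx', dtype=np.uint16, mode='r', order='F',
                shape=(n_chan, n_x, n_y, n_z, n_frames))
# Reorder axes from (channel, X, Y, Z, time) to (time, Z, channel, Y, X),
# which is the shape sbxreader hands back:
movie = raw.transpose(4, 3, 0, 2, 1)
```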
Is the data stored such that the highest value in the file maps to 0 and the lowest value in the file maps to 2^16?
@pgunn I think multi-plane acquisition is pretty common for Scanbox users - it's a prominent feature for the microscope and software. If this is less of a focus for caiman, which I suspected, most users would probably just stick to suite2p, and maybe that's fine although it would be nice to be able to easily compare the different cell detection algorithms offered by different packages.
I could ask Dario from Neurolabware to update the HDF5 conversion function to take volumes into account (I believe putting Z in the 4th dimension is what caiman expects?) It would be a little more painful for users than loading sbx files directly but yeah not a terrible solution.
@kushalkolar yes (rather, 2^16-1). Don't ask me why.
2^16 - 1 actually makes more sense than 2^16, as the former is the halfway point for a 32-bit wide range of values (because the first value is 0).
(edit: this isn't quite correct, but the -1 is there because of 0)
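To spell out the arithmetic: `uint16` covers `[0, 2**16 - 1]`, so the mapping is `stored = 65535 - true`, sending 0 to 65535 and vice versa. That happens to be exactly a bitwise NOT, which a quick numpy check confirms:

```python
import numpy as np

x = np.array([0, 1, 1000, 65535], dtype=np.uint16)
# For uint16, bitwise NOT is the same as (2**16 - 1) - x:
assert np.array_equal(np.invert(x), (2**16 - 1) - x)
print(np.invert(x))  # [65535 65534 64535     0]
```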
Volumetric data is an important use case. Sometimes it falls through the cracks but please keep letting us know when this happens: it may not be fast but we can work with you to get it sorted.
@EricThomson thank you, will do!
Ah ok. Well since motion correction is going to produce its own `float32` memmap anyways, the faster route to go might be to directly convert your sbx files to `float32` memmaps with that subtraction; that's what I would do.
To elaborate, I think all you would need to do is:
- Use `__getitem__` to return the inverted vals for a given slice (similar to theirs); `numpy.invert` will be much faster than subtraction though, because it's just direct bit manipulation (probably fewer clock cycles in general).
- Create a `float32` numpy memmap file with shape `[n_pixels, n_frames]`, I think; I recommend looking into the `save_memmap()` code, but I don't have the bandwidth to dig into this. Write each frame into that memmap from the input movie by flattening it.
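A rough sketch of that recipe (the shapes, filenames, and single-plane layout are assumptions for illustration, not caiman's actual `save_memmap()` code):

```python
import numpy as np

n_frames, n_y, n_x = 1000, 512, 796              # made-up dimensions
raw = np.memmap('example.sbx', dtype=np.uint16, mode='r',
                shape=(n_frames, n_y, n_x))      # simplified sbx layout
out = np.memmap('example.mmap', dtype=np.float32, mode='w+',
                shape=(n_y * n_x, n_frames))     # [n_pixels, n_frames]
for t in range(n_frames):
    # bitwise NOT == (2**16 - 1) - x for uint16; widen to float32 on write
    out[:, t] = np.invert(raw[t]).ravel().astype(np.float32)
out.flush()
```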
@kushalkolar Thanks for the tips. It definitely makes sense to use `numpy.invert`; that's an oversight on their part.
As for saving an intermediate `float32` memmap to feed into motion correction, I'm not sure it's necessary? I am just familiarizing myself with this code, but from what I can tell, the motion correction code loads the movie in pieces, the number of which can be customized to avoid running out of memory. It looks like the actual motion correction step makes a `float64` copy of each piece, no?
```python
logging.debug("Starting tile and correct")
img = img.astype(np.float64).copy()
```
So even if `img` were a `float64` memmap, the data would be copied anyway. Seems simpler to just load the SBX file as a custom memmap with dtype `uint16` (without saving a copy), and each piece will get converted and copied in this step. Very possible I am missing something, though.
@ethanbb the wrinkle is that caiman converts to float because we do lots of floating point operations for motion correction and source extraction. Downside: why the heck is this file so big? Upside: more precision. The cost/benefit here was: hard drive space is cheap, RAM is expensive. Is it a perfect system? 🤷‍♂️ Probably not. 😬 Does it work pretty well? We hope so. If you search through our issue/discussion queue this comes up a lot.
Alright, after trying to implement a container that automatically inverts the data by subclassing np.memmap, I've realized this approach is a lot more limited and brittle than it initially seemed, and that sbxreader also suffers from this in some alarming ways. I would have to go pretty low level to avoid the object being treated as or converted to a normal ndarray without the inversion taking place. And even if it's possible, a lot of operations are probably much slower without direct access to a C array with the actual correct data.
I now agree that conversion is probably worth the extra disk space. I think that instead of copying the data when calling `load` (which would probably also want to check whether it's already been converted and just load that), it would be less surprising to just provide a utility function that converts to the best format for subsequent loading, probably a '.mmap' file, and gives back the new filename. We don't have to do any subclassing - just load the data in chunks, invert, and save to the new file.
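Something along these lines, say (a hypothetical sketch of such a utility, not the eventual PR code; for brevity it assumes a simple frames-first layout and writes a standard `.npy` memmap rather than caiman's '.mmap' layout):

```python
import os
import numpy as np

def convert_sbx(sbx_path, out_path, shape, chunk=500):
    """Chunked invert-and-save; returns the converted file's name."""
    if os.path.exists(out_path):
        return out_path  # already converted: just reuse it
    raw = np.memmap(sbx_path, dtype=np.uint16, mode='r', shape=shape)
    out = np.lib.format.open_memmap(out_path, mode='w+',
                                    dtype=np.float32, shape=shape)
    for start in range(0, shape[0], chunk):
        stop = min(start + chunk, shape[0])
        out[start:stop] = np.invert(raw[start:stop])  # cast on assignment
    out.flush()
    return out_path
```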
In the long term we've talked about doing this; it's on my longer-term todo list to always convert from every format into a predictable one (alongside a major revamp of file and path handling code).
We'll still have to decide, on a format-by-format basis, which input types should have explicit code to do such conversion and which should have external pointers (or pointers to external software that can do such conversions), but we'll cross that bridge later.
I just realized that maybe we don't have to worry about memory-mapping sbx files? Since for motion correction, the `load` function only loads the data in smaller chunks at a time anyway. And for viewing, you can always downsample or just look at a portion at a time. I was confused why many of the formats I could convert to don't get memory-mapped when loaded, and then I realized maybe they don't have to be.
Anyway, no need for further delay - I will submit a pull request with code that loads newer sbx files, along with functions to convert to tiffs since that seems to be the most versatile/well-supported format.
This sounds like a good solution for the near-to-mid term. Looking forward to the PR.
Incidentally, if you have an sbx file that you don't mind either: A) Letting us use for internal testing, or B) Including in the source tree for testing
that'd be really great because we could integrate it into the test suite (if you want to take a shot at this yourself, that'd be great, although I'm happy to do it). More automated testing makes a lot of development easier, and test data for the file formats is super helpful.
(if you can't, that's totally fine too; no pressure)
Should we mark this as closed because of Ethan's PRs?
Marking this as closed (for now) because of Ethan's contributions (which just went out with today's release).
Discussed in https://github.com/flatironinstitute/CaImAn/discussions/1298