Ideas about storage engine

kushalkolar commented 2 years ago

Hi Pat, it was great to meet you a few weeks ago at the workshop! I had some ideas about the storage engine; this is just brainstorming. I don't know how this would apply to the online algorithms (OnACID etc.) since I've never used them.

All files associated with a run of an algorithm go into their own dir, which could be named something like <orig_filename>_<algo>_<timestamp>.
1. This could happen at the initial memmap creation/saving stage; I'm thinking of the caiman.save_memmap() call you make before CNMF. This is already implemented for mcorr if you use CAIMAN_TEMP: the file at MotionCorrect.mmap_file ends up in the CAIMAN_TEMP dir.
  - In this case those methods would have to create the output dir
2. Another option is to create an output dir every time caiman is imported; however, things would again get overwritten when using mcorr.
3. The user could manually call a function set_global_output_dir(<path>) that sets the CAIMAN_TEMP env variable, after which all output gets put in that dir. They can just call the function as many times as they want. This potentially gives users the most flexibility? I think this is my favorite idea so far.
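A rough sketch of what option 3 plus the naming scheme could look like. Nothing here exists in caiman: set_global_output_dir() is only the name proposed above, make_run_dir_name() and the timestamp format are made up for illustration, and whether caiman picks up a CAIMAN_TEMP change made after import is an assumption that would need checking.

```python
import os
from datetime import datetime
from pathlib import Path


def make_run_dir_name(orig_filename, algo, when=None):
    """Build '<orig_filename>_<algo>_<timestamp>' (hypothetical helper)."""
    when = when or datetime.now()
    stem = Path(orig_filename).stem  # drop the directory and the extension
    return f"{stem}_{algo}_{when:%Y%m%d-%H%M%S}"


def set_global_output_dir(path):
    """Create `path` if needed and point CAIMAN_TEMP at it (hypothetical)."""
    out = Path(path).expanduser().resolve()
    out.mkdir(parents=True, exist_ok=True)
    # Assumption: later caiman calls read this env var when choosing
    # where to write memmaps.
    os.environ["CAIMAN_TEMP"] = str(out)
    return out
```

Usage would be something like set_global_output_dir(Path("/some/parent") / make_run_dir_name("movie.tif", "cnmf")) before each run, so every run gets a fresh dir.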

With regard to downstream compatibility with mesmerize-core: the outputs of a batch item, which is a single run of an algorithm on a single movie, get organized in a single dir. If there are multiple runs on the same input movie, each run has its own output dir.

For mcorr, I currently use the CAIMAN_TEMP env var to store the mcorr memmap outputs. Correlation and PNR images are saved using np.save() to the output dir.

For CNMF, the hdf5 output files are saved to the output dir using cnmf.CNMF.save(), just like you would use it in a notebook. I've tried to make it as close as possible. The CNMF C-order memmap file is created using caiman.save_memmap(), and after the CNMF run is finished it's just moved to the output dir. Correlation and PNR images are again just saved using np.save().
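To make the "save images, then move the memmap" step concrete, here is a minimal sketch. It is not the actual mesmerize-core code: collect_outputs() and the file names cn.npy and pnr.npy are made up for illustration. Only np.save() and shutil.move() are real calls.

```python
import shutil
from pathlib import Path

import numpy as np


def collect_outputs(output_dir, memmap_path, cn_image, pnr_image):
    """Gather one run's outputs in a single dir (hypothetical helper)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Correlation and PNR images are plain arrays, so np.save() is enough.
    np.save(out / "cn.npy", cn_image)
    np.save(out / "pnr.npy", pnr_image)

    # The memmap was written elsewhere (e.g. CAIMAN_TEMP); move it next to
    # the other outputs, keeping its original file name.
    src = Path(memmap_path)
    return Path(shutil.move(str(src), str(out / src.name)))
```

The returned path is the memmap's new location, which is what downstream code would need to reopen the file.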

pgunn commented 2 years ago

Some of this aligns with what I have in mind; it'd be nice to try to be smart about when a directory can be reused and when it can't, perhaps by hashing the parameters. That way, if someone starts another run with different params, they get a different directory. Hopefully a full design that does a good enough job at everything will crystallise soon.

kushalkolar commented 2 years ago

I like the hashing idea! Maybe implementing __hash__ for CNMFParams is a starting point?
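One caveat with __hash__ is that Python's built-in hash() of strings is randomized per process by default, so it would not give the same directory name across sessions. A content digest might be a safer starting point. The sketch below assumes the params can be turned into a plain nested dict; params_digest() is a made-up name and this does not touch the real CNMFParams class.

```python
import hashlib
import json


def params_digest(params, length=12):
    """Stable short digest of a params dict (hypothetical helper).

    Keys are sorted so that insertion order does not change the result.
    Values that JSON cannot encode fall back to str(), which is crude
    (e.g. large arrays are abbreviated) and would need a proper encoder.
    """
    payload = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]
```

The digest could then be appended to the run dir name, so a rerun with identical params maps to the same dir and any param change maps to a new one.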

flatironinstitute / CaImAn

Ideas about storage engine #988