Tracking task: Migrate from tensorflow to pytorch (or possibly numba)

pgunn commented 1 year ago

In the long term, tensorflow as a software package is on its way out; there hasn't been an official announcement from Google yet (to my knowledge), but Google has diverted resources away from it towards jax.

Right now, Caiman has some dependencies on Tensorflow:

utils/nn_models
The components evaluation code can use it (through keras APIs)
CNMF can also use it through keras APIs
volpy (which is an optional part of the codebase) has a vendored "mrcnn" component (itself a bundle of problems) that has a firm dependency on the keras APIs

In the mid-to-far term we need to migrate off of tensorflow to some other API with a longer future, before we fall off the end of cuda support (and, less crucially, before we suffer too much from a lack of newer features as neglect sets into Tensorflow).

In my view, the real options are:

A) Pytorch - mature, feature-rich, very widely-used and well-documented and maintained by a company with engineers generally better-behaved than Google's. Probably the easiest choice.
B) Numba - Another solid choice, much closer to numpy programming than anything else, for better and for worse.

I lean towards Pytorch, but my knowledge on this is pretty shallow; if the consensus leans towards numba, I will defer.

We may have missed some other option for what we should do; if we have, it'd be good to hear about it.

Jax is still a no-go for us; it has a fragile build environment and poor Windows support.

This task then is to decide what to migrate our current Tensorflow-using code to, develop a plan to migrate most of it, figure out how to handle the more troublesome code (volpy/mrcnn in particular, which might need to be substantially rewritten), and then actually perform these rewrites. This will likely take several months at least, and take a lot of discussion along the way.

EricThomson commented 1 year ago

Here is some of what I found when I looked into this a little a couple of weeks ago.

One, this is a lot bigger of a task than I initially thought it would be. For instance,

As you mentioned, online CNMFE has cnn based ring model all in tf -- see utils.nn_models.
There are resources for augmentation that were hand-rolled in tensorflow in utils.preprocessing_keras.py.

There are lots of nooks and crannies, and a lot of things will break.

I looked up converting models from tf to torch, and there were a few halfway decent things out there, but basically it isn't entirely trivial:

I also looked into ONNX (https://github.com/onnx/onnx), an open model format that can be used to translate models between tf and torch. We are probably right about in the zone where learning ONNX would be about as much work as just redoing things in torch. But it is something we should be aware of and consider.

Basically, converting everything will require a lot of hand coding; I don't think we can automate this. I was just covering my bases by researching that stuff. :smile:
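To give a sense of what the hand coding involves, here is a minimal sketch of one recurring step: reordering weights taken from Keras layers into the layout the equivalent PyTorch layers expect. It assumes the default Keras `channels_last` Conv2D kernel layout `(height, width, in_channels, out_channels)` and Dense kernel layout `(in_features, out_features)`. The function names are hypothetical and not part of Caiman, and the weights below are random stand-ins rather than anything from our models.

```python
import numpy as np


def keras_conv2d_kernel_to_torch(kernel):
    """Reorder a Keras Conv2D kernel (H, W, in, out) to PyTorch's (out, in, H, W)."""
    kernel = np.asarray(kernel)
    if kernel.ndim != 4:
        raise ValueError(f"expected a 4-D kernel, got shape {kernel.shape}")
    # Axis 3 (out) goes first, then axis 2 (in), then the spatial axes.
    return np.ascontiguousarray(kernel.transpose(3, 2, 0, 1))


def keras_dense_kernel_to_torch(kernel):
    """Reorder a Keras Dense kernel (in, out) to PyTorch Linear's (out, in)."""
    kernel = np.asarray(kernel)
    if kernel.ndim != 2:
        raise ValueError(f"expected a 2-D kernel, got shape {kernel.shape}")
    return np.ascontiguousarray(kernel.T)


# Stand-in weights: a 3x3 convolution with 1 input and 32 output channels.
rng = np.random.default_rng(0)
conv_keras = rng.standard_normal((3, 3, 1, 32))
conv_torch = keras_conv2d_kernel_to_torch(conv_keras)
print(conv_torch.shape)  # (32, 1, 3, 3)
```

A Dense layer that directly follows a Flatten needs extra care on top of this, since Keras flattens in (H, W, C) order while PyTorch flattens in (C, H, W) order. That kind of detail is why I doubt a fully automatic conversion.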

I agree; I don't think numba would do the trick: it sort of solves an orthogonal problem. For large-scale CNNs the main choice is probably torch or Jax, and as you said there are issues with Jax. There are other lesser-known options too, but I feel we should stick with a big community and a big support base, and stay under a large umbrella. So basically we should go to torch.

kushalkolar commented 1 year ago

Idea in addition to migration: At the workshop I spoke with Andrea about adding CNNs for other cell types/subcellular compartments for seeding CNMF, as well as for quality inspection. Since this would be a major refactor, it would be useful to consider doing the refactor such that this is doable later. Or incorporating these other networks during the refactor.

pgunn commented 1 year ago

@kushalkolar That's an interesting idea; I suspect it'd be an idea that would need someone who could commit to leading it moving forward (or possibly leading each proposed additional cell type/compartment - we might need a proof-of-concept and then domain experts for each beyond the first to make this kind of thing work).

I'd rather just leave space for something like this in the refactor rather than add it to the list of goals in the refactor, as trying to do this is already biting off quite a lot of difficult work.

(by leave space for, I mean see if we can find a way to make an extension mechanism easy, or something like that)

kushalkolar commented 1 year ago

> I'd rather just leave space for something like this in the refactor rather than add it to the list of goals in the refactor, as trying to do this is already biting off quite a lot of difficult work.

Yup, we could see how most people make these models and think about how Caiman can be interoperable with them. For initialization I think it's easy: just get a binary mask from your other CNN and feed it in. But for quality inspection maybe we could just add an argument that takes a list of arrays, where each array holds a quality metric value for each component, and a list of thresholds that correspond to each metric.
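A rough sketch of what I mean for the quality inspection part, just to make the shape of the argument concrete. The function name `accept_components` is hypothetical (nothing like it exists in Caiman right now), and it assumes every metric is "higher is better" so each threshold acts as a lower bound. It works the same on plain lists or 1-D numpy arrays.

```python
from typing import List, Sequence


def accept_components(metrics: Sequence[Sequence[float]],
                      thresholds: Sequence[float]) -> List[bool]:
    """Return one accept/reject flag per component.

    metrics[k][i] is the value of metric k for component i, and
    thresholds[k] is the minimum accepted value for metric k (assumption:
    higher is better). A component passes only if it meets every threshold.
    """
    if len(metrics) != len(thresholds):
        raise ValueError("need exactly one threshold per metric")
    if len(metrics) == 0:
        return []
    n_components = len(metrics[0])
    if any(len(m) != n_components for m in metrics):
        raise ValueError("every metric must have one value per component")
    return [
        all(m[i] >= t for m, t in zip(metrics, thresholds))
        for i in range(n_components)
    ]


# Made-up values for three components and two metrics.
snr = [3.1, 0.8, 5.0]
cnn_score = [0.95, 0.99, 0.40]
print(accept_components([snr, cnn_score], [2.0, 0.5]))  # [True, False, False]
```

The scores from any external network would then just be one more array in the list, so Caiman wouldn't need to know how they were produced.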

EricThomson commented 1 year ago

It would be nice to have a clock with this: e.g., what is a reasonable estimate of when caiman will become unworkable because of the tensorflow issues? Are we talking a year, two years, three years? E.g., at what point will tf 2.11 (I believe that is the last version that offers good Windows support) start to just not play well with the rest of the data science ecosystem? Or are the maintainers of the major libraries currently putting in accommodations because tensorflow is so important?

Something to keep in mind is we don't have to switch all at once. For instance, we could start with just the component evaluation model for CNMF. I think DeepLabCut is making the move incrementally: they currently have both torch and tensorflow as dependencies.
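As a minimal sketch of how an incremental move could look while both packages are still around: pick whichever backend is installed, preferring torch. The function name `pick_backend` and the preference order are my assumptions, not existing Caiman code, and I have not checked how DeepLabCut actually does this.

```python
import importlib.util
from typing import Callable, Optional, Sequence


def pick_backend(preferred: Sequence[str] = ("torch", "tensorflow"),
                 is_available: Optional[Callable[[str], bool]] = None) -> Optional[str]:
    """Return the first installed backend name in preference order, or None."""
    if is_available is None:
        # find_spec reports whether a top-level package is installed
        # without actually importing it.
        def is_available(name: str) -> bool:
            return importlib.util.find_spec(name) is not None
    for name in preferred:
        if is_available(name):
            return name
    return None


backend = pick_backend()
print(f"component evaluation would use: {backend or 'no deep-learning backend found'}")
```

Each piece we port (component evaluation first, say) could check this once and fall back to the old tensorflow path until its torch version is ready.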

kushalkolar commented 1 year ago

> It would be nice to have a clock with this: e.g., what is a reasonable estimate of when caiman will become unworkable because of the tensorflow issues?

I think v2.10 is the last version that works natively with Windows. Going from this table it seems like tensorflow 2.10 works with python3.10: https://www.tensorflow.org/install/source_windows#cpu

And if we look at NEP 29, numpy will stop supporting python 3.10 after December 2024. My guess is a lot of the python ecosystem follows NEP 29, for example even ipywidgets. https://numpy.org/neps/nep-0029-deprecation_policy.html

So there's some time, but not a lot. @EricThomson you were JIT compiled to caiman dev :smile:

EricThomson commented 1 year ago

tf cpu will continue to work at the very least.

kushalkolar commented 1 year ago

Ah, I linked the wrong one: https://www.tensorflow.org/install/source_windows#gpu

But it's still python3.10

flatironinstitute / CaImAn

Tracking task: Migrate from tensorflow to pytorch (or possibly numba) #1126