IBM / tensorflow-large-model-support

Large Model Support in Tensorflow
Apache License 2.0

Any chance of support for TF 2.4.0 (or later) ? #57

Open stark-toulouse opened 2 years ago

stark-toulouse commented 2 years ago

Dear authors of TFLMS,

I am one of the authors of this paper:

https://www.epj-conferences.org/articles/epjconf/abs/2021/05/epjconf_chep2021_03047/epjconf_chep2021_03047.html

The study presented in this paper depends heavily on TFLMS, and so do our follow-up studies with a real particle physics detector. At the same time, we need to move to CUDA 11, which pretty much implies migrating to TensorFlow 2.4.0 or later. Is there any chance that you might develop such a version of TFLMS?

If not, I could of course try to take your published patch for 2.2.0 and port it to 2.4.0 myself. Do you anticipate any particular or conceptual difficulties with this?

Thanks! Jan

smatzek commented 2 years ago

Your work is really interesting and I'm glad LMS was able to help out with LHC work. Unfortunately, none of the original authors have plans to release newer versions of TensorFlow LMS.

I haven't looked at the TensorFlow source code since the 2.2.0 release, so I can't speak to the specific difficulties you would encounter in upgrading the LMS patch to a 2.4.0 base. While we only published patches for 2.1 and 2.2, we started the work back in the 2.0 days. Across those two release changes (to 2.1 and 2.2), we saw changes in TensorFlow's native (C/C++ layer) eager execution that forced us to move the locations where we trigger swap-outs and swap-ins. I think we may have also seen some changes in how tensors were tagged as "needed", and thus changes to how the code knows when it's OK to swap out inactive tensors.

Any changes to the BFC allocators or CUDA mem copy specifics between 2.2.0 and 2.4.0 would also probably drive additional changes.
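For readers unfamiliar with the idea, the swap-out and swap-in behaviour described above can be illustrated with a toy model. This is only a sketch in plain Python under stated assumptions: the class `ToySwapManager`, its method names, and the least-recently-used eviction policy are all hypothetical and are not taken from the TFLMS patch, the BFC allocator, or any TensorFlow API. It shows only the general principle that tensors marked as "needed" stay on the device while inactive ones may be moved to host memory to make room.

```python
from collections import OrderedDict


class ToySwapManager:
    """Toy model: a small 'device' pool that spills inactive tensors to 'host'."""

    def __init__(self, device_capacity):
        self.device_capacity = device_capacity  # bytes on the pretend GPU
        self.device = OrderedDict()  # name -> size, least recently used first
        self.host = {}               # name -> size, tensors swapped out
        self.needed = set()          # tensors currently in use (never evicted)

    def _used(self):
        return sum(self.device.values())

    def _make_room(self, size):
        # Swap out inactive tensors, oldest first, until `size` bytes fit.
        for name in list(self.device):
            if self._used() + size <= self.device_capacity:
                break
            if name in self.needed:
                continue  # tagged as needed: not safe to swap out
            self.host[name] = self.device.pop(name)
        if self._used() + size > self.device_capacity:
            raise MemoryError("cannot free enough device memory")

    def allocate(self, name, size):
        """Place a new tensor on the device, swapping others out if required."""
        self._make_room(size)
        self.device[name] = size

    def use(self, name):
        """Tag a tensor as needed, swapping it back in if it was on the host."""
        if name in self.host:
            self._make_room(self.host[name])
            self.device[name] = self.host.pop(name)
        self.device.move_to_end(name)  # mark as most recently used
        self.needed.add(name)

    def release(self, name):
        """The tensor is no longer in use and may be swapped out again."""
        self.needed.discard(name)


manager = ToySwapManager(device_capacity=100)
manager.allocate("a", 40)
manager.allocate("b", 40)
manager.use("b")           # "b" is needed, so it must stay on the device
manager.allocate("c", 40)  # does not fit: inactive "a" is swapped out to host
print(sorted(manager.device), sorted(manager.host))
```

In a real framework, the hard part is deciding where in the execution path calls like the hypothetical `use` and `release` belong, which is why changes to eager execution or to the allocator between releases can require the patch to be reworked.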

You could also try reaching out to others who have used TensorFlow LMS to see if they are able to help with getting the code updated to 2.4.0. There are a few open and closed issues where others have talked about using it. In particular, @aviallon, who opened #54, was interested in the status of getting this merged back into the upstream/main TensorFlow code base.

I would also suggest taking a look at PyTorch and PyTorch LMS: https://github.com/IBM/pytorch-large-model-support. Depending on how much TensorFlow has changed compared with how much PyTorch has changed, it may be easier to update PyTorch LMS for the latest PyTorch release. However, that would require you to port your models from TensorFlow to PyTorch.

stark-toulouse commented 2 years ago

Thank you for your detailed reply. I will look into all of these options. PyTorch LMS is a realistic option for us, since we have already created a PyTorch version of our models / code.