Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

[RFC] Improve Lightning for production #10270

Closed · tchaton closed this 1 year ago

tchaton commented 3 years ago

🚀 Feature

Many users complain that PyTorch Lightning is unusable in production because of the extra dependencies, code, etc.

Furthermore, Lightning does a poor job of conveying the best practices for using the LightningModule in production.

The LightningModule should be seen as a vessel for training; users should think of it as a System.

Here are some proposals to improve best practices around Lightning production usage.

- [ ] Start converting examples (docs, examples) to the System format (sketched below).
- [ ] Dedicate a page to best practices for PyTorch Lightning in production.
- [ ] Improve checkpointing / reloading with the System design (see the loading sketch below).
- [ ] Provide a `pip install pytorch-lightning[prod]` extra that is as small as possible (no extra libraries such as TensorBoard).
- [ ] Introduce a System design that is not an `nn.Module` but accepts one (needs to be explored).
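To make the System format concrete, here is a minimal sketch of the pattern: the LightningModule owns the training logic, while the model stays a plain `nn.Module` passed in from outside, so only the inner module needs to ship to production. All names (`ClassifierSystem`, `backbone`) are illustrative, not a proposed API.

```python
import torch
from torch import nn
import pytorch_lightning as pl


class ClassifierSystem(pl.LightningModule):
    """A 'System': owns the training logic; the model stays a plain nn.Module."""

    def __init__(self, backbone: nn.Module, lr: float = 1e-3):
        super().__init__()
        self.backbone = backbone  # the deployable artifact
        self.lr = lr

    def forward(self, x):
        return self.backbone(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.backbone(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.backbone.parameters(), lr=self.lr)


# In production only the inner module needs to ship, e.g.:
# torch.jit.script(system.backbone).save("model.pt")
```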

Any extra bullet points are welcome.
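On the checkpointing / reloading item: one pain point today is pulling just the deployable weights out of a full Lightning checkpoint. A minimal sketch of what that looks like now, assuming the System sketch above (the filename and `MyBackbone` are hypothetical; Lightning stores the System's weights under the checkpoint's `state_dict` key):

```python
import torch

# A Lightning checkpoint is a dict; the System's weights live under "state_dict",
# with keys prefixed by the attribute name ("backbone." here).
ckpt = torch.load("epoch=9-step=1000.ckpt", map_location="cpu")
backbone_state = {
    k[len("backbone."):]: v
    for k, v in ckpt["state_dict"].items()
    if k.startswith("backbone.")
}

model = MyBackbone()  # hypothetical plain nn.Module matching the backbone
model.load_state_dict(backbone_state)
model.eval()
```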


cc @borda

lantiga commented 3 years ago

Hey @tchaton, great discussion! My 2 cents here.

I think we need to define the concept of “production” precisely. Exporting a model to ONNX or TorchScript alone is unfortunately a far cry from deploying to production in many circumstances: there is typically a good chunk of Python code you need (or need to replace) for full inference, and in some applications it may not even make sense to execute that code on the same machine.
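For reference, the “export alone” part is already a one-liner on the LightningModule via its `to_torchscript` and `to_onnx` hooks; the gap described here is everything around that call. A self-contained sketch (the tiny module, file names, and input shape are placeholders):

```python
import torch
from torch import nn
import pytorch_lightning as pl


class TinySystem(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(4, 2)

    def forward(self, x):
        return self.net(x)


system = TinySystem()

# TorchScript: scripting by default; method="trace" is also available.
torch.jit.save(system.to_torchscript(), "model.ts")

# ONNX needs an input sample unless example_input_array is set on the module.
system.to_onnx("model.onnx", input_sample=torch.randn(1, 4))
```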

One first-hand example: at Orobix, data scientists customarily train with Lightning, export to TorchScript, and run in RedisAI, but they keep their pre-processing / post-processing code (which often differs from the pre-processing used during training) in other parts of the system. And the system is typically composed of more than one machine at the edge, so TorchScript runs on the expensive (e.g. GPU) machine, while pre-processing may execute on another machine. This is where something like RedisAI is handy: you have a microservice or a physical appliance that runs all the inferences as fast as possible, without blocking on pre- and post-processing, which can be done elsewhere.

This is of course specific to that context, but not too different from certain cloud deployments: if you had a V100 running inference in the cloud, you wouldn't want cheap CPU code to interfere with it. So I think the concept of production-ready needs to be better qualified in general. I would be cautious about revolutionizing the API before we have the problem super clear.
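To make the split concrete, a minimal sketch of the inference side only, assuming pre-/post-processing lives on another machine and tensors arrive already prepared (the RedisAI wiring itself is omitted; the file name is a placeholder):

```python
import torch

# The GPU appliance loads nothing but the exported TorchScript artifact;
# no Lightning, no training dependencies.
model = torch.jit.load("model.ts", map_location="cuda")
model.eval()


@torch.inference_mode()
def infer(batch: torch.Tensor) -> torch.Tensor:
    # Input arrives pre-processed from elsewhere; post-processing also
    # happens on another machine, so this box only runs the model.
    return model(batch.to("cuda", non_blocking=True)).cpu()
```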

As for the “System” pattern: in Lightning we are indeed conflating the model (the nn.Module) and what is around it (the System). This is what is weird about Lightning when you first approach it, but it's also really convenient when you are just starting out: everything is in one place and you don't need to worry about getting the full picture, which can be daunting at the beginning.

IMO a good way to go would be to maintain backward compatibility by making an nn.Module-based mixin that is added by default to the Lightning “system” class when you instantiate a LightningModule (I personally don't like the word “system”; intuitively, the “system” in Lightning is the Trainer, but I don't have a better idea right now). So the LightningModule stays what it is, but there is also a Lightning non-Module you can use if you need it, e.g. if you need to instantiate your own nn.Module outside and pass it to the constructor.
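Purely to illustrate the mixin idea, a rough sketch; all names here (`LightningLogicMixin`, `LightningSystem`) are hypothetical, not a proposed API:

```python
from torch import nn


class LightningLogicMixin:
    """Training hooks, logging, optimizer config — carries no parameters."""

    def training_step(self, batch, batch_idx):
        raise NotImplementedError

    def configure_optimizers(self):
        raise NotImplementedError


# Backward compatible: today's LightningModule stays an nn.Module.
class LightningModule(LightningLogicMixin, nn.Module):
    pass


# The non-Module variant: the user builds the nn.Module outside and
# passes it in, keeping the model free of Lightning at export time.
class LightningSystem(LightningLogicMixin):
    def __init__(self, model: nn.Module):
        self.model = model
```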

+1 for checkpointing and reloading, as well as for the small-footprint packaging option.