Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

New feature of adding Intel® Extension for Transformers weight-only quantization into Lightning Fabric API #18770

Open

yuwenzho opened this issue 9 months ago

Description & Motivation

Hello,

We are the team developing Intel® Extension for Transformers (ITREX). We would like to discuss a quantization feature that connects our two projects.

First, a brief introduction to both projects: ITREX is an Intel toolkit for accelerating Transformer-based models on Intel platforms, with a focus on low-bit quantization, and Lightning Fabric is the PyTorch Lightning API for scaling models with minimal code changes.

We would like to integrate ITREX into the PyTorch Lightning Fabric API. This integration would add an INT8/INT4/FP4/NF4 weight-only quantization feature, where model weights are stored at low precision while activations remain in floating point.
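
To make weight-only quantization concrete, here is a minimal pure-PyTorch illustration of the general technique (a sketch only, not the ITREX implementation): each weight row is rounded to int8 with its own scale and dequantized on the fly at compute time.

import torch

def quantize_weight_int8(w: torch.Tensor):
    # Symmetric per-output-channel int8 quantization: one scale per row.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(64, 128)
q, scale = quantize_weight_int8(w)
w_hat = q.to(torch.float32) * scale  # dequantize before (or fused into) the matmul
print((w - w_hat).abs().max())       # small reconstruction error

The 4-bit variants (INT4/FP4/NF4) follow the same store-low-precision, compute-high-precision pattern with different codebooks and packing.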

We would like to ask whether there is an opportunity for us to contribute in this regard.

Thanks

Pitch

Here is a simple use case:

from lightning.fabric import Fabric
from lightning.fabric.plugins import ITREXPrecision

# mode: Literal["int8", "int4_fullrange", "int4_clip", "nf4", "fp4_e2m1"]
precision = ITREXPrecision(mode="int8")
fabric = Fabric(plugins=precision)
model = MyModel()
model = fabric.setup(model)
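
For a sense of how such a plugin could hook into Fabric, here is a rough sketch modeled on Fabric's existing precision plugins such as BitsandbytesPrecision; the helper itrex_quantize_linear is hypothetical and stands in for whatever ITREX API actually builds a quantized Linear:

import torch
from lightning.fabric.plugins.precision import Precision

class ITREXPrecision(Precision):
    """Sketch: replace eligible Linear layers with weight-only-quantized ones."""

    def __init__(self, mode: str = "int8") -> None:
        self.mode = mode

    def convert_module(self, module: torch.nn.Module) -> torch.nn.Module:
        # Walk the module tree and swap each Linear for a quantized version.
        for name, child in module.named_children():
            if isinstance(child, torch.nn.Linear):
                # itrex_quantize_linear is a hypothetical helper, not a real ITREX function.
                setattr(module, name, itrex_quantize_linear(child, mode=self.mode))
            else:
                self.convert_module(child)
        return module

Fabric applies the plugin's convert_module during fabric.setup(model), which is where the Linear-to-quantized swap would happen.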

For more details on ITREX 4-bit quantization, please refer to the Medium blog post on Intel-Optimized Llama.CPP.

Alternatives

No response

Additional context

No response

cc @borda

yuwenzho commented 7 months ago

Related PR #19125

awaelchli commented 6 months ago

Hello @yuwenzho, thank you for opening the pull request. We discussed this internally and want to propose doing this integration in an external plugin repository first, instead of moving it into core directly. The idea is that we create a separate repository with you as co-maintainer, for example "lightning-intel" (actual name to be decided), and we provide an easy way for the user to enable it, like so:

# Optional dependency
pip install lightning-intel

# In user code:
from lightning import Trainer
from lightning_intel import ITREXPrecision

trainer = Trainer(..., precision=ITREXPrecision(mode=...))

We have followed the same approach with other partners, such as Habana:
https://github.com/Lightning-AI/lightning-Habana
https://lightning.ai/docs/pytorch/stable/integrations/hpu/intermediate.html

A benefit for you is that you could iterate on the integration faster outside of Lightning core, for example when a new version of your backend is released.

Thanks, and we would love to hear your thoughts.

cc @lantiga @carmocca @Borda

ftian1 commented 6 months ago

@awaelchli Many thanks for your valuable input. We are evaluating the feasibility and will get back to you once we have a conclusion.

yuwenzho commented 4 months ago

@awaelchli Sorry for the late reply. Below is our RFC. We appreciate your thoughts and feedback; please feel free to share any comments or suggestions you have.

Our expected external repository name: Lightning-AI/lightning-Intel

Design Detail

Directory layout:

lightning-Intel
└───src
    └───lightning_intel
        ├───fabric
        │   └───plugins
        │       ├───precision.py  # customized WOQ precision
        │       └───io_plugin.py  # checkpoint save/load
        │
        └───pytorch
            └───plugins    
                ├───precision.py  # customized WOQ precision
                └───io_plugin.py  # checkpoint save/load
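
As an illustration of what the io_plugin.py files could contain, here is a minimal sketch built on Lightning's TorchCheckpointIO; the woq_config metadata key is hypothetical:

from lightning.fabric.plugins.io import TorchCheckpointIO

class INCCheckpointIO(TorchCheckpointIO):
    """Sketch: persist weight-only-quantization metadata with the checkpoint."""

    def save_checkpoint(self, checkpoint, path, storage_options=None):
        # Hypothetical: record the quantization mode so that loading can rebuild
        # the quantized modules before restoring their packed weights.
        checkpoint.setdefault("woq_config", {"mode": "int4"})
        super().save_checkpoint(checkpoint, path, storage_options=storage_options)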

Usage Demo:

# Usage with Lightning Fabric
from lightning.fabric import Fabric
from lightning_intel.fabric.plugins import INCPrecision

fabric = Fabric(plugins=INCPrecision(mode="int4"))
model = MyModel()
model = fabric.setup(model)

# Usage with PyTorch Lightning
from lightning import Trainer
from lightning_intel.pytorch.plugins import INCPrecision
from torch.utils.data import DataLoader

model = MyModel()
trainer = Trainer(plugins=INCPrecision(mode="int4"))
trainer.fit(model=model, train_dataloaders=DataLoader(train_set))
predictions = trainer.predict(model, dataloaders=DataLoader(pred_set))

yuwenzho commented 3 months ago

Hi @awaelchli, could you please help us create the Lightning-AI/lightning-Intel repo so that we can start our code contributions? cc @hshen14