Closed: carlosgjs closed this pull request 7 months ago.
Attention: 1 line in your changes is missing coverage. Please review.
Comparison is base (e7c86f5) 97.32% compared to head (c864e8c) 97.85%.
| Files | Patch % | Lines |
|---|---|---|
| src/autora/doc/runtime/predict_hf.py | 93.75% | 1 Missing :warning: |
Running quantized models significantly reduces the GPU memory required for inference. Instead of downloading the full model and quantizing it during load, we can quantize the model offline and save it. At runtime, the (smaller) quantized model can be loaded directly.
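A minimal sketch of the quantize-offline-then-reload flow this enables, assuming the Hugging Face `transformers` + `bitsandbytes` 4-bit path. The model id and output directory are placeholders, and this is not the exact code added by this PR (see `src/autora/doc/runtime/predict_hf.py` for the real implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id
QUANTIZED_DIR = "./llama-2-7b-chat-4bit"      # placeholder output path

# --- offline step: download the full model, quantize it on load, then save ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Saving 4-bit weights requires recent transformers/bitsandbytes releases,
# which is why this PR bumps those versions.
model.save_pretrained(QUANTIZED_DIR)
tokenizer.save_pretrained(QUANTIZED_DIR)

# --- runtime step: load the (smaller) quantized model directly ---
model = AutoModelForCausalLM.from_pretrained(QUANTIZED_DIR, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(QUANTIZED_DIR)
```

The quantization config is stored alongside the saved weights, so the runtime load picks it up without re-quantizing the full-precision model.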
This PR includes 3 changes:
- `bitsandbytes` and `transformers` versions which support quantizing (see the dependency sketch below)

Closes #4
Closes #8