AutoResearch / autodoc


feat: Publish/load pre-quantized models #34

Closed by carlosgjs 7 months ago

carlosgjs commented 7 months ago

Running quantized models significantly reduces the GPU memory required for inference. Instead of downloading the full model and quantizing it at load time, we can quantize the model offline and save it; at runtime, the (smaller) pre-quantized model can then be loaded directly, as sketched after the change list below.

This PR includes 3 changes:

Closes #4
Closes #8
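
For context, here is a minimal sketch of what the quantize-offline / load-at-runtime flow might look like with Hugging Face transformers and bitsandbytes. The model ID, output path, and 4-bit settings are illustrative assumptions, not details taken from this PR, and serializing quantized weights requires a transformers version that supports it:

```python
# Sketch of the offline quantize-and-save flow described above, assuming
# Hugging Face transformers + bitsandbytes. The model ID and output path
# below are hypothetical placeholders, not values from this PR.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical base model
quantized_dir = "./llama-2-7b-chat-4bit"      # hypothetical output path

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Offline step: download the full model once and quantize it during load.
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Persist the quantized weights and config so runtime can skip the full
# download; model.push_to_hub(...) would publish them to the Hub instead.
model.save_pretrained(quantized_dir)
tokenizer.save_pretrained(quantized_dir)

# Runtime step: load the smaller pre-quantized checkpoint directly; the
# stored quantization config is picked up automatically.
model = AutoModelForCausalLM.from_pretrained(quantized_dir, device_map="auto")
```

The runtime saving comes from skipping both the full-precision download and the in-memory quantization pass; only the quantized checkpoint is fetched and loaded.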

codecov-commenter commented 7 months ago

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison: base (e7c86f5) 97.32% vs. head (c864e8c) 97.85%.

| Files | Patch % | Lines |
| --- | --- | --- |
| src/autora/doc/runtime/predict_hf.py | 93.75% | 1 Missing :warning: |

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main      #34      +/-   ##
==========================================
+ Coverage   97.32%   97.85%   +0.53%
==========================================
  Files           5        5
  Lines         224      233       +9
==========================================
+ Hits          218      228      +10
+ Misses          6        5       -1
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.