AutoResearch / autodoc


Save the 4-bit quantized version of the Llama2 7b model in blob storage #4

Closed: carlosgjs closed this issue 7 months ago

carlosgjs commented 9 months ago

We currently load the model from the blob storage in the ML workspace: https://autoraml3241530052.blob.core.windows.net/azureml-blobstore-b7ef477b-ca4a-44e3-a029-0e0542bdcd47/base_models/llama-2-7b-chat-hf/

We use this code, which loads the full model and quantizes it at load time; loading takes about 6 minutes.
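
For reference, the load-time quantization presumably looks roughly like the sketch below (assuming the standard transformers + bitsandbytes 4-bit path; the exact arguments in predict_hf.py may differ, and the model path is a placeholder):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization settings (NF4 + bf16 compute are common choices; adjust to match predict_hf.py)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize while loading the full-precision weights (placeholder path for the mounted blob location)
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/base_models/llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)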

If we save a quantized copy of the model under a new name, e.g. llama-2-7b-chat-hf-4bit, loading time should decrease significantly.

The easiest way to do this is probably a notebook that runs in the workspace, with something along these lines:

from azureml.core import Workspace

new_name = "llama-2-7b-chat-hf-4bit"
model = ...  # load and quantize the model as in the predict_hf.py code above

# save it locally
model.save_pretrained(f"./models/{new_name}")

# double check that the quantized model can be loaded too ...

# If all goes well, upload to blob storage:
workspace = Workspace.from_config()
ds = workspace.get_default_datastore()
ds.upload(f"./models/{new_name}", f"./base_models/{new_name}", show_progress=True, overwrite=True)

# verify the model can be loaded from blob storage by submitting a new prediction job with the new model. See README.md
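
For the "double check" step above, a minimal reload sketch (assuming the installed transformers/bitsandbytes support 4-bit serialization; see the comment below):

from transformers import AutoModelForCausalLM

# Reload the serialized 4-bit model from the local save directory;
# the quantization settings are read back from the saved config.json
reloaded = AutoModelForCausalLM.from_pretrained(f"./models/{new_name}", device_map="auto")
print("quantization_config present:", hasattr(reloaded.config, "quantization_config"))
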
carlosgjs commented 7 months ago

4-bit serialization needs transformers 4.37 according to https://github.com/TimDettmers/bitsandbytes/pull/753
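
A quick guard along these lines could be added before the save step (the 4.37 minimum is taken from the PR above; the matching bitsandbytes minimum is not pinned here):

from packaging import version
import transformers

# 4-bit (bitsandbytes) serialization requires transformers >= 4.37 per the linked PR
assert version.parse(transformers.__version__) >= version.parse("4.37.0"), (
    f"transformers {transformers.__version__} is too old for 4-bit serialization; upgrade to >= 4.37"
)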