Using this code, which loads the full model and quantizes it at load time. This takes about 6 minutes.
If we save a copy of the quantized model under a new name, e.g. llama-2-7b-chat-hf-4bit, the loading time should drop significantly. The easiest way to do this is probably a notebook running in the workspace, with something along these lines:
from azureml.core import Workspace

new_name = "llama-2-7b-chat-hf-4bit"

# Load and quantize the model like in the predict_hf.py code above.
model = ...

# Save it locally.
model.save_pretrained(f"./models/{new_name}")

# Double-check that the quantized model can be loaded back from disk too ...

# If all goes well, upload to blob storage:
workspace = Workspace.from_config()
ds = workspace.get_default_datastore()
ds.upload(
    src_dir=f"./models/{new_name}",
    target_path=f"./base_models/{new_name}",
    show_progress=True,
    overwrite=True,
)

# Verify the model can be loaded from blob storage by submitting a new
# prediction job with the new model. See README.md.
For reference, we currently load the model from blob storage in the ML workspace: https://autoraml3241530052.blob.core.windows.net/azureml-blobstore-b7ef477b-ca4a-44e3-a029-0e0542bdcd47/base_models/llama-2-7b-chat-hf/