Using this code, which loads the full model and quantizes it at load time. This takes about 6 minutes.
If we save a copy of the quantized model under a new name, e.g. llama-2-7b-chat-hf-4bit, the loading time should drop significantly. The easiest way to do this is probably a notebook running in the workspace, with something along these lines:
from azureml.core import Workspace

new_name = "llama-2-7b-chat-hf-4bit"

# Load and quantize the model like in the predict_hf.py code above.
model = ...

# Save it locally.
model.save_pretrained(f"./models/{new_name}")

# Double-check that the quantized model can be loaded back from disk too ...

# If all goes well, upload to blob storage:
workspace = Workspace.from_config()
ds = workspace.get_default_datastore()
ds.upload(
    src_dir=f"./models/{new_name}",
    target_path=f"./base_models/{new_name}",
    show_progress=True,
    overwrite=True,
)

# Verify the model can be loaded from blob storage by submitting a new
# prediction job with the new model. See README.md.
For reference, we currently load the model from blob storage in the ML workspace: https://autoraml3241530052.blob.core.windows.net/azureml-blobstore-b7ef477b-ca4a-44e3-a029-0e0542bdcd47/base_models/llama-2-7b-chat-hf/