coreweave / tensorizer

Module, Model, and Tensor Serialization/Deserialization
MIT License

feat(azure): Preliminary support for Azure #118

Open wbrown opened 6 months ago

wbrown commented 6 months ago

This PR adds Azure Blob Storage support to tensorizer. Both serialization and deserialization work, closely mirroring the existing S3 support:

Authentication is handled by the DefaultAzureCredential class, which picks up credentials through a chain of automatic mechanisms; the one tested so far is EnvironmentCredential.
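For context, EnvironmentCredential resolves a service principal from a fixed set of environment variables, so no credentials need to appear in code. The sketch below uses the variable names azure-identity documents for a client secret; the values are placeholders, and the azure.identity import itself is omitted so the snippet stays self-contained:

```python
import os

# Placeholder service-principal settings; EnvironmentCredential looks for
# these exact variable names in the process environment.
os.environ["AZURE_TENANT_ID"] = "00000000-0000-0000-0000-000000000000"
os.environ["AZURE_CLIENT_ID"] = "my-client-id"
os.environ["AZURE_CLIENT_SECRET"] = "my-client-secret"

# With these set, DefaultAzureCredential (from azure.identity) would resolve
# to EnvironmentCredential with no further configuration.
for name in ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET"):
    print(name, "is set:", name in os.environ)
```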

Usage is straightforward: provide an azure:// URI in the form azure://account/container/blob and tensorizer takes it from there.

import tensorizer.serialization as serialization
from transformers import AutoModelForCausalLM
import azure.core.exceptions

model = AutoModelForCausalLM.from_pretrained("eleutherai/gpt-neo-125m")
print("Model loaded.")

# Serialize the model weights directly to Azure Blob Storage.
serializer = serialization.TensorSerializer(
    "azure://test/data/gpt-neo-125m",
)
try:
    serializer.write_module(model)
    serializer.close()
    print("Done serializing to Azure!")
except azure.core.exceptions.ResourceExistsError:
    print("Resource already exists.")

# Load the weights back from the same blob, verifying tensor hashes.
deserializer = serialization.TensorDeserializer(
    "azure://test/data/gpt-neo-125m",
    verify_hash=True,
)
deserializer.load_into_module(model)
print("Model deserialized from Azure!")
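For illustration only, the azure://account/container/blob scheme can be split with the standard library; this is a hypothetical helper sketching the URI shape, not the PR's actual parsing code:

```python
from urllib.parse import urlparse

def parse_azure_uri(uri: str):
    # Hypothetical helper: split azure://account/container/blob into its parts.
    parsed = urlparse(uri)
    if parsed.scheme != "azure":
        raise ValueError(f"not an azure:// URI: {uri}")
    account = parsed.netloc
    container, _, blob = parsed.path.lstrip("/").partition("/")
    if not (account and container and blob):
        raise ValueError(f"expected azure://account/container/blob, got: {uri}")
    return account, container, blob

print(parse_azure_uri("azure://test/data/gpt-neo-125m"))
# -> ('test', 'data', 'gpt-neo-125m')
```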

This PR is not yet complete -- test cases still need to be written, which is complicated considerably by the lack of an Azure equivalent of moto, the library used to mock AWS interfaces.
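Pending a proper mocking library, one stopgap is to stub a blob client with unittest.mock and route uploads and downloads into memory. The sketch below is purely illustrative: the method names mirror the azure-storage-blob BlobClient API (upload_blob / download_blob), but the fake object and the way it would be injected into tensorizer's tests are assumptions, not the PR's internals:

```python
import io
from unittest import mock

# In-memory stand-in for blob storage.
storage = {}

# Fake client whose upload/download round-trip through the dict above,
# never touching the network.
fake_client = mock.MagicMock()
fake_client.upload_blob.side_effect = lambda data, **kw: storage.__setitem__(
    "blob", data.read() if hasattr(data, "read") else data
)
fake_client.download_blob.side_effect = lambda **kw: mock.MagicMock(
    readall=lambda: storage["blob"]
)

# Round-trip some bytes, as a serializer/deserializer pair would.
fake_client.upload_blob(io.BytesIO(b"tensor-bytes"))
print(fake_client.download_blob().readall())  # b'tensor-bytes'
```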