aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.

Support for parquet encoder and decoder #127

Open lorenzwalthert opened 1 year ago

lorenzwalthert commented 1 year ago

Describe the feature you'd like
Support for the parquet MIME type in the SageMaker inference toolkit. For example, the README of this repo contains an example default_input_fn():

    def default_input_fn(self, input_data, content_type, context=None):
        """A default input_fn that can handle JSON, CSV and NPZ formats.

        Args:
            input_data: the request payload serialized in the content_type format
            content_type: the request content_type
            context (obj): the request context (default: None).

        Returns: input_data deserialized into torch.FloatTensor or torch.cuda.FloatTensor, depending on whether CUDA is available.
        """
        return decoder.decode(input_data, content_type)
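
For reference, decoder.decode dispatches on the content type. A minimal usage sketch (assuming the toolkit is importable as sagemaker_inference):

    from sagemaker_inference import content_types, decoder

    # the payload is routed to the CSV decoder based on the MIME type
    array = decoder.decode("1.0,2.0,3.0", content_types.CSV)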

Looking into decoder.decode, I see the following MIME types are supported:

    _decoder_map = {
        content_types.NPY: _npy_to_numpy,
        content_types.CSV: _csv_to_numpy,
        content_types.JSON: _json_to_numpy,
        content_types.NPZ: _npz_to_sparse,
    }

It should not be too hard to add parquet here. Parquet is a columnar data format commonly used with large datasets, and it is already supported in other SageMaker services, for example Autopilot.
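
For illustration, a decoder/encoder pair along the lines of the existing _csv_to_numpy could look like the sketch below. This is only a sketch: the PARQUET constant ("application/x-parquet") is hypothetical and not part of content_types today, and it assumes pandas with a parquet engine such as pyarrow is available in the container.

    import io

    import numpy as np
    import pandas as pd

    # hypothetical constant; content_types currently has no PARQUET member
    PARQUET = "application/x-parquet"


    def _parquet_to_numpy(bytes_like):
        """Deserialize a parquet payload (bytes) into a numpy array."""
        return pd.read_parquet(io.BytesIO(bytes_like)).to_numpy()


    def _array_to_parquet(array_like):
        """Serialize an array-like into parquet bytes."""
        df = pd.DataFrame(np.asarray(array_like))
        df.columns = df.columns.astype(str)  # parquet requires string column names
        buffer = io.BytesIO()
        df.to_parquet(buffer)
        return buffer.getvalue()

Registering the decoder would then be a one-line addition to the map above, e.g. _decoder_map[PARQUET] = _parquet_to_numpy, with an analogous entry on the encoder side.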

How would this feature be used? Please describe.
Reduce storage and data I/O costs, and increase processing speed.

Describe alternatives you've considered

CSV is the de facto standard, but it is a far less efficient format for storing, reading, and writing column-oriented data.
