- Add a Timestamp to Your Export Path: When you save the Keras model manually, it is recommended to add the export timestamp to the export path.
```python
import tensorflow as tf
import time
model_path = trainer.outputs.model.get()[0].uri + '/Format-Serving'
model = tf.keras.models.load_model(model_path)
ts = int(time.time())
file_path = "./saved_models/{}".format(ts)
tf.keras.models.save_model(model, file_path, save_format="tf")
```
Native Ubuntu Installation
The installation steps are similar to those for other nonstandard Ubuntu packages. First, you need to add a new package source to the distribution's source list, or add a new list file to the sources.list.d directory, by executing the following in your Linux terminal. The snippet below first determines whether a sudo prefix is needed (it isn't on Google Colab):
```python
import sys

# We need the sudo prefix if not on Google Colab.
if 'google.colab' not in sys.modules:
    SUDO_IF_NEEDED = 'sudo'
else:
    SUDO_IF_NEEDED = ''

# This is the same as you would do from your command line, but without the [arch=amd64], and no sudo.
# You would instead do:
# echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list && \
# curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
```
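With the prefix in place, add the TensorFlow Serving package source and key, then install the model server (notebook cell syntax; the `!` prefix and the `{SUDO_IF_NEEDED}` substitution assume an IPython/Colab environment):

```
!echo "deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | {SUDO_IF_NEEDED} tee /etc/apt/sources.list.d/tensorflow-serving.list && \
 curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | {SUDO_IF_NEEDED} apt-key add -
!{SUDO_IF_NEEDED} apt update
!{SUDO_IF_NEEDED} apt-get install tensorflow-model-server
```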
- Single Model Configuration: If you want to run TensorFlow Serving by loading a single model and switching to newer model versions when they are available, the single model configuration is preferred.
- You can run it with the command shown below.
- By default, TensorFlow Serving is configured to create both a representational state transfer (REST) and a Google Remote Procedure Calls (gRPC) endpoint. By specifying both ports, 8500 and 8501, we expose the gRPC and REST capabilities. To run the server in a single model configuration, you also need to specify the model name with `--model_name`.
- By default, TensorFlow Serving will load the model with the highest version number. If you use the export methods shown earlier, all models will be exported in folders with the epoch timestamp as the folder name. Therefore, newer models will have a higher version number than older models.
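In a notebook environment, the server can be started as a background process with:

```
%%bash --bg
nohup tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=my_model \
  --model_base_path=/content/saved_models >server.log 2>&1
```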
- Multiple Model Configuration
- You can also configure TensorFlow Serving to load multiple models at the same time. To do that, you need to create a configuration file to specify the models:
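```
model_config_list {
  config {
    name: 'my_model'
    base_path: '/models/my_model/'
    model_platform: 'tensorflow'
  }
  config {
    name: 'another_model'
    base_path: '/models/another_model/'
    model_platform: 'tensorflow'
  }
}
```

Instead of a model name and base path, you then point the server at the configuration file:

```
$ tensorflow_model_server --port=8500 \
    --rest_api_port=8501 \
    --model_config_file=/models/model_config
```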
- Configure Specific Model Versions
- There are situations when you want to load not just the latest model version but all or specific model versions. If you want to load a set of available model versions, you can extend the model configuration file as shown below:
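To load all available versions of a model:

```
...
config {
  name: 'another_model'
  base_path: '/models/another_model/'
  model_version_policy: {all: {}}
}
...
```

To load only specific versions, list them explicitly:

```
...
config {
  name: 'another_model'
  base_path: '/models/another_model/'
  model_version_policy {
    specific {
      versions: 1556250435
      versions: 1556251435
    }
  }
}
...
```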
- You can even give the model versions labels. Labels can be extremely handy later when you want to make predictions from the models. At the time of writing, version labels were only available through TensorFlow Serving's gRPC endpoints:
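```
...
model_version_policy {
  specific {
    versions: 1556250435
    versions: 1556251435
  }
}
version_labels {
  key: 'stable'
  value: 1556250435
}
version_labels {
  key: 'testing'
  value: 1556251435
}
...
```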
- URL structure
- The URL for your HTTP request to the model server contains information about which model and which version you would like to infer: `http://{HOST}:{PORT}/v1/models/{MODEL_NAME}[/versions/{MODEL_VERSION}]:{VERB}`
- HOST: The host is the IP address or domain name of your model server. If you run your model server on the same machine where you run your client code, you can set the host to localhost.
- PORT: You'll need to specify the port in your request URL. The standard port for the REST API is 8501. If it conflicts with other services in your service ecosystem, you can change the port in your server arguments during the startup of the server.
- MODEL_NAME: The model name needs to match the name of your model when you either set up your model configuration or started up the model server.
- VERB: The type of request is specified through the verb in the URL. You have three options: predict, classify, or regress. The verb corresponds to the signature methods of the endpoint.
- MODEL_VERSION: If you want to make predictions from a specific model version, you'll need to extend the URL with the model version identifier.
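- For example, a prediction request to version 1556250435 of my_model on the local machine would use the URL `http://localhost:8501/v1/models/my_model/versions/1556250435:predict` (assuming the default REST port).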
- Payloads
- With the URL in place, let's discuss the request payloads. TensorFlow Serving expects the input data as a JSON data structure, as shown in the following example:
```
{
    "signature_name": <string>,
    "instances": <list of values or objects>
}
```
- The signature_name is not required. If it isn't specified, the model server will use the model graph signature tagged with the default serving label.
- The input data is expected either as a list of objects or as a list of input values. To submit multiple data samples, you can submit them as a list under the instances key.
- If you want to submit one data example for inference, you can use inputs and list all input values as a list. One of the two keys, instances or inputs, has to be present, but never both at the same time:
```
{
    "signature_name": <string>,
    "inputs": <list of input values>
}
```
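For instance, the row-format payload that the Python client below submits for a single text sample looks like this:

```
{
    "instances": ["classify my text"]
}
```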
- Example model prediction request with a Python client
```python
import requests
def get_rest_request(text, model_name='my_model'):
    # Exchange localhost with an IP address if the server is not running on the same machine.
    url = "http://localhost:8501/v1/models/{}:predict".format(model_name)
    # Add more examples to the instances list if you want to infer more samples.
    payload = {"instances": [text]}
    response = requests.post(url=url, json=payload)
    return response

rs_rest = get_rest_request(text="classify my text")
rs_rest.json()
```
Using TensorFlow Serving via gRPC
If you want to use the model with gRPC, the steps are slightly different from the REST API requests.
First, you establish a gRPC channel. The channel provides the connection to the gRPC server at a given host address and over a given port. If you require a secure connection, you need to establish a secure channel at this point. Once the channel is established, you'll create a stub. A stub is a local object which replicates the available methods from the server:
```python
import grpc
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import tensorflow as tf
```
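The channel and stub can be created with a small helper function:

```python
def create_grpc_stub(host, port=8500):
    hostport = "{}:{}".format(host, port)
    channel = grpc.insecure_channel(hostport)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    return stub
```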
- Once the gRPC stub is created, we can set the model and the signature to access predictions from the correct model and submit our data for the inference:
```python
def grpc_request(stub, data_sample, model_name="my_model", signature_name="classification"):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name
    # inputs is the name of the input of our neural network.
    request.inputs['inputs'].CopyFrom(tf.make_tensor_proto(data_sample, shape=[1, 1]))
    # 10 is the maximum time in seconds before the function times out.
    result_future = stub.Predict.future(request, 10)
    return result_future
```
With these two functions now available, we can infer our example dataset with two function calls:
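A minimal sketch, assuming the model server runs locally and `data_sample` holds the example we want to classify:

```python
# Create the stub against the local gRPC endpoint and submit the sample.
stub = create_grpc_stub("localhost", port=8500)
rs_grpc = grpc_request(stub, data_sample)
```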
A/B testing is an excellent methodology to test different models in real-life situations. In this scenario, a certain percentage of clients will receive predictions from model version A and all other requests will be served by model version B.
We discussed earlier that you could configure TensorFlow Serving to load multiple model versions and then specify the model version in your REST request URL or gRPC specifications.
TensorFlow Serving doesn't support server-side A/B testing, meaning that the model server can't split traffic arriving at a single endpoint between two model versions. But with a little tweak to our request URL, we can provide basic support for random A/B testing from the client side.
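First, we need a small helper that assembles the request URL and, when given one, pins a specific model version:

```python
def get_rest_url(model_name, host="localhost", port=8501, verb="predict", version=None):
    url = "http://{}:{}/v1/models/{}".format(host, port, model_name)
    if version:
        url += "/versions/{}".format(version)
    url += ":{}".format(verb)
    return url
```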
With the helper in place, each client request randomly picks the model version:

```python
# The random library will help us pick a model.
from random import random

# Submit 10% of all requests from this client to version 1;
# 90% of the requests should go to the default model.
threshold = 0.1
# If version is None, TensorFlow Serving will infer with the default version.
version = 1 if random() < threshold else None
url = get_rest_url(model_name='complaints_classification', version=version)
```
- As you can see, randomly changing the request URL for our model inference (in our REST API example) can provide you with some basic A/B testing functionality. If you would like to extend these capabilities by performing the random routing of the model inference on the server side, we highly recommend using routing tools like [Istio](https://istio.io) for this purpose. Originally designed for web traffic, Istio can be used to route traffic to specific models. You can phase in models, perform A/B tests, or create policies for data routed to specific models.
- Requesting Model Metadata from the Model Server
- The metadata provided by the model server will contain the information to annotate your feedback loops.
- REST Requests for Model Metadata
- Requesting model metadata is straightforward with TensorFlow Serving, which provides an endpoint for model metadata: `http://{HOST}:{PORT}/v1/models/{MODEL_NAME}[/versions/{MODEL_VERSION}]/metadata`
- Example model metadata request with a Python client
```python
import requests
def metadata_rest_request(model_name, host="localhost", port=8501, version=None):
    url = "http://{}:{}/v1/models/{}".format(host, port, model_name)
    if version:
        url += "/versions/{}".format(version)
    # Append /metadata for model information.
    url += "/metadata"
    # Perform a GET request.
    response = requests.get(url=url)
    return response
```
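For example, to fetch the metadata of the default version of my_model served locally:

```python
rs_metadata = metadata_rest_request(model_name="my_model")
print(rs_metadata.json())
```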
Batching Inference Requests
Batching inference requests is one of the most powerful features of TensorFlow Serving. During model training, batching accelerates training because we can parallelize the computation of our training samples, and we use the computation hardware efficiently if we match the memory requirements of our batches with the available memory of the GPU. TensorFlow Serving applies the same idea to inference: incoming requests are collected briefly and then run through the model together as one batch, which uses the hardware far more efficiently than handling each request individually.
Configuring Batch Predictions
Batching predictions needs to be enabled for TensorFlow Serving and then configured for your use case. You have five configuration options:
max_batch_size: This parameter controls the batch size. Large batch sizes will increase the request latency and can lead to exhausting the GPU memory. Small batch sizes lose the benefit of using optimal computation resources.
batch_timeout_micros: This parameter sets the maximum wait time for filling a batch. This parameter is handy to cap the latency for inference requests.
num_batch_threads: The number of threads configures how many CPU or GPU cores can be used in parallel.
max_enqueued_batches: This parameter sets the maximum number of batches queued for prediction. This configuration is beneficial to avoid an unreasonable backlog of requests. If the maximum number is reached, requests will be returned with an error instead of being queued.
pad_variable_length_inputs: This Boolean parameter determines if input tensors with variable lengths will be padded to the same lengths for all input tensors.
You can set these parameters in a text file. In our example, we create a configuration file called batching_parameters.txt and add the following content:
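The file uses the protobuf text format. A minimal sketch (the field names correspond to the options above; the values here are purely illustrative):

```
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 100 }
pad_variable_length_inputs: true
```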
If you want to enable batching, you need to pass two additional parameters to the Docker container running TensorFlow Serving: set enable_batching to true and set batching_parameters_file to the absolute path of the batching configuration file inside of the container. Please keep in mind that you have to mount an additional folder with the configuration file if it isn't located in the same folder as the model versions.
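A sketch of the corresponding docker run invocation, assuming hypothetical host paths /path/to/models and /path/to/batch_config for the model and the configuration file:

```
docker run -p 8500:8500 -p 8501:8501 \
    --mount type=bind,source=/path/to/models,target=/models/my_model \
    --mount type=bind,source=/path/to/batch_config,target=/server_config \
    -e MODEL_NAME=my_model -t tensorflow/serving \
    --enable_batching=true \
    --batching_parameters_file=/server_config/batching_parameters.txt
```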
TensorFlow Serving comes with a variety of additional optimization features. Additional feature flags are:
--file_system_poll_wait_seconds=1: TensorFlow Serving polls whether a new model version is available. You can disable polling by setting the parameter to a negative value. If you only want to load the model once and never update it, you can set it to 0. The parameter expects an integer value. If you load models from cloud storage buckets, we highly recommend that you increase the polling interval to avoid unnecessary cloud provider charges for the frequent list operations on the cloud storage bucket.
--tensorflow_session_parallelism=0: With the default value of 0, TensorFlow Serving will automatically determine how many threads to use for a TensorFlow session. If you want to set the number of threads manually, you can overwrite it by setting this parameter to any positive integer value.
--tensorflow_intra_op_parallelism=0: This parameter sets the number of threads used to parallelize the execution of individual operations. If the value is 0, all available cores will be used.
--tensorflow_inter_op_parallelism=0: This parameter sets the number of available threads in a pool to execute TensorFlow ops. This is useful for maximizing the execution of independent operations in a TensorFlow graph. If the value is set to 0, all available cores will be used and one thread per core will be allocated.
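For example, when serving models from a cloud storage bucket you might start the server with a longer polling interval; a sketch combining the earlier configuration file with the polling flag:

```
$ tensorflow_model_server --port=8500 \
    --rest_api_port=8501 \
    --model_config_file=/models/model_config \
    --file_system_poll_wait_seconds=60
```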
The following snippet loads a few transformed evaluation examples produced by the pipeline's Transform component (e.g., to use as sample payloads for inference requests):

```python
# Load the transformed evaluation examples written by the Transform component.
eval_data = tf.data.TFRecordDataset(
    '/content/tfx/Transform/transformed_examples/5/Split-eval/transformed_examples-00000-of-00001.gz',
    compression_type="GZIP")
# Take one record and parse it into a tf.train.Example.
subset = eval_data.take(1)
eval_examples = [tf.train.Example.FromString(d.numpy()) for d in subset]
```
!echo "deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | {SUDO_IF_NEEDED} tee /etc/apt/sources.list.d/tensorflow-serving.list && \ curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | {SUDO_IF_NEEDED} apt-key add - !{SUDO_IF_NEEDED} apt update
!{SUDO_IF_NEEDED} apt-get install tensorflow-model-server
%%bash --bg nohup tensorflow_model_server \ --rest_api_port=8501 \ --model_name=my_model \ --model_base_path=/content/saved_models >server.log 2>&1
model_config_list { config { name: 'my_model' base_path: '/models/my_model/' model_platform: 'tensorflow' } config { name: 'another_model' base_path: '/models/another_model/' model_platform: 'tensorflow' } }
$ tensorflow_model_server --port=8500 \ --rest_api_port=8501 \ --model_config_file=/models/model_config
... config { name: 'another_model' base_path: '/models/another_model/' model_version_policy: {all: {}} } ...
... config { name: 'another_model' base_path: '/models/another_model/' model_version_policy: { specific { versions: 1556250435 versions: 1556251435 } } } ...
... model_version_policy: { specific { versions: 1556250435 versions: 1556251435 } } version_labels { key: 'stable' value: 1556250435 } version_labels { key: 'testing' value: 1556251435 } ...
{ "signature_name":,
"instances":
}
{ "signature_name":,
"inputs":
}
def create_grpc_stub(host, port=8500): hostport = "{}:{}".format(host, port) channel = grpc.insecure_channel(hostport) stub = prediction_service_pb2_grpc.PredictionServiceStub(channel) return stub
def get_rest_url(model_name, host="localhost", port=8501, verb='predict', version=None): url = "http://{}:{}/v1/models/{}/".format(host, port, model_name) if version: url += "versions/{}".format(version) url += ":{}".format(verb) return url ...
Submit 10% of all requests from this client to version 1.
90% of the requests should go to the default models.
threshold = 0.1
If version = None, TensorFlow Serving will infer with the default version.
version = 1 if random() < threshold else None url = get_rset_url(model_name='complaints_classification', version=version)
http://{HOST}:{PORT}/v1/models/{MODEL_NAME}[/versions/{MODEL_VERSION}]/metadata