Open dhruvilgala opened 4 years ago
Instead of loading the models on every GET /predict call, keep them loaded in memory on the EC2 instance. The server should wait for the GET call and then run inference on the preloaded models. This should bring latency down to under 1 second.
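A minimal sketch of the idea, assuming a Python server process: the expensive load happens once at startup, and the request handler only runs inference. The `load_model` function, the dummy model, and `handle_predict` are all illustrative stand-ins for the project's actual loading code and web framework.

```python
import time

def load_model():
    # Stand-in for an expensive one-time load
    # (e.g., torch.load, joblib.load, tf.saved_model.load).
    time.sleep(0.1)
    return lambda x: x * 2  # dummy model for illustration

# Loaded exactly once when the server process starts on the EC2 instance,
# NOT inside the request handler.
MODEL = load_model()

def handle_predict(x):
    # Each GET /predict call only pays the inference cost here;
    # the load cost was paid once at startup.
    return MODEL(x)
```

In a real deployment the same pattern applies at module scope (or in a framework startup hook), so every worker process pays the load cost once and serves all subsequent requests from memory.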