Closed by rodrigoalmeida94 2 years ago
We have a 3-band icevision model in pytorch pth format that you can work with to develop the serving for the tile segmentation: https://console.cloud.google.com/storage/browser/ceruleanml/experiments/cv2/05062022_ep10?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&project=cerulean-338116&prefix=&forceOnObjectsSortingFiltering=false
This isn't a high-performing model that can be used to test inference performance, but it can be used to test serving for 3-band images.
Rodrigo will set up an example Cloud Run function with this test model to test our assumptions about inference time.
Based on @rodrigoalmeida94's findings on the cost of running the Cloud Run inference without classification, we can decide whether it is worth building a classification model Cloud Run function. cc @jonaraphael
@rbavery I wanted to try the model above to check how I should format the inputs, but when I tried to load it on my local machine I got the following error:
import torch
model = torch.load("/Users/rodrigoalmeida/cerulean-cloud/cerulean_cloud/cloud_run_offset_tiles/model/experiments_cv2_05062022_ep10_05062022_ep10.pth")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/rodrigoalmeida/.virtualenvs/cerulean-cloud/lib/python3.8/site-packages/torch/serialization.py", line 607, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/Users/rodrigoalmeida/.virtualenvs/cerulean-cloud/lib/python3.8/site-packages/torch/serialization.py", line 882, in _load
result = unpickler.load()
File "/Users/rodrigoalmeida/.virtualenvs/cerulean-cloud/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'BackboneWithFPN' object has no attribute 'param_groups'
Should I be loading the model in some other way using icevision?
I'm guessing maybe we need to save the model like here so we don't have to pass the model architecture.
Because I was blocked by the issue above but wanted to show you inference running with Cloud Run, I went ahead and created another Cloud Run function with what I included in this Gist: https://gist.github.com/rodrigoalmeida94/3c2f5f96666bd23374e28f9cc31449cc
The model.pt file that is referenced can be found in this file in GCS. I generated this "nonsense" model with:
from torchvision import models
import torch
model = models.resnet18(pretrained=True)
sm = torch.jit.script(model)
sm.save("resnet-18.pt")
This is the Cloud Run URL: https://torch-inference-5qkjkyomta-ey.a.run.app and these are its metrics: https://console.cloud.google.com/run/detail/europe-west3/torch-inference/metrics?project=cerulean-338116
Warm-up can take up to 4 s, but once the function is warm we get an inference response in 300-400 ms.
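A rough way to reproduce the warm-up versus warm comparison is to time the first call separately from the ones that follow. The sketch below does this for an arbitrary callable. `time_calls` and `fake_inference` are hypothetical names, and the short sleep only stands in for a real request to the Cloud Run endpoint (the request format used in the Gist is not assumed here).

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], n_calls: int = 5) -> List[float]:
    """Call fn n_calls times and return each call's duration in seconds."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_inference() -> None:
    # Hypothetical stand-in for an HTTP request to the inference endpoint.
    time.sleep(0.01)


durations = time_calls(fake_inference, n_calls=5)
first, later = durations[0], durations[1:]
print(f"first call: {first * 1000:.0f} ms")
print(f"later calls (mean): {sum(later) / len(later) * 1000:.0f} ms")
```

With a real endpoint, the first duration would include the cold start and the later ones would approximate the warm latency.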
Awesome. Assigning to @lillythomas as well to pair up with you when you return. We'll be using the unet model to test for now instead of the icevision model.
@rodrigoalmeida94 I've PRed code to save and load torch models here: https://github.com/SkyTruth/cerulean-ml/pull/85/files
and an example model trained for 1 epoch is at the mounted ceruleanml bucket under /root/data/experiments/cv2/20_May_2022_19_29_39_fastai_unet/tracing_test_1batch_18_512_0.125.pt
The output of a model loaded with torch tracing is the logits, with shape [1, 7, 512, 512] (1 batch, 7 classes, though we will later not include ambiguous).
These are softmaxed over the class dimension and then argmaxed to get arrays representing the confidence scores and the maximally confident classes, with shape [1, 512, 512].
I'll add docstrings to the above PR on Monday to wrap it up, but feel free to use these funcs now @rodrigoalmeida94
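For reference, the softmax and argmax step can be sketched as follows. This is a minimal illustration on random logits of the shape mentioned above, not the code from the PR; the variable names are made up for the example.

```python
import torch

# Dummy logits standing in for the traced model's output:
# 1 batch, 7 classes, 512 x 512 pixels.
logits = torch.randn(1, 7, 512, 512)

# Softmax over the class dimension turns logits into per-class probabilities.
probs = torch.softmax(logits, dim=1)

# The max over the class dimension gives, for every pixel, the confidence of
# the winning class and its index (the argmax).
confidence, classes = torch.max(probs, dim=1)

print(confidence.shape)  # torch.Size([1, 512, 512])
print(classes.shape)     # torch.Size([1, 512, 512])
```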
@rbavery thanks a lot for this! I've tried running this locally on my machine, and when I use the load_tracing_model function I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/rodrigoalmeida/cerulean-cloud/cerulean_cloud/cloud_run_offset_tiles/handler.py", line 28, in get_model
return load_tracing_model("model/model.pt")
File "/Users/rodrigoalmeida/cerulean-cloud/cerulean_cloud/cloud_run_offset_tiles/handler.py", line 22, in load_tracing_model
tracing_model = torch.jit.load(savepath)
File "/Users/rodrigoalmeida/.virtualenvs/cerulean-cloud/lib/python3.8/site-packages/torch/jit/_serialization.py", line 161, in load
cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
NotImplementedError: Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [CPU, Meta, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
CPU: registered at aten/src/ATen/RegisterCPU.cpp:18433 [kernel]
Meta: registered at aten/src/ATen/RegisterMeta.cpp:12703 [kernel]
BackendSelect: registered at aten/src/ATen/RegisterBackendSelect.cpp:665 [kernel]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:47 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: fallthrough registered at ../aten/src/ATen/ConjugateFallback.cpp:22 [kernel]
Negative: fallthrough registered at ../aten/src/ATen/native/NegateFallback.cpp:22 [kernel]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradLazy: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradMLC: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:10483 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_2.cpp:11423 [kernel]
UNKNOWN_TENSOR_TYPE_ID: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:466 [backend fallback]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:305 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Could this be a configuration error? It seems to be expecting a CUDA backend. Ideally we would be able to load this model in an environment that only includes the torch package.
I'll check on this. I think there may be a setting related to the CUDA device that is needed when exporting and that I didn't use.
This is actually because I traced and saved the model on a GPU. I can save it for CPU. Traced models are not device (CPU/GPU) agnostic, so I should include that info in the file name.
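A minimal sketch of tracing on the CPU and loading the result in a torch-only environment might look like the following. The small `nn.Sequential` is a hypothetical stand-in for the real unet, and the file name is made up for the example. `map_location` is an existing argument of `torch.jit.load`.

```python
import os
import tempfile

import torch
from torch import nn

# Hypothetical stand-in for the real unet: 3 input bands, 7 output classes.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 7, kernel_size=1),
)

# Move the model and the example input to the CPU before tracing, so that no
# CUDA tensors end up in the saved file.
model = model.to("cpu").eval()
example_input = torch.rand(1, 3, 64, 64, device="cpu")
traced = torch.jit.trace(model, example_input)

# Record the device in the file name, as suggested in the thread.
savepath = os.path.join(tempfile.mkdtemp(), "tracing_cpu_example.pt")
traced.save(savepath)

# map_location asks torch.jit.load to place the loaded tensors on the CPU.
loaded = torch.jit.load(savepath, map_location="cpu")
with torch.no_grad():
    out = loaded(example_input)
print(out.shape)  # torch.Size([1, 7, 64, 64])
```

Whether `map_location="cpu"` alone would rescue a file that was traced on a GPU is not something this sketch verifies; re-exporting on the CPU is the approach taken in the thread.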
@rodrigoalmeida94 Ok this model should work, I tested locally on my mac: /root/data/experiments/cv2/24_May_2022_01_49_56_fastai_unet/tracing_cpu_test_1batch_18_512_0.082.pt
>>> torch.jit.load("../tracing_cpu_test_1batch_18_512_0.082.pt")
RecursiveScriptModule(
original_name=DynamicUnet
(layers): RecursiveScriptModule(
original_name=ModuleList
(0): RecursiveScriptModule(
original_name=Sequential
(0): RecursiveScriptModule(original_name=Conv2d)
(1): RecursiveScriptModule(original_name=BatchNorm2d)
........
Something to note: if the instance has a GPU and inference is being run on a PyTorch dataloader, the dataloader will need to be moved to the CPU, like so:
import os
import torch
# load_tracing_model and test_tracing_model_one_batch are the helper functions from the PR linked above.
# dls is an already-constructed dataloaders object.
experiment_dir = '/root/data/experiments/cv2/24_May_2022_01_49_56_fastai_unet/'
savename = "tracing_cpu_test_1batch_18_512_0.082.pt"
tracing_model = load_tracing_model(os.path.join(experiment_dir, savename))
out_batch_logits = test_tracing_model_one_batch(dls.to('cpu'), tracing_model)
I could load the model @rbavery ! Thanks so much 👍
@rbavery what is this model expecting as input?
I passed a tensor with shape [1,1,512,512] and got RuntimeError: Given groups=1, weight of size [64, 3, 7, 7], expected input[1, 1, 512, 512] to have 3 channels, but got 1 channels instead.
I suppose it wants a 3-band image, but is this the composite of VV and aux datasets or just the VV band represented as RGB? What matters more right now is performance, so I'll keep on developing with a dummy array.
@rodrigoalmeida94 It's expecting the 3-band input, where auxiliary datasets are separate channels.
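To illustrate the channel requirement, here is a small sketch with a convolution whose weight has the same shape as in the error message, [64, 3, 7, 7]. The stride and padding values, the tensor names, and the channel order (VV first, then two auxiliary layers) are assumptions made for the example, not details confirmed in this thread.

```python
import torch
from torch import nn

# A first layer with the same weight shape as in the error: [64, 3, 7, 7].
# stride and padding are assumptions; they do not affect the channel check.
first_conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Hypothetical single-band tiles: VV plus two auxiliary layers.
vv = torch.rand(1, 1, 512, 512)
aux_a = torch.rand(1, 1, 512, 512)
aux_b = torch.rand(1, 1, 512, 512)

# A 1-channel input reproduces the RuntimeError from the thread.
try:
    first_conv(vv)
    single_band_ok = True
except RuntimeError:
    single_band_ok = False

# Concatenating along the channel dimension gives the 3-band input.
three_band = torch.cat([vv, aux_a, aux_b], dim=1)
out = first_conv(three_band)

print(single_band_ok)    # False
print(three_band.shape)  # torch.Size([1, 3, 512, 512])
print(out.shape)         # torch.Size([1, 64, 256, 256])
```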
This notebook demonstrates the working version of the Cloud Run function for inference with the PyTorch model Ryan provided me: https://github.com/SkyTruth/cerulean-cloud/blob/cloud-run-inference/notebooks/test_cloud_run_offset_tile.ipynb
A couple of steps in here:
The Pulumi deployment code is also in this branch, but I'm facing a pesky race condition while building Docker images. I've reported this at https://github.com/pulumi/pulumi-docker/issues/245 and am waiting for feedback (I did this deployment manually).