DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

[Question] exported ONNX model does not result in same output as the original pytorch model #394

Open VineetTambe opened 11 months ago

VineetTambe commented 11 months ago

❓ Question

I am trying to export the trained pytorch model to onnx so that I can deploy it. But I am facing an issue where the output of the exported model is not the same as that of the pytorch model when I run an episode. I have made sure to set the model to eval mode before exporting. I heavily modified the enjoy.py script to export and run the models.

Exporting to ONNX:

    torch_model = ALGOS[algo].load(
        model_path, custom_objects=custom_objects, device=args.device, **kwargs
    )
    torch_model.policy.eval()

    obs = env.reset()

    obs_tensor = torch_model.policy.obs_to_tensor(obs)[0]
    # Export the model
    torch.onnx.export(
        torch_model.policy,  # model being run
        obs_tensor,  # model input (or a tuple for multiple inputs)
        output_model_name,  # where to save the model (can be a file or file-like object)
        export_params=True,  # store the trained parameter weights inside the model file
        opset_version=10,  # the ONNX version to export the model to
        do_constant_folding=True,  # whether to execute constant folding for optimization
        input_names=["input"],  # the model's input names
        output_names=["output"],  # the model's output names
        dynamic_axes={
            "input": {0: "batch_size"},  # variable length axes
            "output": {0: "batch_size"},
        },
    )
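Right after exporting, I also validate the file (simplified):

    import onnx

    # Structural sanity check of the exported graph
    onnx_model = onnx.load(output_model_name)
    onnx.checker.check_model(onnx_model)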

Running inference using the onnx model:

    # Run the exported model with onnxruntime
    ort_session = onnxruntime.InferenceSession(onnx_model_path)
    # Note: obs keeps whatever dtype/shape the env returns,
    # which must match the input the model was traced with
    ort_inputs = {ort_session.get_inputs()[0].name: obs}
    action = ort_session.run(None, ort_inputs)[0]
    obs, reward, done, infos = env.step(action)
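And this is roughly how I compare the two outputs (simplified):

    # Compare the ONNX action against the PyTorch greedy action
    # for the same observation
    onnx_action = ort_session.run(None, {ort_session.get_inputs()[0].name: obs})[0]
    torch_action, _ = torch_model.predict(obs, deterministic=True)
    print(onnx_action, torch_action)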

The above are the only modifications done to enjoy.py in order to export and run the model. However, the results of the trained agent are not the same. Am I missing something obvious here? Any help would be greatly appreciated!


araffin commented 11 months ago

Hello, could you be more specific about which algo/env you are using?

VineetTambe commented 11 months ago

Hey,

I am using the qrdqn algo and a custom environment based on top of the minigrid env

araffin commented 11 months ago

> I am using the qrdqn algo and a custom environment based on top of the minigrid env

Could you share the observation and action spaces?

You are probably missing pre-processing, see https://github.com/DLR-RM/stable-baselines3/issues/1349#issuecomment-1446161768 (we welcome a PR that updates our doc).
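The pattern from that comment looks roughly like this (a simplified sketch, not the exact code from the linked issue):

    import torch as th

    class OnnxablePolicy(th.nn.Module):
        """Wrapper so that greedy action selection (and the policy's
        built-in pre-processing) is part of the exported graph."""

        def __init__(self, policy: th.nn.Module):
            super().__init__()
            self.policy = policy

        def forward(self, observation: th.Tensor) -> th.Tensor:
            # deterministic=True selects the greedy action,
            # matching model.predict(..., deterministic=True)
            return self.policy(observation, deterministic=True)

    # Export the wrapper instead of the bare policy:
    # th.onnx.export(OnnxablePolicy(model.policy), dummy_input, "model.onnx", ...)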

VineetTambe commented 11 months ago

> Could you share the observation and action spaces?

Observation space: Box(0, 255, (50,), uint8)
Action space: Discrete(4)

> You are probably missing pre-processing

I tried doing what is done in the linked comment, which is to create a new pytorch model class that has the policy preprocessing step in its forward pass (please correct me if I am wrong here).

What exactly does the pre-processing entail? Is there anything more to it? Even after doing the above step, I get the same incorrect results. Is there any postprocessing step that I might be missing?

araffin commented 11 months ago

You are probably either missing the image pre-processing (dividing observations by 255 before feeding them to the network) or not comparing against the greedy policy.

The following works and was tested by comparing the returned quantiles:

    import numpy as np
    import torch as th
    from sb3_contrib import QRDQN

    model = QRDQN("MlpPolicy", "LunarLander-v2")
    model.policy.to("cpu")
    # Note: by default model.policy.quantile_net.forward() returns quantiles
    onnxable_model = model.policy
    observation_size = model.observation_space.shape[0]

    dummy_input = th.randn(1, observation_size)
    onnx_path = "qrdqn_model.onnx"
    th.onnx.export(
        onnxable_model,
        dummy_input,
        onnx_path,
        opset_version=17,
        input_names=["input"],
    )

    ##### Load and test with onnx

    import numpy as np
    import onnx
    import onnxruntime as ort

    onnx_model = onnx.load(onnx_path)
    onnx.checker.check_model(onnx_model)

    # observation = np.zeros((1, observation_size)).astype(np.float32)
    observation = dummy_input.cpu().numpy()
    ort_sess = ort.InferenceSession(onnx_path)
    action = ort_sess.run(None, {"input": observation})[0]

    print(action)
    print(model.predict(observation, deterministic=True)[0])
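For an observation space like yours (Box(0, 255, (50,), uint8)), the same pre-processing has to happen on the ONNX side if it is not baked into the exported graph. A sketch, assuming the /255 image scaling applies to your custom env:

    # Hypothetical adaptation for a uint8 observation:
    # scale to [0, 1] (as SB3 does for image observations)
    # and feed float32 to the session
    observation = env.reset().astype(np.float32) / 255.0
    action = ort_sess.run(None, {"input": observation})[0]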