Closed. Ivan-267 closed this 5 months ago.
Thanks for the review. I don't have a custom environment for this, as I was testing it on a modified example environment from the examples repo. Also, since I had to check all three cases (continuous only / mixed / discrete only), a single environment wouldn't cover everything.
However, here are the steps that should get it working quickly:
```gdscript
func get_action_space() -> Dictionary:
    return {
        "discrete_size_10": {
            "size": 10,
            "action_type": "discrete"
        },
        "discrete_size_5": {
            "size": 5,
            "action_type": "discrete"
        },
    }
```
```gdscript
func get_action_space() -> Dictionary:
    return {
        "discrete_size_2": {
            "size": 2,
            "action_type": "discrete"
        },
        "continuous_size_2": {
            "size": 2,
            "action_type": "continuous"
        },
    }
```
```gdscript
func get_action_space() -> Dictionary:
    # Should throw an error in Python: a discrete size of 2 is the
    # current maximum supported for mixed spaces
    return {
        "discrete_size_10": {
            "size": 10,
            "action_type": "discrete"
        },
        "continuous_size_1": {
            "size": 1,
            "action_type": "continuous"
        },
    }
```
```gdscript
func set_action(action) -> void:
    print(action)
```
You can use the same setup for testing both training and inference on the exported .onnx file.
Finally, if you want to train the agent with, e.g., discrete actions only to see whether it learns a good policy, set_action will need to be modified accordingly based on the example, but the modification should be simple for RingPong or BallChase.
Additional notes / considerations:
If you try to export to onnx with discrete actions only on the current main branch, you will receive an action-mismatch error, because the onnx file produces logits while the sb3 model produces a single integer value for each discrete action. With the PR, the outputs should match.
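The mismatch can be sketched in a few lines (hypothetical values, not the actual verification code from the repo):

```python
import numpy as np

# Hypothetical logits for a single discrete action with 5 choices,
# as an un-patched onnx export would produce them.
logits = np.array([[0.1, 2.3, -0.5, 0.7, 1.1]], dtype=np.float32)

# The sb3 model instead returns the chosen action as a single integer,
# so a naive comparison of the two outputs fails with a mismatch.
sb3_action = np.array([1], dtype=np.int64)

# Applying argmax to the logits recovers a matching integer action.
onnx_action = np.argmax(logits, axis=-1)
assert np.array_equal(onnx_action, sb3_action)
```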
Optionally, if you have e.g. Netron, you can also inspect the sb3-exported onnx file to verify what kind of output the onnx model produces: the split/softmax/argmax operations are added to the output when the environment has discrete actions only, and are not added for mixed or continuous actions (my example image in the PR is from an onnx model with 2 discrete actions). The onnx model exported from rllib did not have these added operations, so we may need to adjust that in the future (I'm not yet sure of the best approach, as I hadn't tried rllib before).
Discrete actions are selected deterministically (using argmax), and continuous action values are deterministic as well. Further modification may be needed if we want to add non-deterministic actions at inference time in the future (this may help prevent the agent from getting stuck in some cases, but I'm not yet sure about the best approach, or whether it's a needed feature).
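If stochastic inference were added later, the difference could look roughly like this sketch (hypothetical, not part of the PR):

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x):
    # Numerically stable softmax over a 1-D logits array.
    z = np.exp(x - x.max())
    return z / z.sum()


logits = np.array([2.0, 1.0, 0.1])

# Deterministic selection (current behavior): always the same action.
deterministic = int(np.argmax(logits))

# Stochastic selection (possible future option): sample an action from
# the categorical distribution defined by the logits.
probs = softmax(logits)
stochastic = int(rng.choice(len(logits), p=probs))
```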
If we want to have mixed actions where discrete actions are larger than 2, a different approach might be needed (I don't currently know the best way to approach that case).
I've made a small test environment which only outputs action values (it already includes the updated, compatible plugin). The godot-rl library still needs to be installed from the PR branch.
I've made a separate branch in the fork for it, since it's a test-only environment. I'm not sure which branch it should target, or whether we should merge it at all, but it might be useful to have some test environments for catching regressions later on.
Usage: from Godot, open the "AiController2D" node's script and use any of the predefined (or a custom) dictionaries for testing. To validate the output, check the action values printed in the Godot console after starting training with SB3 or other libraries (or after exporting an .onnx file and using it for inference).
In the current PR, the second dictionary (`mixed_types_larger_than_2`) should fail with a descriptive error message in the console from which training was launched; all others should produce valid outputs.
Please hold off on the review: an important issue spotted by @yaelatletl during his review needs to be addressed in my experimental branch of the plugin before a potential merge.
To keep things synchronized and avoid merge conflicts, once the PR https://github.com/edbeeching/godot_rl_agents_plugin/pull/18 is merged, I will sync my plugin fork and then re-implement the changes necessary to make this PR work in my experimental branch.
@Ivan-267 What is the current state of the PR? Any chance we can clean up the mess I have made? :D
I am still using a variant of this branch for experimenting with my discrete only scenes locally (without using onnx), but it will need to be reworked before a potential merge (specifically the plugin side, but potentially something here as well).
Since it alters the onnx model to output discrete actions directly when only discrete actions are used, separate handling is necessary for discrete and continuous actions on the plugin side, in both the C# code and the gdscript code. I made a mistake in my implementation of the plugin changes initially, and there are pending changes that are being worked on in the PR I linked above. After that PR is merged I could try to re-apply the changes that handle discrete only actions.
There are also possible alternatives to some of the methods used here.
Edit: The plugin was updated. As is, this still would not work with discrete-only actions in onnx models exported using rllib (if they provide logits as output), but it should work with sb3.
Superseded by https://github.com/edbeeching/godot_rl_agents/pull/181.
The branch can still be kept for reference, since this implementation has one advantage: it sends only action values for discrete actions (not logits), so it could be more efficient with large action spaces.
This is my attempt to get discrete-only actions working properly with onnx-exported models in SB3, and to get discrete actions with sizes > 2 working with SB3. As I'm not very familiar with the code, the libraries used, and ML/RL, I'll describe the approach and changes below; in case of merging, a review is requested. I'm marking this as a draft as well, so any changes can be made before merging. I've only tested this with Stable Baselines and the CleanRL example so far.
Update: After a quick test with rllib (using the yaml file from this repo, plus small file changes to make training and export work for me) on an env with discrete actions only, the onnx model output floats, which wouldn't work with this update (to be compatible, the onnx model would need to output integers for discrete actions in discrete-only action environments). It's also unlikely to work well with the current version, as the Godot plugin does not handle logits for discrete-only action spaces (adding the conversion there, instead of integrating it into the models, could be a potential alternative approach).
This change is connected to the draft PR for plugin (https://github.com/edbeeching/godot_rl_agents_plugin/pull/16/files#) - changes to both repositories are necessary.
Changes to the main repository:
Main changes to `godot_rl/core/utils.py`:
When training with SB3 on environments with only discrete actions, SB3 currently outputs integer values for an action (e.g. 0, 1, 2, 3, 4); with the CleanRL example, action values are always continuous/float values; and with mixed spaces in SB3, the values are continuous/float as well. If the received action value is an `np.int64`, it is now added to the actions without modification; otherwise, the previous logic remains, which converts the discrete action from the received float value to 1 or 0 based on its value.
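Based on that description, the pass-through logic can be sketched roughly as follows (hypothetical function name and threshold, not the exact patch):

```python
import numpy as np


def convert_discrete_action(value):
    # If the trainer already sends an integer (e.g. SB3 with a
    # discrete-only space after the change), pass it through unchanged.
    if isinstance(value, np.int64):
        return int(value)
    # Otherwise keep the previous behavior: convert the received float
    # to 1 or 0 based on its value (threshold assumed here).
    return 1 if value > 0 else 0
```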
Main changes to `godot_rl/wrappers/onnx/stable_baselines_export.py`:
Currently, when exporting models that have only discrete actions using the sb3 example, the .onnx model will produce logits for the discrete actions, while the sb3 model produces integers for each discrete action. This produces an error on the `verify_onnx_export` assertion. I've modified the `forward` method so that, when the action space is `MultiDiscrete`, it selects the best action from the distribution (deterministic actions). Otherwise (for continuous or mixed actions), the old behavior is used. I implemented this after checking the sb3 policies file to see how discrete actions are processed, but I cannot guarantee I've implemented it correctly. In the few small test cases I've tried, this makes the `verify_onnx_export` tests pass, and the model works in Godot.
With that change, if the exported model uses discrete actions only (the example from the image has 2 discrete actions), the following operations are added, which should select the action with the highest probability from the logits:
Changes to the plugin repository:
ONNXInference.cs:
Previously, the exported onnx model would output float values for actions whether continuous or discrete actions were used. After the changes above, onnx will output discrete actions as integers if only discrete actions are used in the environment. To handle that, I've made a few changes, including disposing of the `session` and `results` disposable values after use.
sync.gd:
Simple test of output values when only discrete actions are used with the sb3 example for training after the changes (printing the received actions directly in Godot):