Most of the OpenAI-specific functions, such as converting an image into base64 and getting image sizes, have been moved into OpenAIModule.
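A minimal sketch of what these helpers might look like, assuming Pillow images (the function names here are illustrative, not the module's actual API):

```python
import base64
import io

from PIL import Image


def encode_image_base64(image: Image.Image, fmt: str = "PNG") -> str:
    # Serialize the image into an in-memory buffer, then base64-encode it,
    # which is the form the OpenAI vision API expects for inline images.
    buffer = io.BytesIO()
    image.save(buffer, format=fmt)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


def get_image_size(image: Image.Image) -> tuple[int, int]:
    # Pillow exposes (width, height) directly via the .size attribute.
    return image.size
```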
OpenAIModule can take vision-language inputs in any order, in preparation for the wide range of input types in OpenX.
Multiple text inputs can now be provided, so that different modalities such as discrete observations, text observations, or rewards can each be passed as text.
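As a sketch of how arbitrarily ordered inputs could be assembled, the helper below builds an OpenAI chat-completions content list, interleaving text and image parts in whatever order they arrive (the function name is hypothetical):

```python
import base64
import io
from typing import Any

from PIL import Image


def build_content(inputs: list[Any]) -> list[dict]:
    """Assemble an OpenAI-style multimodal content list from mixed inputs."""
    content: list[dict] = []
    for item in inputs:
        if isinstance(item, Image.Image):
            # Images become base64 data URLs, the inline format the vision API accepts.
            buffer = io.BytesIO()
            item.save(buffer, format="PNG")
            b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
            content.append(
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
            )
        else:
            # Everything else is passed through as a text part, so multiple
            # text modalities (observations, rewards, ...) can coexist.
            content.append({"type": "text", "text": str(item)})
    return content
```

For example, `build_content(["discrete_observation: [1, 0, 2]", frame, "reward: 1.0"])` would yield a text part, an image part, and another text part, in that order.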
VLMModule has been implemented.
It converts any data whose type starts with "image" into image data. The remaining inputs are treated as text and rendered as "{type}: {value}" (e.g. "discrete_observation: [1, 0, 2, 1, 0, 2]").
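A sketch of that dispatch rule, under the assumption that inputs arrive as (type, value) pairs (the function and field names are illustrative):

```python
from typing import Any


def format_inputs(inputs: dict[str, Any]) -> list[dict]:
    """Split typed inputs into image parts and '{type}: {value}' text parts."""
    parts: list[dict] = []
    for input_type, value in inputs.items():
        if input_type.startswith("image"):
            # Anything typed "image..." (e.g. "image_observation") stays image data.
            parts.append({"kind": "image", "data": value})
        else:
            # All other modalities are rendered as text,
            # e.g. "discrete_observation: [1, 0, 2, 1, 0, 2]".
            parts.append({"kind": "text", "data": f"{input_type}: {value}"})
    return parts
```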
It also takes the k-shot examples and updates the history with them accordingly. Each k-shot example carries an extra field indicating whether it is a user input or a model output, so that it can be mapped into the model history as intended.
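A minimal sketch of that mapping, assuming each example is a dict with a boolean flag (the field names "is_input" and "content" are hypothetical placeholders for whatever the examples actually carry):

```python
def build_history(k_shot_examples: list[dict]) -> list[dict]:
    """Fold k-shot examples into a chat history, honoring each example's role flag."""
    history: list[dict] = []
    for example in k_shot_examples:
        # The flag decides whether the example lands in the history as a
        # user message or as an expected model output.
        role = "user" if example["is_input"] else "assistant"
        history.append({"role": role, "content": example["content"]})
    return history
```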
"discrete_observation: [1, 0, 2, 1, 0, 2]"
)