Repository for environment encoder: an attempt to improve reinforcement learning agents' generalisability by learning to act on universal multimodal embeddings generated by a vision-language model.
Right now we're using Hugging Face's standard inference procedure alongside FlashAttention-2, which is wholly inefficient compared to what we could achieve.
Integrating a framework like vLLM would be nice, but sadly it doesn't support extracting a model's `hidden_state`. Have to research around.
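For reference, a minimal sketch of the current approach: plain Hugging Face inference with FlashAttention-2, pulling the last hidden state to use as the embedding. The model ID, prompt format, and mean-pooling strategy are illustrative assumptions, not fixed choices of this repo.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: any HF vision-language model works here; LLaVA-1.5 shown as an example.
MODEL_ID = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)
model.eval()

@torch.no_grad()
def embed(image, text):
    # Assumption: LLaVA-1.5 prompt format with an <image> placeholder token.
    prompt = f"USER: <image>\n{text} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[-1] has shape (batch, seq_len, hidden_dim);
    # mean-pool over tokens to get one embedding per input.
    return outputs.hidden_states[-1].mean(dim=1)
```

This runs each forward pass through the vanilla `transformers` path, which is exactly the inefficiency noted above: no continuous batching or paged KV cache as an engine like vLLM would provide.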