Repository for environment encoder, an attempt at improving reinforcement learning agents' generalisability by learning to act on universal multimodal embeddings generated by a vision-language model.
This branch ballooned out of scope massively, oops.
Made some huge changes:
Implemented batched inference for VLMs (rough extraction sketch below).
Added a new VLM: Idefics2, which should be much better than tinyllava, although its embedding dimension is quite large (4096).
Improved steps per second (SPS) from 20 -> 30 -> 50 -> ~100 when using a VLM!
Now using the envpool file as the main entry point, which generally increases throughput, with non-VLM settings reaching ~3000 SPS.
Managed to find hyperparameters where the agent converges with a very high number of environments.
The non-VLM setting uses a CNN, while the VLM setting uses an MLP, which basically acts as a linear layer appended to the end of the VLM (see the head sketch below).
Added bitsandbytes quantisation for tinyllava (loading sketched below).
Added an LSTM network implementation in the legacy files, currently unused.
Added a lot of arguments for selecting the VLM and choosing whether or not to quantise it.
Cleaned up the codebase to make model loading and inference model-agnostic, so implementing new models should be easier (interface sketched below).
Added a basic autoencoder network, currently unused.
Made embedding extraction more meaningful by following how the current literature does it (last token with a specific prompt instead of merging the weights); see the extraction sketch below.
Generally cleaned up the codebase a ton so it's easier to read and parse through.
Fixed a lot of bugs.
There are a lot more changes too, so I won't squash the commits, just so I can read through them again, even though I technically should.
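To make the batched, last-token extraction concrete, here is a minimal sketch assuming a HuggingFace-style processor/model pair; the function name, prompt handling, and output indexing are illustrative, not the repo's actual code.

```python
import torch

@torch.no_grad()
def embed_batch(model, processor, frames, prompt, device="cuda"):
    """Encode a batch of environment frames into one embedding each, in a single forward pass."""
    # `frames` is a list of images, one per parallel environment.
    inputs = processor(
        text=[prompt] * len(frames),
        images=frames,
        padding=True,
        return_tensors="pt",
    ).to(device)
    outputs = model(**inputs, output_hidden_states=True)
    # Last hidden layer, last token position (assumes left padding so the
    # final position really is the prompt's last token).
    return outputs.hidden_states[-1][:, -1, :]  # shape: (batch, embed_dim)
```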
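The CNN-vs-MLP split amounts to roughly the following head over the frozen VLM embedding; the class name and layer sizes are illustrative rather than the exact architecture in the repo, and the CNN path for the non-VLM setting is omitted.

```python
import torch.nn as nn

class VLMPolicyHead(nn.Module):
    """Small actor-critic head over a 4096-dim VLM embedding (e.g. Idefics2)."""

    def __init__(self, embed_dim=4096, hidden_dim=512, n_actions=15):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, n_actions),
        )
        self.critic = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, embedding):
        # Logits for the policy and a scalar value estimate.
        return self.actor(embedding), self.critic(embedding)
```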
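For the bitsandbytes change, loading looks roughly like this; the helper name, auto class, and quantisation settings are assumptions on my part, but `BitsAndBytesConfig` and `from_pretrained(..., quantization_config=...)` are the standard transformers API.

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

def load_vlm(model_id: str, quantize: bool = False):
    """Load a VLM, optionally in 4-bit via bitsandbytes."""
    quant_config = (
        BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
        if quantize
        else None
    )
    return AutoModelForVision2Seq.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.float16,
        device_map="auto",
    )
```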
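And the model-agnostic clean-up can be pictured as a common encoder interface plus a small registry, so adding a model means writing one subclass; all names here are hypothetical and only meant to show the shape of the abstraction.

```python
from abc import ABC, abstractmethod

class VLMEncoder(ABC):
    """Common interface the training loop sees, regardless of which VLM sits behind it."""

    embed_dim: int  # e.g. 4096 for Idefics2

    @abstractmethod
    def load(self, quantize: bool = False) -> None:
        """Load the processor and (optionally quantised) weights."""

    @abstractmethod
    def embed(self, frames, prompt: str):
        """Return one embedding per frame from a single batched forward pass."""

# Registry mapping CLI argument values to encoder implementations,
# e.g. {"idefics2": Idefics2Encoder, "tinyllava": TinyLlavaEncoder}.
ENCODERS = {}

def make_encoder(name: str) -> VLMEncoder:
    return ENCODERS[name]()
```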