Scope out a path for profiling OpenVLA

OpenX datasets used by OpenVLA and MultiNet v0 comparison:

OpenVLA notes:

DROID was included at a 10% mixture weight, but they removed it for the final third of training of the final model because its action token accuracy stayed low
The training data only contains manipulation datasets that have at least one 3rd-person camera and use single-arm end-effector control
Follows Octo's mixture weights: up-weights datasets with larger task and scene diversity and down-weights less diverse datasets
Input: a 224 x 224 px image and a text instruction. Output: an action in a 7D action space

I think the next step would be finding appropriate datasets from the MultiNet v0 OpenX set for a simple eval on OpenVLA
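
A rough sketch of how that screening could go, using the criteria from the notes above (manipulation, at least one 3rd-person camera, single-arm end-effector control, 7D actions). Everything here is hypothetical: the `DatasetInfo` fields, the control-mode strings, and the example entries are made up for illustration and are not real MultiNet or OpenX metadata. The real metadata would have to be filled in by hand from the dataset cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical, hand-filled metadata for one candidate dataset."""
    name: str
    is_manipulation: bool
    num_third_person_cameras: int
    control: str      # assumed labels, e.g. "single_arm_eef", "bimanual"
    action_dim: int


def is_openvla_compatible(d: DatasetInfo) -> bool:
    # Mirrors the criteria in the notes: manipulation data, at least one
    # 3rd-person camera, single-arm end-effector control, 7D actions.
    return (
        d.is_manipulation
        and d.num_third_person_cameras >= 1
        and d.control == "single_arm_eef"
        and d.action_dim == 7
    )


def select_eval_candidates(datasets):
    """Return the names of datasets that pass every criterion."""
    return [d.name for d in datasets if is_openvla_compatible(d)]


# Placeholder entries only; replace with real dataset metadata.
candidates = [
    DatasetInfo("example_tabletop", True, 1, "single_arm_eef", 7),
    DatasetInfo("example_bimanual", True, 2, "bimanual", 14),
    DatasetInfo("example_wrist_only", True, 0, "single_arm_eef", 7),
    DatasetInfo("example_navigation", False, 1, "base_velocity", 2),
]

selected = select_eval_candidates(candidates)
print(selected)
```

Datasets that fail only one criterion (for example, wrist camera only) could still be worth listing separately as out-of-distribution eval candidates.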

ManifoldRG / MultiNet