Add croco visual encoder and ImageNav scripts

This PR adds CroCo as a visual encoder. The original CrocoNet was a single model that combined the encoder, the decoder, and the masking logic. Here the model is split into separate components so that goal embeddings can be cached, and the masking logic is removed because we will not be pretraining.

This PR also adds the new binocular encoder that is needed to produce goal+obs embeddings.
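
For reviewers, here is a rough, framework-free sketch of the caching pattern that the split is meant to enable. It is not the code in this PR: `GoalEmbeddingCache`, `toy_encoder`, and `toy_binocular_head` are hypothetical stand-ins, and it assumes the binocular stage consumes a cached goal embedding together with the current observation embedding, which may differ from how the PR wires things up.

```python
from typing import Callable, Dict, Hashable, List

Embedding = List[float]


class GoalEmbeddingCache:
    """Encode each goal image once and reuse the result (hypothetical helper)."""

    def __init__(self, encoder: Callable[[List[float]], Embedding]) -> None:
        self._encoder = encoder
        self._cache: Dict[Hashable, Embedding] = {}
        self.encoder_calls = 0  # counts goal encodings that were not cache hits

    def get(self, goal_id: Hashable, goal_image: List[float]) -> Embedding:
        # Only run the encoder the first time a goal is seen.
        if goal_id not in self._cache:
            self._cache[goal_id] = self._encoder(goal_image)
            self.encoder_calls += 1
        return self._cache[goal_id]


def toy_encoder(image: List[float]) -> Embedding:
    # Stand-in for the standalone encoder: mean and max of the "pixels".
    return [sum(image) / len(image), max(image)]


def toy_binocular_head(goal_emb: Embedding, obs_emb: Embedding) -> Embedding:
    # Stand-in for the binocular (goal+obs) stage: plain concatenation.
    return goal_emb + obs_emb


cache = GoalEmbeddingCache(toy_encoder)
goal_image = [0.0, 0.5, 1.0]
observations = [[0.1, 0.2, 0.3], [0.4, 0.4, 0.4]]

# The goal is looked up at every step but encoded only once per episode.
joint = [
    toy_binocular_head(cache.get("episode-0", goal_image), toy_encoder(obs))
    for obs in observations
]
```

The point of the sketch is only that the goal branch runs once per episode while the observation branch runs every step, which is not possible when the encoder, decoder, and masking live in one combined forward pass.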

To-do:

[ ] Test goal caching
[ ] Test cached sensor
[ ] Update the existing policy, or write a new one, to use the binocular encoder embeddings instead of goal embeddings
[ ] Test the policy

Ram81/goat-bench#1