Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Implement DINO strategy for learning. #203

Closed by brunosan 1 month ago

brunosan commented 3 months ago

This PR changes the learning method (we do not change the architecture or outputs) from using the MAE (Masked Autoencoder) to the DINO (Distillation with No Labels) approach.

Background on MAE: MAE masks a large portion of the input patches (typically 75%) and trains the model to reconstruct the missing ones. This forces the model to learn representations from the context supplied by the unmasked patches, using a transformer encoder to produce detailed embeddings for each patch. A known limitation: when a unique feature is confined to a single masked patch, the surrounding context may carry too little signal for the model to infer its presence.
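To make the masking mechanics concrete, here is a minimal NumPy sketch (not the Clay codebase; the names `random_masking` and `mae_loss` are illustrative) of the two pieces that define MAE pre-training: hiding 75% of the patches, and scoring the reconstruction only on the hidden ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_masking(patches: np.ndarray, mask_ratio: float = 0.75):
    """Shuffle patch indices and keep only (1 - mask_ratio) of them,
    mirroring the MAE pre-training setup. Returns the visible patches,
    their indices, and a boolean mask where True marks a hidden patch."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

def mae_loss(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared error computed only on the masked patches --
    the visible patches do not contribute to the loss."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

patches = rng.normal(size=(16, 32))          # 16 patches, 32-dim each
visible, keep_idx, mask = random_masking(patches)
print(visible.shape)   # (4, 32): only 25% of patches reach the encoder
print(int(mask.sum())) # 12 masked patches drive the reconstruction loss
```

The key point the paragraph makes is visible here: the encoder only ever sees the 25% of visible patches, so any feature isolated inside a masked patch can only be recovered from its neighbors.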

DINO: DINO shifts the focus from reconstruction to a student-teacher framework (two copies of the model running in parallel). The "student" learns to match the output of the "teacher", whose weights are an exponential moving average of the student's past weights. Because the loss is computed on full (differently augmented) views of the input rather than on missing parts, the method emphasizes learning from the entirety of the data, aiming to refine the model's understanding and representation capabilities.
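A minimal NumPy sketch of the two ingredients described above, assuming the standard DINO recipe (the function names and the temperature/momentum values are illustrative defaults, not Clay's configuration): the teacher is an exponential moving average of the student, and the loss is a cross-entropy between the sharpened, centered teacher distribution and the student distribution.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher weights are an exponential moving average of the student's
    past weights -- there is no gradient step on the teacher."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

def dino_loss(student_logits, teacher_logits, center,
              tau_student=0.1, tau_teacher=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution; no labels are involved."""
    t = softmax((teacher_logits - center) / tau_teacher)
    s = log_softmax(student_logits / tau_student)
    return float(-(t * s).sum(axis=-1).mean())

rng = np.random.default_rng(0)
student = rng.normal(size=(8, 64))   # 8 views, 64-dim projection output
teacher = rng.normal(size=(8, 64))
center = teacher.mean(axis=0)        # running center helps prevent collapse
loss = dino_loss(student, teacher, center)
```

Note the asymmetry: only the student receives gradients, while the teacher is updated purely through `ema_update`, and centering plus the lower teacher temperature are what keep the two networks from collapsing to a constant output.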

Key Differences and Advantages:

Patch-Level Embeddings: Both MAE and DINO generate detailed embeddings at the patch level, but DINO can capture more nuanced patterns within and around each patch, informed by the teacher's smoothed history of the student's past iterations.

DINO downsides:

- Two copies of the model must be kept in memory and run in parallel, increasing compute and memory cost per step relative to a single-model setup.
- With no reconstruction target, training can collapse to a trivial constant output; avoiding this depends on careful centering and temperature (sharpening) settings for the teacher.

Currently running a small experiment over Bali with DINO; I'll then do the same with MAE and compare the runs.

brunosan commented 3 months ago
[Screenshot of training metrics, 2024-04-03]

Promising training.