Kainmueller-Lab / plankton-dinov2
PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0 · 1 star · 5 forks
Product backlog #1
Open · JLrumberger opened 11 months ago

JLrumberger commented 11 months ago
General
[x] Prepare scaling plots by the end of February. Y-axis: the speedup we get when running one epoch through the model on 2, 4, 6, 8, and 10 GPUs
[x] Find out how many samples we have in the plankton dataset (Lorenz): 3,423,255 samples
[x] Implement NaViT to enable training with different resolutions (no padding)
[x] test restarting training from ckpt (with or without optimizer state?)
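For the checkpoint-restart item, a minimal save/restore sketch; the file name `ckpt.pth` and the tiny model are illustrative, not this repo's code. Restoring the optimizer state keeps Adam's moment estimates across the restart, whereas skipping it restarts with fresh optimizer statistics:

```python
import torch
import torch.nn as nn

# Illustrative model/optimizer; not this repo's actual training objects.
model = nn.Linear(4, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Save model + optimizer state so training can resume exactly.
torch.save({"model": model.state_dict(),
            "optimizer": opt.state_dict(),
            "epoch": 5}, "ckpt.pth")

# Restore: with the optimizer state, Adam's moment estimates survive the
# restart; without it, the optimizer warms up from scratch.
ckpt = torch.load("ckpt.pth")
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["optimizer"])
start_epoch = ckpt["epoch"] + 1
```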
Speed
[x] Improve GPU Utilization
[ ] Turn benchmark notebook into Python script and run benchmark on cluster (Lucas)
[x] Improve data loading / prefetching / augmentation library / data format (LMDB)
[x] If enough CPU power is available: move augmentations back to CPU to save GPU memory
[ ] Try ffcv (Jerome)
[x] check if we can change crop size if we don’t use pretrained models
[x] Change the dataloader from a Python for-loop to pandas DataFrame concatenation
[x] Benchmark DDP vs FSDP (DDP slightly faster but not crucial)
[x] Investigate slowdown with more data (i.e., 1/5 of the dataset runs faster than the full dataset)
[x] Slice Input to use only one dimension (Nora)
[x] Use torch Profiler (may need PyTorch 2.1 because of a bug in 2.0.0)
[x] Cache dataset to RAM (Jerome)
[x] Investigate FSDP chunk number (Jerome)
[x] Check if a non-power-of-two batch size slows down training (would be useful for optimizing GPU memory usage)
[x] Merge Lorenz's pandas loop
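For the torch Profiler item above, a minimal usage sketch (the toy model and sizes are illustrative) that reports where CPU time goes, which is how dataloading vs. compute bottlenecks show up:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy model standing in for the real training step.
model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)

# Profile a few forward passes on CPU (add ProfilerActivity.CUDA on GPU).
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        model(x)

# Summarize ops sorted by total CPU time to spot bottlenecks.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```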
Evaluation
[ ] Train a fully supervised model (ViT, CNN) to get an upper bound
[x] Evaluate (k-NN) using checkpoints from models trained on plankton data vs. ImageNet-pretrained models
[ ] linear evaluation (Jerome)
[x] Add confusion matrix (Nora)
[ ] Look at wrongly classified images (Nora)
[ ] Clustering of embeddings (look at shapes; use https://biigle.de/ ?)
[ ] visualization of attention maps / intermediate features (Lucas is interested in looking at that)
[ ] Unsupervised clustering algo
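The k-NN evaluation item above can be sketched as follows; the toy Gaussian clusters stand in for frozen embeddings and are not plankton features. Each query is classified by a majority vote over its k nearest training embeddings:

```python
import numpy as np

# Toy "embeddings": two well-separated Gaussian clusters, not real features.
rng = np.random.default_rng(0)
train_emb = np.vstack([rng.normal(0, 0.1, (20, 8)),
                       rng.normal(1, 0.1, (20, 8))])
train_lab = np.array([0] * 20 + [1] * 20)
query = np.vstack([rng.normal(0, 0.1, (5, 8)),
                   rng.normal(1, 0.1, (5, 8))])

def knn_predict(q, k=5):
    d = np.linalg.norm(train_emb - q, axis=1)   # L2 distance to all train points
    nearest = train_lab[np.argsort(d)[:k]]      # labels of the k nearest
    return np.bincount(nearest).argmax()        # majority vote

preds = np.array([knn_predict(q) for q in query])
acc = (preds == np.array([0] * 5 + [1] * 5)).mean()
```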
Misc
[ ] Try a masked autoencoder instead of DINOv2 and compare its performance against DINOv2
[ ] Add NaViT rescaling of positional embeddings (PE) and masking of padded tokens, or only rescaling of the PE in the style of timm
[ ] Add Ray hyperparameter tuning
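The PE rescaling mentioned above can be sketched with bicubic interpolation over the embedding grid, similar in spirit to timm's approach; all sizes here are illustrative and the class token is omitted for simplicity:

```python
import torch
import torch.nn.functional as F

# Illustrative grid sizes and embedding dim; not this repo's actual values.
old_grid, new_grid, dim = 14, 16, 384
pos = torch.randn(1, old_grid * old_grid, dim)  # (1, N, D), no class token

# (1, N, D) -> (1, D, H, W) so the embedding can be interpolated spatially.
pos_2d = pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
pos_2d = F.interpolate(pos_2d, size=(new_grid, new_grid),
                       mode="bicubic", align_corners=False)

# Back to the (1, N', D) token layout the ViT expects.
pos_new = pos_2d.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
```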
NoraKrbr commented 11 months ago
Tasks
We should try to make only minimal code changes!
First Group
[x] Dataset pre-processing: create np files, make all images the same size, and fill with padding (@JLrumberger)
[x] Build Dataset class for WHOI (kwargs should include options to provide multiple paths)
[x] Adapt the make_dataset function (in loaders)
[x] Adapt where make_dataset is called in train.py (e.g. if it's not a string, concatenate datasets)
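The "if it's not a string, concatenate datasets" idea might look like the sketch below; the `WHOIDataset` stub and paths are placeholders, not the repo's real loader code:

```python
from torch.utils.data import ConcatDataset, Dataset

# Placeholder dataset class; the real WHOI Dataset lives in this repo's loaders.
class WHOIDataset(Dataset):
    def __init__(self, path):
        self.samples = [f"{path}/img_{i}.npy" for i in range(3)]  # fake listing
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]

def make_dataset(paths):
    # Single path -> single dataset; list of paths -> concatenated datasets.
    if isinstance(paths, str):
        return WHOIDataset(paths)
    return ConcatDataset([WHOIDataset(p) for p in paths])

ds = make_dataset(["/data/whoi_a", "/data/whoi_b"])
```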
Second Group
[x] download mini subset of ImageNet (@jeromel05)
[ ] Try the dinov2 repo on ImageNet without changes
[x] Adapt run/train/train.py to torchrun or Lightning for GPU parallelization (@jeromel05)
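The torchrun adaptation mostly means initializing a process group from the environment variables torchrun sets per process (e.g. launched with `torchrun --nproc_per_node=4 run/train/train.py`). This sketch fakes a single-process world so it runs standalone; the env-var defaults are only for that purpose:

```python
import os
import torch.distributed as dist

# torchrun normally sets these per process; we fake a 1-process world here.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# "gloo" works on CPU; use "nccl" for multi-GPU training.
dist.init_process_group(backend="gloo")
rank = dist.get_rank()
world_size = dist.get_world_size()
# The model would be wrapped here: DistributedDataParallel(model, ...)
dist.destroy_process_group()
```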
Other
[x] Update the class means in data/transforms.py (the ImageNet means are hardcoded)
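Computing the dataset's own normalization statistics (assuming the hardcoded values are per-channel means/stds) might look like the following; the random array stands in for the real image stack:

```python
import numpy as np

# Placeholder for the real dataset: (N, H, W, C) images scaled to [0, 1].
rng = np.random.default_rng(0)
images = rng.random((10, 8, 8, 3))

# Per-channel statistics over all images and pixels, to replace the
# hardcoded ImageNet values in the transforms.
mean = images.mean(axis=(0, 1, 2))
std = images.std(axis=(0, 1, 2))
```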