Support for vision models such as ViT

Hi,

Thanks for this wonderful model-merging codebase. I'd like to use it to merge vision models, specifically models that share nearly the same architecture but were trained with different objectives, such as CLIP ViT and DINOv2 ViT. I'd like to contribute this support myself, so I plan to write the code for it. However, reading through the codebase, I realized how complex it is; the many layers of encapsulation make it daunting for me to contribute anything sensible. Could you please advise me on what changes would be needed to accommodate this use case?

Thanks a lot!

arcee-ai/mergekit, issue #423