I would like some understanding/intuition of the model's emergent behaviour.
As I understand it, the loss simply encourages the outputs for different transformations of the same input image to remain close to each other. This can plausibly lead the model to learn foreground-background separation (as shown in DINOv1). However, DINOv2 exhibits emergent behaviour where it also learns the semantic meaning of object parts -- e.g., in Fig 1, the visualization shows the same colour gradient for the wings of various birds and planes.
What leads to the emergence of this behaviour, and how does the loss encourage it?
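For concreteness, here is a minimal NumPy sketch of the DINO-style self-distillation objective I am referring to: a cross-entropy between the teacher's centered, sharpened softmax (over one view) and the student's softmax (over another view). The function names, shapes, and default temperatures are illustrative, not taken from the DINOv2 codebase.

```python
import numpy as np

def softmax(x, tau):
    # Temperature-scaled softmax with a max-shift for numerical stability.
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_style_loss(student_logits, teacher_logits, center,
                    tau_s=0.1, tau_t=0.04):
    # Teacher target: centered and sharpened (low temperature); in the real
    # method this branch is an EMA of the student and receives no gradient.
    t = softmax(teacher_logits - center, tau_t)
    # Student prediction at a higher temperature.
    log_s = np.log(softmax(student_logits, tau_s))
    # Cross-entropy averaged over the batch: pulls the student's
    # distribution for one view toward the teacher's for another view.
    return -(t * log_s).sum(axis=-1).mean()

# Toy usage: logits for two augmented views of the same batch of images.
rng = np.random.default_rng(0)
B, K = 4, 16                      # batch size, number of prototypes
center = np.zeros(K)              # running center, zero-initialized here
view_a = rng.normal(size=(B, K))  # student view
view_b = rng.normal(size=(B, K))  # teacher view
loss = dino_style_loss(view_a, view_b, center)
```

Nothing in this objective refers to parts explicitly, which is exactly why the part-level consistency in Fig 1 reads as emergent rather than supervised.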