NVlabs / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

Chamfer distance's data source #37

Closed · threegold116 closed this 2 months ago

threegold116 commented 2 months ago

In the paper "VILA: On Pre-training for Visual Language Models", the section on "The deep embedding alignment hypothesis" uses the Chamfer distance, which is interesting and useful. I would like to know how it is calculated, and what the image source and text source are. Thank you very much!

tonylins commented 2 months ago

Hi, thanks for your interest in our work!

Example code to measure the Chamfer distance (in cosine space) is:

import torch

x = torch.randn(32, 128)  # N, D
y = torch.randn(32, 128)  # N, D
x = x / torch.norm(x, dim=-1, keepdim=True)  # L2-normalize each row of x
y = y / torch.norm(y, dim=-1, keepdim=True)  # L2-normalize each row of y
sim = x @ y.T  # pairwise cosine similarity, shape (N, N)
dist = 0.5 * (sim.amax(1).mean() + sim.amax(0).mean())  # symmetric Chamfer similarity

We used a held-out set from the training mix to measure the distance.
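For reference, here is a minimal sketch of how that computation could be wrapped as a reusable function; the tensors `image_emb` and `text_emb` are hypothetical placeholders standing in for pooled visual and text embeddings from a held-out batch, not the actual VILA evaluation code.

import torch
import torch.nn.functional as F

def chamfer_cosine(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer similarity in cosine space between two embedding sets.

    x: (N, D) embeddings from one modality
    y: (M, D) embeddings from the other modality
    """
    x = F.normalize(x, dim=-1)  # unit-norm rows
    y = F.normalize(y, dim=-1)
    sim = x @ y.T               # (N, M) pairwise cosine similarity
    # For each embedding, take its best match in the other set, then average both directions.
    return 0.5 * (sim.amax(dim=1).mean() + sim.amax(dim=0).mean())

# Hypothetical usage on held-out embeddings:
image_emb = torch.randn(64, 4096)  # e.g. projected visual embeddings, one per image
text_emb = torch.randn(64, 4096)   # e.g. corresponding text embeddings
print(chamfer_cosine(image_emb, text_emb).item())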

threegold116 commented 2 months ago

Thank you very much! I understand the process.