Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

Chamfer distance's data source #37

Closed by threegold116 1 month ago

threegold116 commented 1 month ago

In the section "The deep embedding alignment hypothesis" of the paper "VILA: On Pre-training for Visual Language Models", the Chamfer distance is interesting and useful. I would like to know how it is calculated, and what the image source and the text source are. Thank you very much!

tonylins commented 1 month ago

Hi, thanks for your interest in our work!

Example code to measure the Chamfer distance (cosine) is:

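import torch

x = torch.randn(32, 128)  # N, D
y = torch.randn(32, 128)  # N, D
# L2-normalize each row so that dot products are cosine similarities
x = x / x.norm(dim=1, keepdim=True)
y = y / y.norm(dim=1, keepdim=True)
sim = x @ y.T  # pairwise cosine similarity, shape (N, N)
# average the best match per row and per column, then take the mean of the two
dist = 0.5 * (sim.amax(1).mean() + sim.amax(0).mean())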

We used a held-out set from the training mix to measure the distance.
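
For readers who want to check the computation by hand, here is a minimal dependency-free sketch of the same steps (normalize each row, build the cosine similarity matrix, average the best match in both directions). The helper names `normalize` and `chamfer_cosine` and the toy vectors are hypothetical, chosen only for illustration; they are not from the VILA code base or the paper's data.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def chamfer_cosine(xs, ys):
    """Symmetric Chamfer-style score on cosine similarity.

    For each row of xs take its best cosine match in ys and average,
    do the same from ys to xs, then return the mean of the two averages.
    """
    xs = [normalize(v) for v in xs]
    ys = [normalize(v) for v in ys]
    # sim[i][j] is the cosine similarity between xs[i] and ys[j]
    sim = [[sum(a * b for a, b in zip(x, y)) for y in ys] for x in xs]
    x_to_y = sum(max(row) for row in sim) / len(xs)
    y_to_x = sum(max(col) for col in zip(*sim)) / len(ys)
    return 0.5 * (x_to_y + y_to_x)

# Hypothetical toy embeddings, not data from the paper.
x = [[1.0, 0.0], [0.0, 2.0]]
y = [[3.0, 0.0], [1.0, 1.0]]
score = chamfer_cosine(x, y)
```

Because the score is built from cosine similarity, two identical sets give 1.0 and a higher value means the sets are closer. Whether the paper reports this value directly or a transformation of it is not stated in this thread.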

threegold116 commented 1 month ago

Thank you very much! I understand the process.