facebookresearch / sound-spaces

A first-of-its-kind acoustic simulation platform for audio-visual embodied AI research. It supports training and evaluating multiple tasks and applications.
https://soundspaces.org
Creative Commons Attribution 4.0 International

Generating Fig.5 and Fig.6 #67

Open ly-zhu opened 2 years ago

ly-zhu commented 2 years ago

Hi @ChanganVR,

Thanks for sharing the awesome work!

Regarding the paper "SoundSpaces: Audio-Visual Navigation in 3D Environments", I'm interested in looking more closely at the visualisation analysis, e.g. the t-SNE plot in Fig 5 and the "impact of each modality on action selection" in Fig 6. Based on the description in the paper, I have a few questions about how to reproduce these figures.

For example, in Fig 5, are the t-SNE samples accumulated over each agent step of all episodes in the test set? Is the "distance in meters" the relative distance between the agent's location at each step and the target goal across all episodes, so the goal always changes? Similarly for the angle?

For Fig 6, "ablate each modality in turn by replacing it with the average training sample value": do you compute an average feature vector over the encoded audio/RGB feature vectors from all the training samples? Are these samples drawn from all agent steps (or all navigable points) across all episodes? And how is the absolute difference of the logarithmic action probability calculated?

Could you also share the related script? Even an unpolished version would be really helpful.

Best, Lingyu

ChanganVR commented 2 years ago

Hi @ly-zhu,

In Fig 5, the t-SNE is computed over about 10k randomly sampled source-location pairs across all environments with the trained model, so it is not accumulated over time. The "distance in meters" metric is the distance to the sampled sound source location.
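A rough sketch of this kind of analysis, assuming the features are the trained agent's fused audio-visual embeddings (random placeholders stand in here; the actual feature extraction is model-specific and all dimensions below are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Placeholder for the policy's state embeddings: in practice these would be
# the trained agent's fused audio-visual features for ~10k randomly sampled
# source-location pairs across all environments.
n_samples, feat_dim = 1000, 128
features = rng.normal(size=(n_samples, feat_dim)).astype(np.float32)

# Per-sample metadata used for coloring: distance (in meters) from the
# agent's position to the sampled sound source location.
dist_to_source = rng.uniform(1.0, 20.0, size=n_samples)

# Project the high-dimensional features to 2D with t-SNE.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(features)  # shape: (n_samples, 2)

# embedding can then be scatter-plotted (e.g. with matplotlib) using
# c=dist_to_source to reproduce a Fig.5-style colored t-SNE map.
```

The same embedding can be re-colored by angle to the source instead of distance to produce the companion panel.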

For Fig.6, if I remember correctly, I computed the average values of RGB/audio across training samples. Again, these are randomly sampled from training environments rather than taken from collected episodes. For the absolute difference of log probability, I first take the log of the action probability with the real RGB input, then compute the absolute difference with its counterpart (with the average RGB). I do the same for audio and normalize the two scores. This way, the modality with greater impact has a larger difference and a larger final score.
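A minimal sketch of this ablation procedure, with a random linear head standing in for the trained actor network (the network, feature dimensions, and sample counts below are all placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# Hypothetical stand-in for the trained policy head: maps concatenated
# RGB + audio features to action probabilities. In practice this would be
# the agent's actor network.
n_actions, rgb_dim, audio_dim = 4, 64, 32
W = rng.normal(size=(rgb_dim + audio_dim, n_actions))

def action_probs(rgb_feat, audio_feat):
    return softmax(np.concatenate([rgb_feat, audio_feat]) @ W)

# Average feature values over randomly sampled training observations
# (not over collected episodes).
train_rgb = rng.normal(size=(5000, rgb_dim))
train_audio = rng.normal(size=(5000, audio_dim))
avg_rgb, avg_audio = train_rgb.mean(axis=0), train_audio.mean(axis=0)

# For one observation, ablate each modality in turn by substituting the
# average feature, and compare log action probabilities.
rgb, audio = rng.normal(size=rgb_dim), rng.normal(size=audio_dim)
log_p = np.log(action_probs(rgb, audio))
d_rgb = np.abs(log_p - np.log(action_probs(avg_rgb, audio))).sum()
d_audio = np.abs(log_p - np.log(action_probs(rgb, avg_audio))).sum()

# Normalize so the two impact scores sum to 1; the modality with the
# larger score influenced action selection more for this observation.
total = d_rgb + d_audio
impact_rgb, impact_audio = d_rgb / total, d_audio / total
```

Averaging `impact_rgb` / `impact_audio` over many sampled observations would then give per-modality bars in the style of Fig.6.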

I'd like to share my script but I couldn't find it since I wrote it almost two years ago. If you follow my descriptions above, you should be able to obtain a similar result.

Let me know what else is needed from me.

Thanks, Changan