SAGNIKMJR / move2hear-active-AV-separation

Code and datasets for 'Move2Hear: Active Audio-Visual Source Separation' (ICCV 2021)
MIT License

Understanding AAViSS specific dataset splits #5

Closed sreeharshaparuchur1 closed 1 year ago

sreeharshaparuchur1 commented 1 year ago

Hi @SAGNIKMJR,

I have several questions about the AAViSS dataset splits:

Thank you

SAGNIKMJR commented 1 year ago
  1. Yes, we evaluate the effect of inter-source distance; see the analysis in Supp. Sec. 7.3.
  2, 3. The dataset splits are available here: https://utexas.box.com/shared/static/vwrkm3kn06pobf8z6g3q3zom5ybei8oq.zip
  4. 'all_geodesic_distances' gives the geodesic distance between the agent and each of the sources, as well as the inter-source geodesic distance. -1 denotes the agent, 0 the first source, and 1 the second source. Hence, (-1, 0) is agent-to-source 1, (-1, 1) is agent-to-source 2, and (0, 1) / (1, 0) is the inter-source distance (see the sketch after this list).
  5. It's a redundant field; that's why it's set to null.
  6. 'num_action' is a redundant field. 'start_idx' denotes the index in an audio clip at which the episode starts sampling the monaural audio; it's irrelevant in this setting because the monaural audio is always sampled from the start of the clip, and hence it's set to 0. 'target_label' is the index of the target audio class for the episode.
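
For concreteness, here is a minimal sketch of how these fields could be read from one episode entry in a split JSON. The exact file layout isn't specified in this thread, so the top-level 'episodes' list and the distance keys being stored as strings like "(-1, 0)" are assumptions rather than the confirmed format.

```python
import json

# Hypothetical path: one of the split files from the zip linked above.
SPLIT_PATH = "test_nearTarget_3Sources.json"

with open(SPLIT_PATH) as f:
    split = json.load(f)

# Assumption: the split stores a list of episode dicts under "episodes".
episode = split["episodes"][0]

# Convention from the answer above: -1 = agent, 0 = first source, 1 = second source.
# Assumption: the pair keys are serialized as strings such as "(-1, 0)".
dists = episode["all_geodesic_distances"]
agent_to_source1 = dists["(-1, 0)"]
agent_to_source2 = dists["(-1, 1)"]
inter_source = dists.get("(0, 1)", dists.get("(1, 0)"))

print("agent -> source 1:", agent_to_source1)
print("agent -> source 2:", agent_to_source2)
print("source 1 <-> source 2:", inter_source)

# 'target_label' indexes the target audio class; 'start_idx' stays 0 here
# because the monaural audio is always sampled from the start of the clip.
print("target class:", episode["target_label"], "| start_idx:", episode["start_idx"])
```
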
sreeharshaparuchur1 commented 1 year ago

@SAGNIKMJR

In the dataset splits that you've shared, in the 'test_nearTarget_3Sources.json' file, the 'all_geodesic_distances' field is never 0 for the (-1, 0) configuration. This is unexpected, as the near-target setting involves spawning the agent at the target sound source.
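
A quick way to reproduce this check (under the same assumed episode/key layout as the sketch above) would be:

```python
import json

with open("test_nearTarget_3Sources.json") as f:
    split = json.load(f)

# Count episodes whose agent-to-source-0 geodesic distance, i.e. the (-1, 0)
# entry, is zero (assuming source 0 is the target in this split).
episodes = split["episodes"]
num_zero = sum(1 for ep in episodes if ep["all_geodesic_distances"]["(-1, 0)"] == 0)
print(f"{num_zero} / {len(episodes)} episodes have the agent spawned at source 0")
```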

Also, this split seems to exist only for testing this setting. Was the agent not explicitly trained for it? If so, why not? Wouldn't you expect the agent to perform better if it were trained for the task of separating one target source in the presence of two distractor sounds?

Thank you.

SAGNIKMJR commented 1 year ago

Thanks for pointing out the mistake. The correct datasets are available here: https://utexas.box.com/shared/static/pbcbi27hw669gpibw8ax76cb0ggvyr45.zip

We didn't train an agent for this setting because, even without retraining, we found our agent to generalize reasonably well. However, retraining can be expected to yield even better performance.