OpenNLPLab / AVSBench

[ECCV 2022] Official implementation of the paper: Audio-Visual Segmentation
Apache License 2.0

What is the difference among Single Sound Source Segmentation (S4), Multiple Sound Source Segmentation (MS3), and Audio-Visual Semantic Segmentation (AVSS)? #18

Closed Ako-r closed 1 year ago

Ako-r commented 1 year ago

Same as the title.

jasongief commented 1 year ago

Hi, thanks for your interest.

S4, MS3, and AVSS are different settings of the AVS task, which aims to segment the sounding object(s). The S4 setting studies the simpler case of a single sound source (sounding object) in the video, whereas there are multiple sounding objects in the MS3 and AVSS settings. In the S4 and MS3 settings, the sounding objects are segmented as a binary map that denotes the locations corresponding to the sound, while AVSS further requires predicting the category of each sounding object.

Please refer to our arXiv paper for more details on these settings and the proposed AVSBench dataset: https://arxiv.org/abs/2301.13190
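To make the difference between the prediction targets concrete, here is a tiny illustrative sketch (the shapes, category indices, and mask values below are assumptions for exposition, not AVSBench's actual data format): S4/MS3 predict a binary sound-or-not map per frame, while AVSS predicts a per-pixel category label.

```python
# Illustrative sketch of the per-frame prediction targets in each setting.
# All values here are made up for exposition, not AVSBench's actual format.

# S4 / MS3: a binary map -- 1 where a sounding object is, 0 elsewhere.
binary_mask = [[0, 1, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]]

# AVSS: a semantic map -- each pixel carries the category index of the
# sounding object (0 = background; e.g. 3 = "dog", 7 = "guitar" here).
semantic_mask = [[0, 3, 3, 0],
                 [0, 3, 3, 0],
                 [0, 0, 0, 7],
                 [0, 0, 0, 7]]

# A semantic mask can always be collapsed to the binary "where is sound"
# map, but not the other way around -- AVSS is the strictly harder task.
derived_binary = [[int(v > 0) for v in row] for row in semantic_mask]
```

This also shows why AVSS subsumes the other two settings: thresholding its semantic output recovers a binary segmentation, while the binary output alone cannot recover categories.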

Ako-r commented 1 year ago

Hi! Thanks for your reply.

I also want to know how to obtain the following files.

```python
cfg.DATA = edict()
cfg.DATA.ANNO_CSV = "../../avsbench_data/Single-source/s4_meta_data.csv"
cfg.DATA.DIR_IMG = "../../avsbench_data/Single-source/s4_data/visual_frames"
cfg.DATA.DIR_AUDIO_LOG_MEL = "../../avsbench_data/Single-source/s4_data/audio_log_mel"
cfg.DATA.DIR_MASK = "../../avsbench_data/Single-source/s4_data/gt_masks"
```
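For context, these fields are just an `easydict`-style config pointing at where the downloaded dataset lives on disk. A minimal sketch of how such a config can be assembled and consumed (using `types.SimpleNamespace` as a dependency-free stand-in for `easydict`; the `mask_path` helper and its directory layout are hypothetical, not from the repo):

```python
import os
from types import SimpleNamespace  # stand-in for easydict's edict

# Root of the extracted dataset -- adjust to your local layout.
DATA_ROOT = "../../avsbench_data/Single-source"

cfg = SimpleNamespace(DATA=SimpleNamespace(
    ANNO_CSV=os.path.join(DATA_ROOT, "s4_meta_data.csv"),
    DIR_IMG=os.path.join(DATA_ROOT, "s4_data", "visual_frames"),
    DIR_AUDIO_LOG_MEL=os.path.join(DATA_ROOT, "s4_data", "audio_log_mel"),
    DIR_MASK=os.path.join(DATA_ROOT, "s4_data", "gt_masks"),
))

def mask_path(cfg, video, frame_idx):
    # Hypothetical helper: resolve the ground-truth mask file for one frame.
    # The per-video subfolder and file-naming scheme are assumptions.
    return os.path.join(cfg.DATA.DIR_MASK, video, f"{video}_{frame_idx}.png")
```

The config itself carries no data; the files it points at come from downloading the AVSBench dataset and extracting it so the paths above resolve.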

Looking forward to your feedback. Thanks again.
