YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

how to use my own dataset #115

Open wlssyuu opened 12 months ago

wlssyuu commented 12 months ago

I tried to follow the README to train on my own dataset, but I could not get it to work. If it's not too much trouble, could you let me know how to use my own dataset? My JSON file is structured like this, with 62 classes of data.

{
    "data_root": "root",
    "labels": [
        {
            "file": "move_f82f063c-2fe7-43ba-afcc-626fe0266a90_0.wav",
            "augment": [],
            "noise": {
                "type": "TV",
                "intensity": 20
            },
            "age": 63,
            "gender": "M"
        },
        {
            "file": "night_bd714e10-c0f0-40e5-839f-943e29457b25_2_sl.wav",
            "augment": [
                "slow",
                "low"
            ],
            "noise": {
                "type": "caffe1",
                "intensity": 10
            },
            "age": 69,
            "gender": "M"
        },
        {
            "file": "fire_59ac1e09-b394-41a6-a134-ef7f6a95d9cc_1_l.wav",
            "augment": [
                "low"
            ],
            "noise": {
                "type": "inside_vehicle",
                "intensity": 20
            },
            "age": 64,
            "gender": "M"
        },
        {
            "file": "phone_bdd29e8d-0d65-4c72-96d3-cb438adad241_0_sh.wav",
            "augment": [
                "slow",
                "high"
            ],
            "noise": {
                "type": "TV",
                "intensity": 10
            },
            "age": 66,
            "gender": "M"
        }
    ]
}

p4vlos commented 10 months ago

Hi @wlssyuu,

According to the documentation, your JSON file should have the same structure as @YuanGongND's JSON file.

Here is the author's JSON file example:

 # This is just a sample; if you only use audio, the 'video_id' and 'image' entries are not necessary.
 {
    "data": [
        {
            "video_id": "--4gqARaEJE",
            "wav": "/data/sls/audioset/data/audio/eval/_/_/--4gqARaEJE_0.000.flac",
            "image": "/data/sls/audioset/data/images/eval/_/_/--4gqARaEJE_5.000.jpg",
            "labels": "/m/068hy,/m/07q6cd_,/m/0bt9lr,/m/0jbk"
        },
        {
            "video_id": "--BfvyPmVMo",
            "wav": "/data/sls/audioset/data/audio/eval/_/_/--BfvyPmVMo_20.000.flac",
            "image": "/data/sls/audioset/data/images/eval/_/_/--BfvyPmVMo_25.000.jpg",
            "labels": "/m/03l9g"
        },
        {
            "video_id": "--U7joUcTCo",
            "wav": "/data/sls/audioset/data/audio/eval/_/_/--U7joUcTCo_0.000.flac",
            "image": "/data/sls/audioset/data/images/eval/_/_/--U7joUcTCo_5.000.jpg",
            "labels": "/m/01b_21"
        },
        {
            "video_id": "--i-y1v8Hy8",
            "wav": "/data/sls/audioset/data/audio/eval/_/_/--i-y1v8Hy8_0.000.flac",
            "image": "/data/sls/audioset/data/images/eval/_/_/--i-y1v8Hy8_4.500.jpg",
            "labels": "/m/04rlf,/m/09x0r,/t/dd00004,/t/dd00005"
        },
        {
            "video_id": "-0BIyqJj9ZU",
            "wav": "/data/sls/audioset/data/audio/eval/_/0/-0BIyqJj9ZU_30.000.flac",
            "image": "/data/sls/audioset/data/images/eval/_/0/-0BIyqJj9ZU_35.000.jpg",
            "labels": "/m/07rgt08,/m/07sq110,/t/dd00001"
        }
    ]
}
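
For the JSON posted in the question, the conversion could look roughly like the sketch below. It is not code from the AST repository. It assumes the class name is the part of each file name before the first underscore (for example "move" or "night"), which the thread does not confirm, and the helper names `label_from_filename`, `convert`, and `label_rows` are made up for illustration. The repository's data loader also appears to expect a label CSV that maps each label string to an index; the `(index, mid, display_name)` row layout used here follows the AudioSet-style file and should be checked against the README.

```python
import json
import posixpath


def label_from_filename(filename):
    # ASSUMPTION: the class name is the text before the first underscore,
    # e.g. "move_f82f063c-..._0.wav" -> "move". Adjust to your naming scheme.
    return posixpath.basename(filename).split("_", 1)[0]


def convert(source):
    """Turn {"data_root": ..., "labels": [{"file": ...}, ...]}
    into {"data": [{"wav": ..., "labels": ...}, ...]}."""
    root = source["data_root"]
    data = [
        {
            "wav": posixpath.join(root, item["file"]),
            "labels": label_from_filename(item["file"]),
        }
        for item in source["labels"]
    ]
    return {"data": data}


def label_rows(converted):
    """Build hypothetical (index, mid, display_name) rows, one per distinct
    label, sorted by name so the indices are reproducible."""
    names = sorted({entry["labels"] for entry in converted["data"]})
    return [(index, name, name) for index, name in enumerate(names)]


if __name__ == "__main__":
    source = {
        "data_root": "root",
        "labels": [
            {"file": "move_f82f063c-2fe7-43ba-afcc-626fe0266a90_0.wav"},
            {"file": "night_bd714e10-c0f0-40e5-839f-943e29457b25_2_sl.wav"},
        ],
    }
    converted = convert(source)
    print(json.dumps(converted, indent=4))
    for row in label_rows(converted):
        print(row)
```

The extra fields in the original file ("augment", "noise", "age", "gender") are simply dropped here, since the sample above only uses "wav" and "labels" for audio-only training.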

YuanGongND commented 9 months ago

@p4vlos thanks so much for the clarification!

moon-aver commented 5 months ago

@wlssyuu Hello. Have you solved this problem and managed to run the AST code on your own dataset? I would also like to use the dataset structure described in your JSON. Can you share which dataset you used?