Create python script to automate supported dataset information

felixbur / nkululeko

Machine learning speaker characteristics

MIT License


Create python script to automate supported dataset information #127

Closed bagustris closed 2 weeks ago

bagustris commented 4 months ago

The README under the data directory is currently created manually to summarize dataset information.

It can be generated automatically using Python.

- input: README.md files from each dataset directory
- output: README.md under the data directory to summarize all supported datasets

Options for header/column:

- name: dataset/directory name
- target: labels
- access: public, private, or restricted
- content: short description

Optional:

- number of utterances: count the number of lines in the train, test, and dev CSV files. I don't know how to do this for audformat.
- languages:
- UA:
- WA:
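The optional "number of utterances" column could be filled in along these lines. This is only a minimal sketch: the function name `count_utterances`, the assumption that split CSV files carry "train", "dev", or "test" in their file names, and the assumption that each CSV starts with a header row are all hypothetical and not taken from the nkululeko code base. It does not cover audformat databases.

```python
import csv
from pathlib import Path


def count_utterances(dataset_dir, splits=("train", "dev", "test")):
    """Count data rows per split for the CSV files in one dataset directory.

    Assumptions (hypothetical, adjust to the real layout):
    - split files contain the split name in their file name,
      e.g. "demo_train.csv"
    - every CSV file has exactly one header row
    """
    counts = {}
    for csv_path in sorted(Path(dataset_dir).glob("*.csv")):
        for split in splits:
            if split in csv_path.stem:
                with open(csv_path, newline="", encoding="utf-8") as fh:
                    rows = sum(1 for _ in csv.reader(fh))
                # Subtract the assumed header row; never go below zero.
                counts[split] = counts.get(split, 0) + max(rows - 1, 0)
                break
    return counts
```

Calling `count_utterances("data/some_dataset")` would return a dictionary such as `{"train": ..., "test": ...}` that can be written into the summary table; directories without matching CSV files simply yield an empty dictionary.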

felixbur commented 4 months ago

cheers @bagustris , makes a lot of sense to me

So that would require some formal data.description per database folder? E.g. in JSON or directly in Python? Or perhaps YAML (I am currently looking into Hydra).

bagustris commented 2 weeks ago

@felixbur

Any solution would be fine for a first attempt. The data directory in nkululeko follows the recipe or egs style of Kaldi and ESPnet. It will grow larger and larger, so I think it needs automatic processing to update itself (or is there no need?)

I am just motivated by the script below,

https://github.com/SuperKogito/SER-datasets/blob/master/src/generate_files.py

Then, use a GitHub Action to run it automatically and update the README on every push.

Maybe Hydra provides a better solution (I have never used it)?

felixbur commented 2 weeks ago

on it

felixbur commented 2 weeks ago

Done so far, please check, @bagustris. Some dbs were not found; I commented them out in the list in "make_description.ipynb". It is not yet automatic.

bagustris commented 2 weeks ago

It works!

Indeed, using your code, we can simplify it to one descr.yml file. Although not automatic, it is simpler than the current state. Example of a single descr.yml:

```yaml
# MSP-IMPROV Dataset
MSPIMPROV:
- name: msp-improv
- target: emotion,VAD,naturalness
- descr: English
- access: restricted

# ShEMO Dataset
SHEMO:
- name: shemo
- target: emotion
- descr: Persian
- access: public

# ESD Dataset
ESD:
- name: esd
- target: emotion
- descr: English,Chinese
- access: public
```
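For illustration, a single descr.yml like the one above could be turned into the summary table roughly as follows. This is a sketch under assumptions, not the code from "make_description.ipynb": the function names are hypothetical, the column set simply mirrors the four keys in the example, and loading relies on the third-party PyYAML package (`yaml.safe_load`), which would need to be installed. With the layout shown, each dataset key maps to a list of single-key mappings, so the entries are merged into one dictionary per dataset first.

```python
COLUMNS = ("name", "target", "descr", "access")


def load_descr(path):
    """Parse descr.yml; assumes the third-party PyYAML package is installed."""
    import yaml  # imported lazily so the table helpers work without PyYAML

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)


def flatten(entries):
    """Merge a list of single-key mappings into one dictionary."""
    merged = {}
    for entry in entries:
        merged.update(entry)
    return merged


def to_markdown_table(descr):
    """Render the parsed description as a markdown table, one row per dataset."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "|".join(" --- " for _ in COLUMNS) + "|",
    ]
    for key in sorted(descr):
        row = flatten(descr[key])
        # Missing fields are left empty rather than raising an error.
        cells = (str(row.get(col, "")) for col in COLUMNS)
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines) + "\n"
```

A possible use would be `open("README.md", "w").write(to_markdown_table(load_descr("descr.yml")))`, where both file paths are placeholders. Such a script could then be triggered from a GitHub Action on every push, as suggested earlier in the thread.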