Closed — bagustris closed this issue 2 weeks ago
Cheers @bagustris, makes a lot of sense to me.
So that would require some formal data.description per database folder? E.g. in JSON or Python directly? Or perhaps YAML (I am currently looking into Hydra)?
@felixbur
Any solution would be fine for the first attempt. The `data` directory in our nkululeko reflects `recipe` or `egs` (in the style of Kaldi and ESPnet). It will grow larger and larger, so I think it needs automatic processing to update itself (or is there no need?).
I am just motivated by the script below:
https://github.com/SuperKogito/SER-datasets/blob/master/src/generate_files.py
Then, we could use a GitHub Action to run it automatically and update the output on every push.
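A minimal workflow sketch for that idea. The file path, trigger paths, and the generator script location are all assumptions, not part of the current repo:

```yaml
# .github/workflows/update-data-readme.yml (hypothetical path)
name: Update data README
on:
  push:
    paths:
      - "data/**"  # only rerun when dataset folders change (assumption)
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # assumed location of the generator script
      - run: python src/generate_files.py
      - name: Commit updated README
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add data/README.md
          git commit -m "Auto-update data README" || echo "nothing to update"
          git push
```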
Maybe Hydra provides a better solution (I have never used it)?
On it.
Done so far, please check @bagustris. Some DBs were not found; I commented them out in the "make_description.ipynb" list. Not yet automatic.
It works!
Indeed, using your code, we can simplify it to one descr.yml file. Although not automatic, it is simpler than the current state. Example of a single descr.yml:
```yaml
# MSP-IMPROV Dataset
MSPIMPROV:
  - name: msp-improv
  - target: emotion,VAD,naturalness
  - descr: English
  - access: restricted

# ShEMO Dataset
SHEMO:
  - name: shemo
  - target: emotion
  - descr: Persian
  - access: public

# ESD Dataset
ESD:
  - name: esd
  - target: emotion
  - descr: English,Chinese
  - access: public
```
The README under the data directory is currently created manually to summarize dataset information. It could be generated automatically using Python. Input: the README.md files from each dataset directory. Output: a README.md under the data directory summarizing all supported datasets.
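A minimal sketch of that aggregation step, assuming a layout of `data/<dataset>/README.md` and taking each README's first line as the short description (both are assumptions about the repo layout):

```python
from pathlib import Path

def summarize_datasets(data_dir: Path) -> str:
    """Collect the first line of each dataset's README.md into one summary.

    Assumed layout: data/<dataset>/README.md -> summary for data/README.md
    """
    lines = ["# Supported datasets", ""]
    for readme in sorted(data_dir.glob("*/README.md")):
        # directory name as dataset name, first README line as description
        first_line = readme.read_text(encoding="utf-8").splitlines()[0]
        lines.append(f"- **{readme.parent.name}**: {first_line.lstrip('# ')}")
    return "\n".join(lines) + "\n"

# Usage (paths are illustrative):
# (Path("data") / "README.md").write_text(summarize_datasets(Path("data")))
```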
Options for headers/columns:
- name: dataset/directory name
- target: labels
- access: public, private, or restricted
- content: short description
Optional columns:
- number of utterances: count the number of lines from the train, test, and dev CSV files (I don't know how to do this for audformat)
- languages:
- UA:
- WA:
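The utterance count could be sketched like this, assuming each split is a CSV file with one header row and the split file names are train.csv/dev.csv/test.csv (both assumptions; this does not cover audformat):

```python
import csv
from pathlib import Path

def count_utterances(dataset_dir: Path) -> int:
    """Sum data rows (excluding the header) across the split CSV files."""
    total = 0
    for split in ("train.csv", "dev.csv", "test.csv"):
        csv_path = dataset_dir / split
        if not csv_path.exists():
            continue  # some datasets may not ship every split
        with csv_path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        total += max(len(rows) - 1, 0)  # subtract the header row
    return total
```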