Python tools for downloading, loading, and using the data for the PSST challenge.
Robert C. Gale, Mikala Fleegle, Gerasimos Fergadiotis, and Steven Bedrick. 2022. The Post-Stroke Speech Transcription (PSST) Challenge. In Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference, pages 41–55, Marseille, France. European Language Resources Association.
Robert Gale, Mikala Fleegle, Steven Bedrick, and Gerasimos Fergadiotis. 2022. Dataset and tools for the PSST Challenge on Post-Stroke Speech Transcription. March. Project funded by the National Institute on Deafness and Other Communication Disorders grant number R01DC015999-04S1.
Brian MacWhinney, Davida Fromm, Margaret Forbes, and Audrey Holland. 2011. AphasiaBank: Methods for Studying Discourse. Aphasiology, 25(11):1286–1307. Supported by NIH-NIDCD R01-DC008524 (2022-2027).
The data is hosted on TalkBank and is password-protected. To get the password and participate in the challenge, please complete this form.
The psstdata tools will prompt for these credentials upon the first download. Credentials are thereafter stored in ~/.config/psstdata/settings.json, and the data files are kept in ~/psst-data. (Tip: you can change where data is stored in settings.json.)
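If you want to check what psstdata has stored before editing it, a minimal sketch like the following can help. It assumes only the default path given above; the keys inside settings.json are not documented in this README, so the sketch lists whatever keys it finds rather than assuming any names. The helper name read_settings is hypothetical and not part of psstdata.

```python
import json
from pathlib import Path

# Default location named in this README. The file's keys are not documented
# here, so nothing below assumes any particular key name.
DEFAULT_SETTINGS = Path.home() / ".config" / "psstdata" / "settings.json"


def read_settings(path=DEFAULT_SETTINGS):
    """Return the parsed settings as a dict, or {} if the file does not exist.

    Hypothetical helper for illustration; not part of the psstdata package.
    """
    path = Path(path)
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


settings = read_settings()
# Print key names only, since the file also holds your credentials.
print("Keys found:", sorted(settings))
```

Printing only the key names keeps the stored password out of your terminal history and logs.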
If you're not using Python, or you'd like to write your own data-loading code, you can download the data set directly from TalkBank. Once you have the password, head over to our resource page at TalkBank.
Conditions for using the PSST Dataset are described on the task website.
First, please note that this package was developed for and tested with Python 3.8 (macOS and Linux), so switching to this version may serve as a workaround for some problems.
With a minimum of Python 3.? installed, psstdata can be installed using pip:
pip install psstdata # Install python helpers
python -m psstdata # Download `./psst-data` into your user directory (437MB on disk)
The python helpers include data loader tools. For more information, see Basic Usage.
The data retrieved by this tool is described in detail in each data pack's README file. A copy of those files is available in this repository for each of the train, valid, and test data packs. (These three files have only trivial differences.)
This tool also provides some additional resources to get you set up more quickly. These are referenced in the baseline systems, which you are welcome to use as an example or a jumping-off point!
Each entry below gives the Python reference, then the JSON file it corresponds to:
psstdata.VOCAB_ARPABET: psstdata/assets/vocab_arpabet.json
psstdata.VOCAB_ARPABET_JSON (the filename for the above)
psstdata.ACCEPTED_PRONUNCIATIONS: psstdata/assets/correctness.json
>>> import psstdata
>>> data = psstdata.load()
psstdata INFO: Downloading a new data version: 2022-03-02
psstdata INFO: Loaded data version 2022-03-02 from /Users/bobby/psst-data
This will download data to the default directory (~/psst-data/) and return an object of type PSSTData, containing the train, valid, and test splits:
>>> len(data.train)
2298
>>> len(data.valid)
341
>>> len(data.test)
652
Each of those splits is a PSSTUtteranceCollection, which is a collection of PSSTUtterance objects:
>>> data.train[0]
PSSTUtterance(utterance_id='ACWT02a-BNT01-house', session='ACWT02a', test='BNT', prompt='house', transcript='HH AW S', aq_index=74.6, correctness=True, filename='audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav', duration_frames=12752)
>>> data.train['ACWT02a-BNT01-house']
PSSTUtterance(utterance_id='ACWT02a-BNT01-house', session='ACWT02a', test='BNT', prompt='house', transcript='HH AW S', aq_index=74.6, correctness=True, filename='audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav', duration_frames=12752)
However, you'll basically only need four fields:
# Print the first four records in the train data
for utterance in data.train[:4]:
    # The key ingredients
    utterance_id = utterance.utterance_id
    transcript = utterance.transcript
    correctness = "Y" if utterance.correctness else "N"
    filename_absolute = utterance.filename_absolute
    print(f"{utterance_id:26s} {transcript:26s} {correctness:11s} {filename_absolute}")
"""
utterance_id               transcript                 correctness filename_absolute
ACWT02a-BNT01-house        HH AW S                    Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav
ACWT02a-BNT02-comb         K OW M                     Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT02-comb.wav
ACWT02a-BNT03-toothbrush   T UW TH B R AH SH          Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT03-toothbrush.wav
ACWT02a-BNT04-octopus      AA S AH P R OW G P UH S    N           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT04-octopus.wav
"""
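As the output above shows, each transcript is a space-separated string of ARPAbet symbols, so a plain split is enough to get at the individual phonemes. The sketch below counts phonemes and computes the fraction of correct productions. To stay runnable without the password-protected data, it uses a hypothetical Utterance stand-in holding only three of the fields shown above, filled with the four records printed there; with the real data, you could presumably pass data.train instead.

```python
from collections import Counter
from typing import NamedTuple


class Utterance(NamedTuple):
    """Hypothetical stand-in for PSSTUtterance with only the fields used here."""
    utterance_id: str
    transcript: str
    correctness: bool


# The four records printed above, copied by hand.
records = [
    Utterance("ACWT02a-BNT01-house", "HH AW S", True),
    Utterance("ACWT02a-BNT02-comb", "K OW M", True),
    Utterance("ACWT02a-BNT03-toothbrush", "T UW TH B R AH SH", True),
    Utterance("ACWT02a-BNT04-octopus", "AA S AH P R OW G P UH S", False),
]


def phoneme_counts(utterances):
    """Count how often each space-separated symbol occurs across transcripts."""
    counts = Counter()
    for utterance in utterances:
        counts.update(utterance.transcript.split())
    return counts


def fraction_correct(utterances):
    """Return the share of utterances marked correct (0.0 for an empty input)."""
    utterances = list(utterances)
    if not utterances:
        return 0.0
    return sum(u.correctness for u in utterances) / len(utterances)


counts = phoneme_counts(records)
print(f"{sum(counts.values())} phonemes, {len(counts)} distinct")
print(f"fraction correct: {fraction_correct(records):.2f}")
```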
To remove the package, use pip:
pip uninstall psstdata
You may also want to delete the data and configs. (Copy and paste rm -rf commands cautiously, of course!)
rm -rf ~/psst-data
rm -rf ~/.config/psstdata
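If you would rather not run rm -rf by hand, the same cleanup can be sketched in Python with a dry-run default. This assumes the default locations named in this README; if you changed the data directory in settings.json, adjust the list accordingly. The helper remove_dirs is hypothetical and not part of psstdata.

```python
import shutil
from pathlib import Path

# Default locations named in this README; adjust if you changed them.
DEFAULT_TARGETS = [
    Path.home() / "psst-data",
    Path.home() / ".config" / "psstdata",
]


def remove_dirs(targets, dry_run=True):
    """Delete each existing directory in targets and return the ones handled.

    With dry_run=True (the default), nothing is deleted; the function only
    reports what it would remove. Hypothetical helper for illustration.
    """
    handled = []
    for target in map(Path, targets):
        if not target.is_dir():
            continue  # Skip anything that is missing or not a directory.
        if dry_run:
            print(f"Would remove {target}")
        else:
            shutil.rmtree(target)
            print(f"Removed {target}")
        handled.append(target)
    return handled


# Dry run: lists what would be deleted without touching anything.
remove_dirs(DEFAULT_TARGETS)
```

Once the dry-run output looks right, call remove_dirs(DEFAULT_TARGETS, dry_run=False) to actually delete the directories.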