psstdata

Python tools for downloading, loading, and using the data for the PSST challenge.

Citing This Work

Robert C. Gale, Mikala Fleegle, Gerasimos Fergadiotis, and Steven Bedrick. 2022. The Post-Stroke Speech Transcription (PSST) Challenge. In Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference, pages 41–55, Marseille, France. European Language Resources Association.

Robert Gale, Mikala Fleegle, Steven Bedrick, and Gerasimos Fergadiotis. 2022. Dataset and tools for the PSST Challenge on Post-Stroke Speech Transcription. March. Project funded by the National Institute on Deafness and Other Communication Disorders grant number R01DC015999-04S1.

Brian MacWhinney, Davida Fromm, Margaret Forbes, and Audrey Holland. 2011. AphasiaBank: Methods for Studying Discourse. Aphasiology, 25(11):1286–1307. Supported by NIH-NIDCD R01-DC008524 (2022-2027).

Access to the data

The data is hosted on TalkBank and is password-protected. To get the password and participate in the challenge, please complete this form.

The psstdata tools will prompt for these credentials on the first download. After that, the credentials are stored in ~/.config/psstdata/settings.json, and the data files are kept in ~/psst-data. (Tip: you can change where the data is stored by editing settings.json.)
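
If you want to check what is in the settings file before editing it, a minimal sketch along these lines may help. It assumes only that settings.json holds a single JSON object; the key names are not documented here, so the helper (`summarize_settings`, a name made up for this example) reports each top-level key with its value type and never prints the values, since the file contains credentials.

```python
import json
from pathlib import Path

# Default location named in the text above.
SETTINGS_PATH = Path.home() / ".config" / "psstdata" / "settings.json"


def summarize_settings(path=SETTINGS_PATH):
    """Return {key: type name} for the top-level settings, or None if the file is missing.

    Assumption: the file contains one JSON object. Values are deliberately
    left out of the result because the file stores credentials.
    """
    path = Path(path)
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as f:
        settings = json.load(f)
    return {key: type(value).__name__ for key, value in settings.items()}


if __name__ == "__main__":
    print(summarize_settings())
```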

Just the data, please!

If you're not using Python, or you'd like to write your own data-loading code, you can download the data set directly from TalkBank. Once you have the password, head over to our resource page at TalkBank.

Usage Notes

Conditions for using the PSST Dataset are described on the task website.

Setup

First, please note that this package was developed for and tested with Python 3.8 (macOS and Linux), so switching to that version may work around some problems.

With a minimum of Python 3.? installed, psstdata can be installed using pip:

pip install psstdata  # Install the Python helpers
python -m psstdata    # Download the data into `~/psst-data` (437MB on disk)

The Python helpers include data-loading tools. For more information, see Basic Usage.
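
If you run into problems, it can be useful to confirm which interpreter you are on. The snippet below is only a sketch: it compares the running interpreter against Python 3.8, the version mentioned above, and prints a note when they differ. The helper name `matches_tested_version` is made up for this example and is not part of psstdata.

```python
import sys

TESTED_VERSION = (3, 8)  # the version this package was developed and tested with


def matches_tested_version(version_info=sys.version_info):
    """True when the interpreter's major.minor equals the tested version."""
    return tuple(version_info[:2]) == TESTED_VERSION


if not matches_tested_version():
    print(
        f"Note: running Python {sys.version_info[0]}.{sys.version_info[1]}; "
        f"psstdata was tested on {TESTED_VERSION[0]}.{TESTED_VERSION[1]}."
    )
```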

Data Packs

The data retrieved by this tool is described in detail in each data pack's README file. A copy of those files is available in this repository for each of the train, valid, and test data packs. (These three files have only trivial differences.)

Additional Resources

This tool also provides some additional resources to get you set up more quickly. These are referenced in the baseline systems, which you are welcome to use as an example or a jumping-off point!

(Key: Python reference: JSON file)

ARPAbet symbols (and integer mappings):
- psstdata.VOCAB_ARPABET: psstdata/assets/vocab_arpabet.json
- psstdata.VOCAB_ARPABET_JSON (the filename for the above)

"Correct" pronunciations for the BNT/VNT tasks:
- psstdata.ACCEPTED_PRONUNCIATIONS: psstdata/assets/correctness.json
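
To illustrate what a symbol-to-integer mapping is for, here is a small sketch that converts a space-separated ARPAbet transcript into integer IDs and back. Everything in it is an assumption made for illustration: `TOY_VOCAB` is a made-up four-entry stand-in (the real symbols and values are in vocab_arpabet.json), the `<unk>` fallback symbol is hypothetical, and the real psstdata.VOCAB_ARPABET object may be shaped differently from a plain symbol-to-integer dict.

```python
# Hypothetical stand-in for psstdata.VOCAB_ARPABET; the real symbols and
# integer values live in psstdata/assets/vocab_arpabet.json.
TOY_VOCAB = {"<unk>": 0, "HH": 1, "AW": 2, "S": 3}


def encode_transcript(transcript, vocab, unk_symbol="<unk>"):
    """Map a space-separated ARPAbet transcript to a list of integer IDs.

    Symbols missing from the vocabulary fall back to the ID of unk_symbol
    (an assumption of this sketch, not documented psstdata behavior).
    """
    unk_id = vocab[unk_symbol]
    return [vocab.get(symbol, unk_id) for symbol in transcript.split()]


def decode_ids(ids, vocab):
    """Inverse of encode_transcript for IDs present in the vocabulary."""
    id_to_symbol = {index: symbol for symbol, index in vocab.items()}
    return " ".join(id_to_symbol[i] for i in ids)


ids = encode_transcript("HH AW S", TOY_VOCAB)
print(ids)                         # [1, 2, 3]
print(decode_ids(ids, TOY_VOCAB))  # HH AW S
```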

Basic usage

>>> import psstdata

>>> data = psstdata.load()

psstdata INFO: Downloading a new data version: 2022-03-02
psstdata INFO: Loaded data version 2022-03-02 from /Users/bobby/psst-data

This will download data to the default directory (~/psst-data/) and return an object of type PSSTData, containing the train, valid, and test splits:

>>> len(data.train)

2298

>>> len(data.valid)

341

>>> len(data.test)

652

Each of those splits is a PSSTUtteranceCollection, a collection of PSSTUtterance objects that can be indexed by position or by utterance ID:

>>> data.train[0]

PSSTUtterance(utterance_id='ACWT02a-BNT01-house', session='ACWT02a', test='BNT', prompt='house', transcript='HH AW S', aq_index=74.6, correctness=True, filename='audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav', duration_frames=12752)

>>> data.train['ACWT02a-BNT01-house']

PSSTUtterance(utterance_id='ACWT02a-BNT01-house', session='ACWT02a', test='BNT', prompt='house', transcript='HH AW S', aq_index=74.6, correctness=True, filename='audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav', duration_frames=12752)
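
The two lookups above return the same record. As a rough illustration of how a collection can support lookup by position, by slice, and by utterance ID, here is a self-contained sketch. It is not the actual PSSTUtteranceCollection implementation: `UtteranceCollectionSketch` and the reduced `Utterance` tuple are hypothetical names invented for this example.

```python
from collections import namedtuple

# Hypothetical stand-in with a subset of the fields shown in the repr above.
Utterance = namedtuple("Utterance", ["utterance_id", "prompt", "transcript"])


class UtteranceCollectionSketch:
    """Illustrative only: a sequence addressable by position, slice, or utterance ID."""

    def __init__(self, utterances):
        self._items = tuple(utterances)
        self._by_id = {u.utterance_id: u for u in self._items}

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._by_id[key]  # lookup by utterance ID
        if isinstance(key, slice):
            return UtteranceCollectionSketch(self._items[key])  # sub-collection
        return self._items[key]  # lookup by integer position


collection = UtteranceCollectionSketch([
    Utterance("ACWT02a-BNT01-house", "house", "HH AW S"),
    Utterance("ACWT02a-BNT02-comb", "comb", "K OW M"),
])
first_by_position = collection[0]
first_by_id = collection["ACWT02a-BNT01-house"]
print(first_by_position is first_by_id)  # True
```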

In practice, you will likely need only four fields:

# Print the first four records in the train data

for utterance in data.train[:4]:

    # The key ingredients
    utterance_id = utterance.utterance_id
    transcript = utterance.transcript
    correctness = "Y" if utterance.correctness else "N"
    filename_absolute = utterance.filename_absolute

    print(f"{utterance_id:26s} {transcript:26s} {correctness:11s} {filename_absolute}")

"""
utterance_id               transcript                 correctness filename_absolute
ACWT02a-BNT01-house        HH AW S                    Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT01-house.wav
ACWT02a-BNT02-comb         K OW M                     Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT02-comb.wav
ACWT02a-BNT03-toothbrush   T UW TH B R AH SH          Y           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT03-toothbrush.wav
ACWT02a-BNT04-octopus      AA S AH P R OW G P UH S    N           /Users/bobby/audio/bnt/ACWT02a/ACWT02a-BNT04-octopus.wav
"""

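If your training pipeline expects a manifest file, those four fields can be written to CSV. The sketch below is one way to do that, under stated assumptions: `Record` is a hypothetical stand-in for PSSTUtterance limited to the four fields above, `write_manifest` is a name invented here, and the file paths are placeholders.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-in for PSSTUtterance, limited to the four fields above.
Record = namedtuple(
    "Record", ["utterance_id", "transcript", "correctness", "filename_absolute"]
)


def write_manifest(utterances, handle):
    """Write a header plus one CSV row per utterance; return the row count."""
    writer = csv.writer(handle)
    writer.writerow(["utterance_id", "transcript", "correctness", "filename_absolute"])
    count = 0
    for u in utterances:
        writer.writerow([
            u.utterance_id,
            u.transcript,
            "Y" if u.correctness else "N",
            u.filename_absolute,
        ])
        count += 1
    return count


# Placeholder paths; real values come from utterance.filename_absolute.
sample = [
    Record("ACWT02a-BNT01-house", "HH AW S", True, "/path/to/ACWT02a-BNT01-house.wav"),
    Record("ACWT02a-BNT04-octopus", "AA S AH P R OW G P UH S", False, "/path/to/ACWT02a-BNT04-octopus.wav"),
]
buffer = io.StringIO()
rows_written = write_manifest(sample, buffer)
print(buffer.getvalue())
```

With psstdata installed, the same function should accept a slice such as data.train[:4] in place of `sample`, and a file opened with newline="" in place of `buffer`, since it only reads the four attributes used in the loop above.
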
Uninstalling

Removing the package can be accomplished using pip: pip uninstall psstdata

You may also want to delete the data and configs. (Copy and paste rm -rf commands cautiously, of course!)

Data: rm -rf ~/psst-data
Configs: rm -rf ~/.config/psstdata
