UUDigitalHumanitieslab / sasta

Annotates speech transcripts and scores them using diagnostic metrics
BSD 3-Clause "New" or "Revised" License

other pseudonyms #153

Open JanOdijk opened 1 year ago

JanOdijk commented 1 year ago

I encounter the following other "pseudonyms" (with their frequencies) in the reference data:

In category "profession" the common value "chirurgh" should be replaced by "chirurg"

JanOdijk commented 1 year ago

I also encountered NAAM3. (with a period at the end, where it occurs at the end of an utterance). Is this allowed?

JeltevanBoheemen commented 1 year ago
  • VOORNAAM: (this should be added to the category "person")

This is already a valid code: <prefix>NAAM. You may have encountered these without replacements in older versions of SASTA, where a bug meant that only Word input was anonymised, not CHAT input. For example test utterances and their expected replacements, see: https://github.com/UUDigitalHumanitieslab/sasta/blob/adb553325b41ea379fee8133f74b7e21797eda42/backend/analysis/convert/tests/conftest.py#L76-L134

  • Lower case variants: (are they allowed?)

No. This could lead to incorrect replacements of ordinary words: "Mijn voornaam is Piet" ("My first name is Piet") -> "Mijn Jan is Piet"

  • NAAMOVERIG: (new category, should be added)

This is already a valid code: NAAM<suffix>. Same explanation as VOORNAAM.

  • A pseudonym with counter 5 (is this allowed?)

Not currently, though it would be easy to implement.

In category "profession" the common value "chirurgh" should be replaced by "chirurg"

Good catch
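
To illustrate why lower-case variants are rejected, here is a minimal sketch of case-sensitive matching. The pattern, the function names, and the replacement value "Jan" are assumptions made for illustration (only the example sentence comes from this thread); this is not SASTA's actual implementation.

```python
import re

# Hypothetical pattern: the upper-case code VOORNAAM, optionally followed
# by a single counter digit. The real SASTA codes and counters may differ.
PSEUDONYM = re.compile(r"\bVOORNAAM\d?\b")


def anonymise(utterance: str, replacement: str = "Jan") -> str:
    """Replace only upper-case pseudonym codes (case-sensitive match)."""
    return PSEUDONYM.sub(replacement, utterance)


def anonymise_ignoring_case(utterance: str, replacement: str = "Jan") -> str:
    """Same pattern, but case-insensitive: also hits the ordinary word."""
    return re.sub(PSEUDONYM.pattern, replacement, utterance, flags=re.IGNORECASE)


print(anonymise("Mijn voornaam is Piet"))
print(anonymise("Dat is VOORNAAM1"))
print(anonymise_ignoring_case("Mijn voornaam is Piet"))
```

With the case-sensitive version the ordinary word "voornaam" is left alone, while the case-insensitive version produces the incorrect "Mijn Jan is Piet".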

JanOdijk commented 1 year ago

Thanks. I did not read the documentation well enough. How do you prevent ACHTERNAAM from being analysed as the prefix ACHTER plus the code NAAM? Do you first search for the longest code in a pseudonym?

JeltevanBoheemen commented 1 year ago

Indeed, codes are checked from longest to shortest.
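
The longest-first check can be sketched as follows. This is a minimal illustration, not the SASTA code: the code list and the function names are hypothetical, and it assumes ACHTERNAAM is registered as a code of its own alongside NAAM.

```python
from typing import Iterable, Optional

# Hypothetical code list, deliberately given with the short code first.
CODES = ["NAAM", "ACHTERNAAM", "VOORNAAM"]


def find_code(token: str, codes: Iterable[str] = CODES) -> Optional[str]:
    """Return the code contained in token, trying longer codes first."""
    for code in sorted(codes, key=len, reverse=True):
        if code in token:
            return code
    return None


def find_code_in_list_order(token: str, codes: Iterable[str] = CODES) -> Optional[str]:
    """Same search without sorting: a short code can shadow a longer one."""
    for code in codes:
        if code in token:
            return code
    return None


print(find_code("ACHTERNAAM2"))
print(find_code_in_list_order("ACHTERNAAM2"))
```

Sorting by length means ACHTERNAAM is tried before NAAM, so ACHTERNAAM2 is recognised as ACHTERNAAM. Without the sort, NAAM matches first and ACHTER is left over as if it were a prefix.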