Closed: eddieantonio closed this issue 3 years ago
Yes, the recording annotations should have a tier for marking the quality as good, bad (and potentially in between). This wasn't done for the first month of the current annotation project, nor in the earlier work by MegBon and Lex. Also, the students may not have been doing this consistently, and the instructions for "good" might not have been the clearest. Erin and Alyssa will be working on completing the missing ratings as well as systematizing the already existing ones.
Anyhow, that tier is intended to be used to rule out sub-optimal recordings from presentation in production, by filtering out "bad" recordings. In my view, the assessment as "good" was supposed to be relative to the other pronunciations of the word/sentence by the same speaker, identifying the "best" recording, rather than a complement to "bad".
So, in the importation, I'd focus on filtering out those rated as "bad". It might be useful to eventually know which words have all of their recordings rated as bad - but this is probably something to be done in the recording validation app - which I'd rather call the spoken database app, since it's not only for validation but for storing and annotating.
@eddieantonio @aarppe Do we know how the identification of a recording as "good" or "bad" is progressing? In the few files I've looked at so far, the only insights I can gain from the comments tier are either "careful elicitation" or "practice".
The quality classification was supposed to be a three-way one, with good, bad, and in between. The designation of bad should indicate recordings that are compromised in various ways (overlap with another speaker, someone coughing, background noise, or otherwise not usable). The designation of good or best was a relative one when there were multiple recordings of the same word, indicating the single recording that was subjectively the best. The recordings not designated as good or bad should still be usable.
So, we'd primarily want to ignore and not present recordings that are designated bad. If all the recordings for some word have been designated as bad, then those should be flagged for re-recording. A further, secondary finesse would be to present only the good or best recordings when such are available - but as said, this is a tweak we can explore to prune the number of recordings that are made available, once we have something to prune.
Also, I could imagine in the longer term that speakers/experienced instructors could be used to flag recordings whose pronunciation may be incorrect, e.g. where the vowel length is wrong because the speaker mistook a minimal pair.
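To make that policy concrete, here is a rough sketch in Python; the record shape (dicts with "word" and "quality" keys) and the function itself are hypothetical illustrations, not the actual importer code:

```python
from collections import defaultdict

def partition_recordings(recordings):
    """Hypothetical sketch: split recordings into those worth presenting and
    the words whose every take was judged bad (and so need re-recording).

    Assumes each recording is a dict with a "word" and a "quality" that is
    one of "good", "best", "bad", or None when no assessment was made.
    """
    by_word = defaultdict(list)
    for rec in recordings:
        by_word[rec["word"]].append(rec)

    presentable = []
    needs_rerecording = []
    for word, takes in by_word.items():
        usable = [r for r in takes if r["quality"] != "bad"]
        if not usable:
            # every take of this word was judged bad: flag it for re-recording
            needs_rerecording.append(word)
            continue
        # secondary refinement: prefer good/best takes when any exist
        preferred = [r for r in usable if r["quality"] in ("good", "best")]
        presentable.extend(preferred or usable)
    return presentable, needs_rerecording
```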
The problem with this is that the words "good" and "bad" are not written in the annotation files. Instead we have "careful elicitation" and "practice".
Maybe @atticusha can weigh in? **What part of the ELAN file says we have a good recording?**
@emcgarve should know as well. We started implementing the quality assessment only after something like 1/6 to 1/5 of the materials had been annotated, so the very first snippets are still lacking that.
Hi,
So my instructions to annotators were that all the actual elicitations are rated as good, bad, or best. Some of them marked practice elicitations (where people were sort of trying some things out) at the beginning, in the same way as they marked the language being spoken.
Basically, we can ignore anything that is not best or good (for Itwêwina at least). There should never have been a case where annotation files did not use these terms. If there's a case where people only marked 'practice' or 'careful' or something, we need to redo that.
@atticusha WHICH TIER?!?!?!?!?!!!?!?!?!?!?!
Do also note that the annotations done by Lex G. and Megan B. didn't include an assessment of quality. The same applies to the very first annotations done by the undergrads, as they were only instructed to assess quality about a month after the start of their work. However, they had not yet had a chance to complete much of the annotation at the very beginning, so the impact should not have been great. Anyhow, what this means is that the first annotations (maybe up to one third) didn't have quality assessments, but the following two thirds until the end should have quality assessments according to Atticus' criteria. For the ones without the assessment, we've been planning on Daniel and Alyssa filling that in eventually - Alyssa should have sorted out those sessions that were lacking quality assessments (based on what Erin told me yesterday).
The Comments tier:
This is 2015-09-30am-Track_1.eaf:
@nienna73: you may need to scan the comments tier to find whether it actually contains the expected keywords. Expect variations in spacing and letter-case. Set the quality to the given value: good, best, or ~~unusable~~ bad; there may be a null value as well. Import all recordings, regardless of annotation quality; we can always display only the good and best recordings, but we're going to regret not importing all of the recordings.
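As one possible starting point, here's a rough sketch of pulling the comment strings out of an .eaf file with the pympi-ling library; the tier name "Comments", the helper, and the exact pympi calls are my assumptions, not something already in the importer:

```python
import pympi  # pympi-ling

def comments_in(eaf_path, tier_name="Comments"):
    """Yield the text of every annotation on the given tier of an .eaf file."""
    eaf = pympi.Elan.Eaf(eaf_path)
    if tier_name not in eaf.get_tier_names():
        return  # some sessions may not have this tier at all
    for annotation in eaf.get_annotation_data_for_tier(tier_name):
        # annotations come back as (start_ms, end_ms, value, ...) tuples
        yield annotation[2].strip()

# e.g. survey the distinct comment values in one session:
# print(sorted({c.lower() for c in comments_in("2015-09-30am-Track_1.eaf")}))
```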
The data in the Comments tier, from an anonymous source:
> the data tells us how [:poop:] the audio is in terms of representing the linguistic sign
> so if coughing or talking in background == bad
> if perfect clear, no echo, no audio distortion == best
> if mostly clear, but maybe there is a sound in the background a little, but we would still be okay presenting this to the public == good
> realistically, bad means don't use this, good means use this if you must, and best means use this above other examples (from the same speaker)
> [...] there may be other stuff on the comment tier but ignore anything but those 3
(emphasis mine)
Yes, we want to extract all snippets regardless of quality, but then be able to rule out some - in itwêwina, and perhaps in our speech technological projects as well - due to having been judged as not really usable (=bad).
Later on, we might decide that we present only the recordings marked as best, unless there aren't any for some particular word. But anyhow, we can use the quality assessments extracted from the annotations here as the basis for filtering the recordings.
Note! The comments tier can say many OTHER things like "Laughter", "Static", "English", "Indiscernible", "Cree?" and... much more.
Sometimes there is a hyphen and the reason, e.g.:
Bad-static — same as "bad". Source: 2016-06-14am-DS/Track 1_0001.eaf
I think I did this.
If the word "bad" is in the lowercase comment, the recording is bad. If the words "good" or "best" are in the lowercase comment, the recording is good. Otherwise, the recording is of unknown quality.
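A minimal sketch of that rule; the function name and the "unknown" label are mine, but the substring checks are exactly the ones described above:

```python
def quality_from_comment(comment):
    """Classify a recording from its Comments-tier text."""
    text = (comment or "").lower()
    if "bad" in text:
        # "bad" wins, so variants like "Bad-static" are still ruled out
        return "bad"
    if "good" in text or "best" in text:
        return "good"
    return "unknown"

# e.g. quality_from_comment("Bad-static") == "bad"
#      quality_from_comment("best; careful elicitation") == "good"
#      quality_from_comment("Practice") == "unknown"
```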
All snippets, regardless of quality, appear here: https://speech-db.altlab.app/ But only the not bad ones make their way to itwêwina.
@nienna73 Ok - do we have stats on how many recording snippets are marked as "bad", "good", or "best", and how many have some other quality marking that is not one of those three, and thus uninterpretable?
Here are the numbers as of today:
There are a total of 156 237 recordings.
Looking at the comments field of the recordings, I get:
- 54 743 recordings are marked as "best"
- 73 877 recordings are marked as "good"
- 3 681 recordings are marked as "bad"
This should leave 23 936 recordings of unknown quality.
HOWEVER, when I look directly at the quality that's being stored for each recording, I get these numbers:
- 129 074 recordings are marked as "good" (so, 'best' or 'good')
- 3 959 recordings are marked as "bad"
- 23 204 recordings are marked as "unknown"
These numbers don't match up, and I am genuinely confused as to why.
There are 454 entries that don't have the words "best" or "good" in their comments, and yet have been marked as "good" quality. There are 278 entries that don't have the word "bad" in their comments, and yet have been marked as "bad".
I'll investigate further and see if I can figure out why this is happening.
This just in: my queries were ever so slightly wrong and the numbers do, in fact, match!
There are:
- 3959 comments with "bad"
- 74213 comments with "good"
- 54862 comments with "best"
These numbers match the numbers from the "quality" field on recordings!
In addition, there are 8627 comments that have the word "careful" in them. I'm assuming, based on the small number of annotation files I've seen, that these comment tiers say "Careful elicitation".
If everyone agrees, I'll add "careful" to the list of words I look for and I'll mark those recordings as "good" quality.
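For what it's worth, under the hypothetical `quality_from_comment` sketch earlier in this thread, that would just mean widening the set of keywords that map to good:

```python
# hypothetical tweak: treat "careful" (as in "Careful elicitation") like "good"/"best"
GOOD_KEYWORDS = ("good", "best", "careful")

# ...and in quality_from_comment():
#     if any(keyword in text for keyword in GOOD_KEYWORDS):
#         return "good"
```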
I'm going to close this issue since we've accomplished what we initially set out to do. Anything else relating to the comments field and/or recording quality assessments can be brought up in its own issue.
Apparently, some annotations have a comments tier that marks whether a particular recording is good or bad. Integrate this into the importer!