RAP-group / empathy_intonation_perc

Methodological issues #7

Closed · jvcasillas closed this issue 3 years ago

jvcasillas commented 4 years ago

Juanjo, Kyle, Laura

items, fillers, differences between question types

juanjgarridop commented 4 years ago

- Broad focus vs. narrow focus statements: Brandl et al. do not really justify why they use these conditions. They only mention that this difference had not been tested in L2 learners before. In the discussion, they mention that the elicitation (for recording purposes) and incorporation of these conditions could be a limitation of the study. If we decide to keep this distinction, we'll need to find a way to justify it.

- Dialectal variation: Brandl et al. mention that the fact that they used recordings from speakers of different dialects could have affected the results, since dialects may differ in intonation patterns and L2 learners may not be familiar with certain dialects. I noticed that the native speakers in the study were at ceiling only in wh-questions (almost 100% accuracy) but correctly identified only 80% of yes/no questions and around 90% of statements. This could have something to do with dialectal differences in intonation. Again, if we decide to keep this, we'll need to find a way to justify it.

- Same words vs. different words: the visual and auditory stimuli for the wh-questions in the mismatch condition differ structurally from the very beginning, so participants do not really need to pay attention to intonation to identify the mismatch. This is reflected in the RTs and may even explain why all participants were more accurate with wh-questions than with the other conditions. Laura suggested pairing a question with a subordinate clause (e.g., '¿Cuándo llega a Madrid?' vs. 'cuando llega a Madrid'), which would solve this problem.

- Fillers: Brandl et al. use fillers that differ in sentence structure (semantic mismatch, first DP mismatch, last DP mismatch), which may direct participants' attention to sentence structure and away from intonation. This is reinforced by the fact that the visual and auditory stimuli in the wh-question mismatch condition are different. We need to keep this in mind when we create our fillers so that participants are actually making decisions based on intonation patterns.

- Fillers: there are only four fillers in the stimulus document. Do we need to create more? If our focus is mainly the mismatch condition, and we are studying whether participants identify an intonational mismatch, could we use the match condition as fillers and only the mismatch items as experimental? If not, we could include fillers that differ in one specific sound or phoneme (e.g., the word 'bata' in the auditory stimuli but 'pata' in the visual stimuli). We could possibly use those data for another article on phoneme perception in sentence context, or something like that (just an idea; I don't know if it makes sense).

kparrish92 commented 4 years ago

Broad focus vs. narrow focus statements: I agree with @juanjgarridop. They seem to say that part of the stimuli was recorded by reading a list, but that one of the statement conditions was elicited in a line-reading type dialogue. We were not provided these materials (the "lines" used to elicit narrow-focus statements), nor do we know how these conditions differ from each other (in terms of f0 contours) or how they relate to the contours of the questions. If we want to include both statement conditions (broad and narrow focus), then we need to figure out how Brandl et al. recorded them in particular, and we would likely need the dialogues they used.
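
If we do end up re-recording both statement conditions ourselves, something like the sketch below could help us compare the contours (Python, using the parselmouth wrapper around Praat; the file names are hypothetical, just to illustrate the idea):

```python
import numpy as np
import parselmouth  # Praat wrapper: pip install praat-parselmouth


def f0_contour(wav_path, time_step=0.01):
    """Return (times, f0 in Hz) for one recording; unvoiced frames become NaN."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(time_step=time_step)
    f0 = pitch.selected_array["frequency"].copy()
    f0[f0 == 0] = np.nan  # Praat codes unvoiced frames as 0 Hz
    return pitch.xs(), f0


# Hypothetical file names for one item recorded in both statement conditions
for condition, path in [("broad", "item01_broad.wav"), ("narrow", "item01_narrow.wav")]:
    times, f0 = f0_contour(path)
    print(condition, round(np.nanmin(f0)), round(np.nanmax(f0)))
```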

Dialectal Variation: I pretty much agree with Juanjo. Here's what I had written:

We should consider whether the use of intonation varies depending on the dialect of the speakers who record our stimuli. Brandl et al. mention that previous work (Trimble, 2013) found that English-speaking (I assume) L2 Spanish learners disambiguated Peninsular Spanish questions and statements more easily than stimuli produced by Venezuelan Spanish speakers, due to the similarity between the intonational patterns of Northern Peninsular Spanish and English (no dialect given) in yes-no questions. Yet the authors use 8 varieties of Spanish in their stimuli without discussing cross-dialectal differences in the use of intonation in their conditions, and no predictions were made on the basis of the dialect spoken by the sources of the stimuli used in the study.

Same words vs different words: "Laura suggested using a question and a subordinate clause (e.g., '¿Cuándo llega a Madrid?' and 'cuando llega a Madrid'), which would solve this problem."

Love this idea. The issue can be seen in their "wh-question mismatch" condition (see Table 4 in Brandl et al.). This is one of the bigger confounds in this study, to me, since it's unclear whether the question word or the intonation guided participants' decisions. [Table 4 from Brandl et al. was attached here as an image.]

Fillers: they had 64 experimental items and 64 fillers. The 64 experimental items were split into 4 conditions (16 items per condition), and each condition consisted of 8 matches and 8 mismatches.
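
For reference, here is a minimal sketch of that structure as a trial list (the condition labels and item names are placeholders, not the actual stimuli):

```python
import random

# Placeholder condition labels based on how I read their design
CONDITIONS = ["broad_statement", "narrow_statement", "yn_question", "wh_question"]


def build_trial_list(items_per_condition=16, seed=1):
    """Per condition, half the items are audio/text matches, half are mismatches."""
    rng = random.Random(seed)
    trials = []
    for cond in CONDITIONS:
        match_flags = [True] * (items_per_condition // 2) + [False] * (items_per_condition // 2)
        rng.shuffle(match_flags)
        for i, is_match in enumerate(match_flags):
            trials.append({"condition": cond, "item": f"{cond}_{i + 1:02d}", "match": is_match})
    rng.shuffle(trials)
    return trials


print(len(build_trial_list()))  # 64 experimental trials before any fillers are added
```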

This might be a silly question, but what is our justification/need for fillers?

Pressing issues regarding our stimuli:

  1. Organize the materials provided to us so that we have 16 items for each condition (I don't believe there are enough stimuli for each condition in what was provided by the previous authors).

  2. Decide how to record the statement conditions: either decide for ourselves or find out how to replicate the two statement conditions, such that they are consistently different from one another but share the same intonation within a condition. Likewise, we ought to consider whether or not both conditions are really necessary.

  3. Decide on fillers - If we want fillers, we have 4 and we need 64. In addition to considering the potential issues of the fillers that we have, we may need to acquire the additional 60 that were missing, or create our own.

laurafdeza commented 4 years ago

I don't have much to add. Beyond what Kyle and Juanjo have already stated, my main concerns were:

  1. The differences in intonation across varieties, especially the Caribbean varieties, since they tend not to invert subject and verb in questions, which might affect intonation as well.
  2. The lexical cue in wh-questions. If we also include those as statements, I think we could solve the problem. To record them, we might need to do some kind of interview, such as "¿Cuándo te llamaron? Cuando llegaron a Málaga." Although, at the same time, the authors say about wh-questions: "Wh-questions differ intonationally from yes-no questions and statements in both English and Spanish, and serve as a baseline measurement of participants’ perception of intonation when complemented by morphosyntactic variation." (end of page 15).
  3. I was also wondering, in sentences such as "Daniel iba a Bolivia," whether the preposition merges with the verb or not.
  4. It would have been nice to include the origin of the speaker as a predictor to see differences in performance depending on the variety.
jvcasillas commented 3 years ago

Spanish varieties for stimuli

jvcasillas commented 3 years ago

(random notes)

Paradigm

Stim

NOTES



kparrish92 commented 3 years ago

Overall suggestions:

Should we use null subjects exclusively to avoid word-order confounds?

This follows Juanjo’s suggestion that we could exclude the overt proper-noun subject (people’s names) in the subordinate clause. We could consider using null subjects in all conditions to avoid the potential use of word order to disambiguate questions from statements. Although all the stimuli would then have the same word order, it’s possible that subject-verb inversion is a preferred cue for some learners to tell the difference between questions and statements (e.g., "Daniel iba a Bolivia" vs. "¿Iba Daniel a Bolivia?"). I am suggesting that learners may make their decisions in this condition based solely on word order when there is an overt subject. If we do this, we should consider whether we want to maintain an average utterance length in syllables/total duration.

Should we balance imperfect and present tense stimuli?

There are more past-imperfect stimuli (62.5%) than present-tense stimuli. Should we consider tense as a random effect, or balance the tenses by changing one of the stimuli from imperfect to present? It seems like it was done this way to keep the utterances at around 9 syllables each.
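
If it's useful, a quick check of the split (overall and per condition) could guide that decision; a small sketch, where the file and column names are assumptions about how we store the stimuli:

```python
import pandas as pd

# Hypothetical stimulus table; 'condition' and 'tense' column names are assumptions.
stim = pd.read_csv("stimuli.csv")

# Overall split (currently ~62.5% imperfect) and the split within each condition,
# which is what matters if we keep tense as a covariate rather than rebalancing.
print(stim["tense"].value_counts(normalize=True))
print(pd.crosstab(stim["condition"], stim["tense"]))
```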

Specific stimuli:

  1. Mariano hablaba del agua

This one does not seem highly plausible to me without context, so we could consider a different noun.

  2. El niño oía el río.

This one might also benefit from a more predictable noun.

  3. El bebé bailaba muy bien.

This seems not to be very felicitous. I don’t know many dancing babies, unless their parents make them dance by manipulating their little arms. This would be solved if we decided to omit the overt subjects.

  4. La amiga vive en Orlando.

We will not have Floridians exclusively listening to this, so familiarity with Orlando may vary based on where participants live. We could consider "en la casa" instead of "en Orlando".

  5. Manuela vendía el huevo.

I agree that this one is a little weird too. The word "egg" is semantically surprising and might distract the listener from attending to the intonational cue (semantic buffer).

  6. Mi novio venía al lago

This one reads a little weird to me when it's detached from context, because of the tense. We could consider the present tense.

  7. Mi abuela odia a la reina.

A potentially surprising ending semantically ("la reina"). I think we should consider word predictability overall in these stimuli (see the rough frequency check sketched below).
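
As a first pass at predictability, we could at least check the lexical frequency of the nouns flagged above (a sketch using the wordfreq package; frequency is only a proxy, and cloze norms would be the real test):

```python
from wordfreq import zipf_frequency  # pip install wordfreq

# Zipf scale runs roughly from 1 (very rare) to 7 (very frequent)
for noun in ["agua", "río", "huevo", "lago", "reina", "casa"]:
    print(noun, zipf_frequency(noun, "es"))
```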

jvcasillas commented 3 years ago

Great observations. Thanks!

laurafdeza commented 3 years ago

I agree that we could have null subjects in wh- questions.

Personally, I don't think tense should matter much as we are not looking into morphosyntax, and I would assume that intonation is the same regardless of tense, but this is just my opinion.

Like Kyle, I see some predictability issues in terms of semantics. The experimental stimuli I find trouble with are:

juanjgarridop commented 3 years ago

I agree with both of you.

I think using null subjects could save us a lot of trouble with the word order confound. All of the Wh-questions in their stimuli have subject-verb inversion which helps listeners identify questions without even paying attention to the intonation.

Yes, some of the sentences sound a little weird in terms of semantics. I counted only 5 sentences that I think need to be slightly modified, some of which were already mentioned by Kyle and Laura. I believe this is important because we do not want semantic unpredictability to draw attention away from the intonation.

jvcasillas commented 3 years ago

https://github.com/RAP-group/empathy_intonation_perc/commit/218771eacfaa9c155c1a31cf8bbe6f64bd52564a incorporates the following changes:

jvcasillas commented 3 years ago

@laurafdeza @juanjgarridop you are both good to go now for recording the stimuli. I recommend you fork the repo, add your files, and then submit a pull request (as opposed to committing to master).

jvcasillas commented 3 years ago

A disadvantage of removing the subject from wh-questions is that these stim now have fewer total syllables. RTs will automatically be shorter, which undermines the between-condition comparison (though we still have it with response accuracy). I don't like this, but I guess it is ok since our focus is empathy and proficiency (we can still see how the continuous variables affect the conditions independently).

juanjgarridop commented 3 years ago

If we want to keep the number of syllables/words equal across all conditions, we could add an adverb or a prepositional phrase to the wh-questions that are too short.

Another option is what Kyle suggested on Friday: we could use null subjects in all the conditions. That way we would just remove the subject nouns from all the items, and they would all decrease by only one word.

I think it is important to be able to compare between conditions.

jvcasillas commented 3 years ago

A couple of observations... I assumed that all sentences had 9 syllables, but I am checking now and this is not the case (some have 7 or 8). Regarding the wh-questions, by removing the subject (and adding the wh- element) these sentences end up having the same syllable count in most cases (duh). I am going to count syllables in all items now and make some adjustments so that all have 8 (or 9, whichever requires less work).
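
In case it saves anyone time, a rough heuristic counter along these lines could do a first pass (a Python sketch that counts orthographic vowel nuclei; it does not model synalepha across word boundaries, so counts still need a hand check):

```python
import re

STRONG = set("aeoáéó")      # strong vowels (accented í/ú break diphthongs instead)
ACCENTED_WEAK = set("íú")
WEAK = set("iuü")
VOWELS = STRONG | ACCENTED_WEAK | WEAK


def count_syllables(word):
    """Rough Spanish syllable count: one syllable per vowel nucleus,
    treating weak + strong vowel sequences as diphthongs (heuristic only)."""
    nuclei = 0
    prev_vowel = ""
    for ch in word.lower():
        if ch in VOWELS:
            hiatus = (
                (ch in STRONG and prev_vowel in STRONG)
                or ch in ACCENTED_WEAK
                or prev_vowel in ACCENTED_WEAK
            )
            # new nucleus after a consonant, or at a hiatus between two vowels
            if not prev_vowel or hiatus:
                nuclei += 1
            prev_vowel = ch
        else:
            prev_vowel = ""
    return max(nuclei, 1)


def count_utterance(utterance):
    words = re.findall(r"[a-záéíóúüñ]+", utterance.lower())
    return sum(count_syllables(w) for w in words)


print(count_utterance("Mariano hablaba del agua"))  # 9 with this heuristic
```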

jvcasillas commented 3 years ago

Ok. Pretty straightforward changes via https://github.com/RAP-group/empathy_intonation_perc/commit/69bf61f2aace4c72801c24e3c2086dddd4f459df. All targets are 8 syllables long. Obviously this doesn't mean the stim will be of the same duration, but we can rely on the random effects structure of the model to at least be informative if there are any issues.
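
For concreteness, one version of that random-effects structure (crossed by-participant and by-item intercepts via variance components in statsmodels; the column names and the log-RT outcome are assumptions, and an lme4-style package would work just as well):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial-level data; 'log_rt', 'condition', 'participant',
# and 'item' column names are assumptions for illustration.
df = pd.read_csv("trials.csv")
df["all"] = 1  # single dummy group so participant and item intercepts are crossed

model = smf.mixedlm(
    "log_rt ~ condition",
    data=df,
    groups="all",
    vc_formula={
        "participant": "0 + C(participant)",  # by-participant intercepts
        "item": "0 + C(item)",  # by-item intercepts absorb duration differences
    },
)
print(model.fit().summary())
```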

UPDATE...

I keep coming back to this and I can't get my head around how the Brandl et al. study can be remotely informative regarding between-condition comparisons with stimuli of different durations (from a variety of speakers). They specifically present audio and text simultaneously. I think this is probably the reason why the mismatched conditions have such low accuracy. Their reasoning is that they didn't want working memory to be an issue ("This mitigated any potential confounding effects of variation in participants’ working memory, which may have been taxed by the retention required if stimuli were presented separately"). The problem is that hearing mismatched stimuli is also taxing/confusing. It makes me question how much their task actually measures processing of intonation.

We need to talk about this on Friday. I think we should reconsider this part of their methodology. We are better equipped to evaluate individual differences in a paradigm in which there isn't a confound in the stimuli (duration). Think about this if you have time, @juanjgarridop @laurafdeza @nmrodriguez @kparrish92

juanjgarridop commented 3 years ago

Brandl et al. mention: "RTs were measured starting at the onset of presentation of the stimuli." (p. 17) If the utterances across conditions had different durations, then this is another confounding variable. As you mentioned, Joseph, stimulus duration might have interfered with the RTs. One possible way to deal with the differences in utterance duration would be to measure RTs from the offset of each utterance and to record responses only after the whole utterance has played. I don't know how feasible this is. Or we could try to make all utterances the same length. I agree that whether their task measures processing of intonation is questionable, not only because of the differences in stimulus duration and the cognitive challenge of reading and listening to mismatching stimuli, but also because of the differences in word order between certain conditions, which we have already discussed.
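
If we log raw RTs from stimulus onset, the offset-locking could also be done after the fact; a sketch, where the file and column names are assumptions:

```python
import pandas as pd

# Hypothetical logs: raw RTs measured from stimulus onset (ms) plus a
# lookup table of each item's audio duration (ms).
trials = pd.read_csv("trials.csv")        # assumed columns: item, rt_from_onset_ms
durations = pd.read_csv("durations.csv")  # assumed columns: item, duration_ms

df = trials.merge(durations, on="item")
df["rt_from_offset_ms"] = df["rt_from_onset_ms"] - df["duration_ms"]

# Responses made before the utterance finished can't be intonation-driven in the
# same way; flag them rather than silently keeping them.
df["answered_before_offset"] = df["rt_from_offset_ms"] < 0
print(df["answered_before_offset"].mean())
```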

Another possibility for eliminating the cognitive challenge of reading and listening to mismatching information is what we considered some weeks ago. Instead of presenting visual and aural input simultaneously, we could just present aural input and have them decide whether the utterance is a question or a statement by pressing one of two buttons. We would miss the comparison between matching and mismatching conditions, but we would still get valuable information regarding whether or not they can perceive intonation differences between questions and statements, which is our main goal.

nmrodriguez commented 3 years ago

I think @juanjgarridop's idea of just presenting aural input is a good solution. I remember discussing this and then deciding against it; I don't remember if it was just because we wouldn't get the comparison between the matching and mismatching conditions or if there was something else. I'll re-read the Brandl et al. study and try to think of something else before Friday.

RobertEspo commented 3 years ago

Hi everyone,

I was thinking about the issues we discussed on Friday regarding RT and the different varieties. I know our goal was to reproduce the original study, but I'm not too sure that including multiple varieties is really worth it.

I was thinking about what useful conclusions we'd be able to get from that, but I don't think anything would be too convincing. If we say something like, "It's harder to recognize wh-questions in Variety A than in Variety B," how do we know it's really a unique nuclear configuration that's making that variety more difficult and not some other segmental feature? How do we know that the nuclear configuration they're using is even unique to their variety? Maybe it appears in other varieties, but the other speakers just didn't produce that specific configuration for whatever reason.

Maybe just using one variety is better: we won't be able to say anything about cross-variety differences, but we'll be able to use RT since the stimuli will be the same length. Having RT and accuracy scores would be interesting in relation to empathy: are more empathic people able to recognize sentence types faster and/or more accurately than less empathic people? Do more empathic people respond faster but less accurately? Those sorts of conclusions would then be independent of variety; if more empathic people can recognize sentence type faster/more accurately for, say, Ecuadorian Spanish, the same will probably extend to Peninsular Spanish or Puerto Rican Spanish.

I think that the cross-variety aspect is for another study. With just one speaker for each variety, I don't think any conclusion made would be very convincing, and I think that having access to RT data would be much more interesting.

Talk to you all in two weeks!