firasm / interCLASS

InterCLASS analysis code

Cannot take mean of ordinal data #11

Open firasm opened 9 years ago

firasm commented 9 years ago

Okay, I did a bit of reading about this tonight and have reached a few conclusions:

1) Our validation survey has to be identical to the instrument we are trying to validate

2) The original CLASS paper did collapse the Likert scale to a 3-point scale, but...

3) The scoring was done as follows:

In scoring, neutrals are scored as neither agree nor disagree with the expert so that an individual student’s ‘% favorable’ score (and thus the average for the class) represents only the percentage of responses for which the student agreed with the expert and similarly for ‘% unfavorable’. The difference between 100% and the sum of ‘% favorable’ and ‘% unfavorable’ represents the percent of neutral responses.

In other words, what we need to do - assuming the experts pick “strongly agree” for all questions - is to look at the percentage of people who pick a ‘favourable response’.

4) As we suspected, what we’ve been doing so far is wrong - we cannot take the mean or SEM of ordinal data, even if it has been collapsed. There’s a short paper by Susan Jamieson (“Likert scales: how to (ab)use them”) that describes in detail why this is the case.

I suggest we all forget everything we've looked at and give me a chance to adjust the code to report those percentages rather than the means and SEMs.
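To make that concrete, here's a rough sketch of the scoring I have in mind (pandas; the 1-5 coding and column layout are assumptions, not what's in the repo yet):

```python
import pandas as pd

# Sketch only: assumes responses are coded 1-5 (1 = strongly disagree ... 5 = strongly agree),
# one row per student, one column per statement, and that the experts' favourable
# direction is "agree" for every statement (flip the mapping for reverse-coded items).
def percent_scores(responses: pd.DataFrame) -> pd.DataFrame:
    """Per-student % favourable / % unfavourable / % neutral."""
    favourable = responses.isin([4, 5])    # agrees with the expert
    unfavourable = responses.isin([1, 2])  # disagrees with the expert
    answered = responses.notna().sum(axis=1)
    pct_fav = 100 * favourable.sum(axis=1) / answered
    pct_unfav = 100 * unfavourable.sum(axis=1) / answered
    return pd.DataFrame({
        "pct_favourable": pct_fav,
        "pct_unfavourable": pct_unfav,
        "pct_neutral": 100 - pct_fav - pct_unfav,
    })
```

The class-level numbers would then just be the average of these per-student percentages, as in the quote above.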

On the bright side, I have absolutely no doubt that what I've done is wrong and we need to fix it before we proceed.

Another interesting ramification of my discoveries is that we absolutely need to decide between slider bars and a Likert scale, and then we need to stick with that choice until the end of time. This is principally because once we validate the instrument using our metric, we're stuck with it. My instinct tells me that a 5-point scale is much easier to validate because you hope that when you find an expert, they will give that question a 5. What's the analogue for a 100-point scale? Is an 85 more of an "expert" behaviour than an 80?

chrisaddison commented 9 years ago

Thanks for the update about taking the mean of ordinal data (I’m pretty sure I’ve seen that done in other papers!). The method you describe is how BIOL-CLASS does it as well.

I don’t see the issue re: slider bars vs. Likert scale. The expert consensus will be the average of all expert responses. It’s not necessarily going to be 4 or 5. In fact, I wouldn’t expect it to be exactly 5. We can assume that for initial data analysis purposes, but only until we have expert data.

Slider bars are no different, except that people can answer on a real continuum rather than at integer values. We just take the average value for the experts…

firasm commented 9 years ago

Chris, remember we can't average (I think with an interval scale like the slider bar, it is possible to average under certain conditions, let me look at it again). 

So we would still need to "collapse": if most experts picked between 85 and 100, then 85-100 becomes our "favourable response".
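Something like this is what I mean by collapsing (the 85/50 cut-offs below are made up for illustration; the real ones would come from the expert responses):

```python
# Hypothetical cut-offs for a 0-100 slider; only the experts' answers would tell
# us where "favourable" actually starts.
def collapse_slider(value: float) -> str:
    if value >= 85:
        return "favourable"
    if value < 50:
        return "unfavourable"
    return "neutral"
```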

But if we're collapsing anyway, then what's the advantage of the slider bar to begin with?

I'm not against slider bars, but I do think we need to think about how to analyze that data first.

jamescharbonneau commented 9 years ago

I've never been into slider bars. I don't see what having a continuum gets us other than the opportunity to have more "researcher degrees of freedom".

It doesn't matter what the average of the expert responses is, because the average of ordinal data doesn't mean anything. It doesn't matter if it seems like sliders give a truer average, because there is no such thing as a true average.

I think we should stay away from slider bars.

james


firasm commented 9 years ago

Mea culpa: I advocated for slider bars because I was under the mistaken impression that we'd be able to average the data, since it would be on intervals with an equal distance AND equal sentiment between two numbers (something that definitely doesn't apply to Likert).

(screenshot attached)

Since we cannot answer the question "of what...", I vote we ditch the sliders as well. In fact, what I think I might do, if we all agree, is keep the slider bar and its benefits (more deliberate moving of it) but change the scale from 1 to 5 (with the appropriate text labels).

(screenshot attached)
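If we went that route, the analysis side could just snap each slider position to the nearest labelled point before doing the % favourable scoring above (sketch only; the function name is made up):

```python
# Assumption: the slider reports a continuous value between 1 and 5 under the text labels.
def snap_to_label(value: float) -> int:
    """Map a continuous slider position to the nearest 1-5 Likert label."""
    return min(5, max(1, int(round(value))))
```
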
jamescharbonneau commented 9 years ago

Interesting. With sliders and labels it seems a person could actually respond "agree and a half".

james


jamescharbonneau commented 9 years ago

There might be an argument for slider bars if they are seen as being less subjective than a Likert scale. It's my understanding that it's the subjectivity from group to group that makes the average of ordinal data useless.

I'm not sure that argument exists, though.

I'm out of contact for a while now. We're heading to the hospital.

james
