RoboTutorLLC / RoboTutor_2019

Main code for RoboTutor. Uploaded 11/20/2018 to XPRIZE from RoboTutorLLC/RoboTutor.

1.8.9.1 end-of-story comprehension questions; original topic: READ often seems not to hear miscues -- why? #357

Open JackMostow opened 6 years ago

JackMostow commented 6 years ago

READ often fails to respond to oral reading miscues. Why?

a. The speech is not loud enough to pass the detection threshold, perhaps due to a noisy environment.

b. The ASR recognizes something but RoboTutor doesn't respond.

  1. (How) does the VERBOSE log show the ASR output?

  2. What can we do about this problem?

    a. Lower the threshold. ?: to what? -: might hallucinate speech if environment is noisy

    b. Supply the word after [5] seconds on the assumption that the kid attempted it. +: easy to implement +: solves UX problem without having to fix ASR problem -: might... keep... reading... even if kid does nothing.

  3. If we do 2b, when should it stop? a. At end of sentence. b. At end of story.

  4. At what point should it time out?
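To make option 2b concrete, here is a minimal sketch of the decision logic (in Python for brevity, though RoboTutor itself is an Android/Java app). The function name, parameters, and return values are hypothetical; the only parts taken from the discussion above are the [5]-second timeout and the stop-at-end-of-sentence rule from 3a.

```python
# Hypothetical sketch of option 2b: if the ASR accepts nothing within a
# timeout, supply the next word, but stop auto-advancing at sentence end
# so RoboTutor doesn't read the whole story by itself if the kid does nothing.

SUPPLY_TIMEOUT_SECS = 5  # the [5] above is a tunable guess, not a fixed spec

def next_action(seconds_since_last_accept, word_index, sentence_end_index):
    """Decide what READ should do while waiting on the current word."""
    if seconds_since_last_accept < SUPPLY_TIMEOUT_SECS:
        return "keep_listening"          # give the kid time to attempt the word
    if word_index < sentence_end_index:
        return "supply_word"             # speak/highlight the word, advance one
    return "wait_for_kid"                # don't auto-advance past the sentence
```

For example, `next_action(6, 3, 7)` would supply the word, while `next_action(6, 7, 7)` would stop at the sentence boundary and wait.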

@judithodili and @amyogan - Did you observe this problem? Is it ok to auto-advance after 5 seconds?

@kevindeland - Thoughts?

Thanks. - Jack

amyogan commented 6 years ago

By miscues, you mean poor reading by the children?

JackMostow commented 6 years ago

I meant a misread word, false start, regression (rereading), or long hesitation.

amyogan commented 6 years ago

Actually, I'll let Judith respond if she has any comments, because on reflection I didn't observe many students attempting reading.

JackMostow commented 6 years ago
  1. Sadly, one deterrent to reading is that RoboTutor doesn't listen when audio is enabled in AZ Screen Recorder.

    a. This problem currently afflicts all the tablets in Bagamoyo.

    b. Mugeta currently doesn't have this problem, because it doesn't screen-record at all.

    c. #299 will record session audio so we can disable audio in AZ Screen Recorder and let ASR work.

  2. Even when RoboTutor listens, it frequently fails to hear anything. Then. It. Just. Sits. There. That's what #357 is about.

  3. Amazingly, some kids get through ECHO or READ mode, by, touching, every, word.

  4. For kids who read aloud, ASR's high false rejection rate, coupled with accepting word i+1 only after accepting word i, quickly, reduces, even, fluent, readers, to, reading, one, word, at, a, time. #348 should fix this problem.

judithodili commented 6 years ago

I don't think we can lower the threshold any further. My personal sense, based on my testing, is that the Swahili ASR is not reliable enough to assess kids with. We should definitely leave the activities in so the kids get reading and speaking practice... but I'd rather we rely on multiple-choice answer options @ 90% and writing @ 70% to assess kids for all tutors... For reading... as long as they make it to the end, promote them up one... if they back out, they should repeat or something.

JackMostow commented 6 years ago

@judithodili -

  1. Do you believe that lowering the speech detection threshold is too difficult, or that it won't work? If the latter, based on what evidence, given that you haven't actually tried it?

  2. I look forward to using our multiple choice comprehension questions to assess both listening and reading. I too expect them to be more reliable than ASR, which isn't saying much ;-). But the questions aren't perfect, so I suggest a lower threshold than 90% -- say 70% or 80%.

  3. We need to distinguish the need to assess from the need to repeat. Unlike activities that teach specific content, stories (other than number, alphabet, etc.) just practice listening and reading skills. The questions are not important in and of themselves, nor are the stories. Moreover, reading or hearing the same story again doesn't have the same value as the first time around -- especially if it's repeated right away, at least in the case of fluency development. What to do?

    a. If they perform at chance, demote to an easier level or the previous story.

    b. If they bail, maybe they don't like this particular story, so skip it and proceed to the next one.

    c. If they finish but don't pass, ideally repeat, but not right away. However, we don't want to complexify our current simple student model, namely a unique cell in each matrix, by keeping track of which story to reread when. So (unless the kid picks Redo), just switch to a non-story activity by omitting story activities from the next activity menu, in order to let some time elapse before repeating the story.

    d. If they pass, proceed to the next story.
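The policy a-d above amounts to a single decision per story outcome. Here is a hypothetical sketch of it as one function; the names and the `chance_level`/`pass_level` thresholds are placeholders (the actual pass rate is still under debate in this thread), not RoboTutor's implementation.

```python
def story_outcome(finished, bailed, score, chance_level=0.5, pass_level=0.7):
    """Map an end-of-story quiz result to a promotion decision (sketch)."""
    if bailed:
        return "skip_to_next_story"    # (b) maybe they dislike this story
    if score <= chance_level:
        return "demote"                # (a) easier level or previous story
    if finished and score < pass_level:
        return "repeat_later"          # (c) omit stories from the next menu
                                       #     so time elapses before repeating
    return "advance"                   # (d) proceed to the next story
```

For instance, a kid who finishes with a score of 0.6 lands in the "repeat later" case, while a score of 0.8 advances.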

judithodili commented 6 years ago
  1. I have tested the reading tutor extensively during our internal QA testing, and I have also watched the kids use the tutor in previous videos. With the current threshold, sometimes RoboTutor will accept a word even if I tap beside the tablet... I don't actually have to say any words. Other times, I say a word correctly over and over and it just doesn't take it. I think you are trying to find the magic combination of lowering the threshold and a reliable pass rate (and the best we can do is guess a number). If you lower the threshold even further AND lower the pass rate, at that point you are just assessing whether the kid produced a sound, regardless of whether it's the right word. I think the current threshold is fine... the user experience is not painful... but I don't think there is a reliable pass rate that we can use based on speech... 70% is definitely too high... so if you insist on promoting them based on a value, lowering it is a good way to go. It would be a shame if we hold students back in the matrix and cost them time advancing in the curriculum because of issues related to the speech recognizer.

  2. I cannot chime in on cloze or picture-matching questions. Based on discussions we have had, I think the generated cloze questions have to be thoroughly vetted by a native speaker before we can come up with a reliable threshold. I also think that picture matching using the pictures from the African Storybook Project needs to be vetted thoroughly as well to come up with a reliable threshold. Overall, I very, very strongly advocate that Filipo/Leonora/Maureen/anyone else create a 10-question multiple-choice quiz per story, presented to the kids at the end of each story, based on comprehension questions similar to the EGRA questions... I have requested this several times in the past but you have mostly ignored my requests. We can interleave cloze/picture-matching/oral-response questions between pages, but we need comprehension questions similar to the EGRA that we can reliably assess (hence multiple choice). If we do this, we might not need to worry about vetting the other question types so much, because they mostly provide practice opportunities and don't have to be 100% right, and we can then rely on vetted story-level questions from native speakers for promotion purposes.

  3. The whole group can decide on a promotion policy in the Wednesday meeting... I don't disagree with anything you said as a viable option, but more feedback would be nice.

JackMostow commented 6 years ago

@judithodili - Good answers!

Re 2: Leonora is here and says that kids are used to open-ended questions, not multiple choice, so it's important to kid-test that form of question first at our beta sites before generating lots of them. (Of course both picture-matching and cloze are multiple choice too!) How?

a. Quickest is to send Fortunatus and Mwita multiple choice questions to kid-test. Leonora says 2-choice either-or questions would be familiar to kids, e.g. "Who fed the dog -- the rabbit or the cat?"

b. Would next quickest be a stand-alone app to present questions on a tablet?

c. Finally we can insert a prototype multiple-choice test at the end of a story to test the UI/UX as well.

Authentic practice for EGRA requires the open-ended format. The generic wh- questions are open-ended. It's fine to include some story-specific questions. At least one of the new stories already contains end-of-story questions, so let's use them!

A 10-item test is too long. EGRA only asks 3. Leonora suggests max 5 or 6.

Back to narrating....

judithodili commented 6 years ago

I agree 100% that we should have open-ended questions so the kids can practice... But unlike human graders, a machine has difficulty grading open-ended responses, which leaves us with multiple choice for grading accuracy.

Quickest way is to send Mwita and Fortunatus some questions and answers via google doc for them to kid test.

Two choices seem fine, and 5 or 6 questions per story seem fine as well.


amyogan commented 6 years ago

Just a quick question without weighing in on everything - isn't bubble pop essentially a multiple choice, and therefore we know kids can at least do that?

JackMostow commented 6 years ago

It's multiple choice and can be configured with 2 choices, but I believe it would need to be modified to ask a different question on each screen.

amyogan commented 6 years ago

In my comment, I meant to imply that I think we don't have to kid-test multiple choice as a concept, because we have evidence that they can choose between items (in bubble pop).

JackMostow commented 6 years ago

Ah. Plausible, but it assumes that what works for simple current tasks will also work for comprehension. Note that both the question and the response options will need to be spoken, especially to test listening comprehension without requiring (or allowing) reading of the choices.
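For illustration, a spoken two-choice question of the kind discussed here might be represented as a small record that pairs every piece of text with a narration clip, so nothing on screen requires reading. This is a hypothetical sketch; the field names and audio file names are invented, and only the rabbit-or-cat example comes from the thread.

```python
from dataclasses import dataclass

@dataclass
class TwoChoiceQuestion:
    """A comprehension question where prompt AND choices are narrated."""
    prompt_text: str
    prompt_audio: str                  # narration clip for the prompt
    choices: list                      # (text, audio clip) pairs, both spoken
    answer_index: int                  # index of the correct choice

# Example in the 2-choice either-or form Leonora suggests (file names invented):
q = TwoChoiceQuestion(
    prompt_text="Who fed the dog -- the rabbit or the cat?",
    prompt_audio="q1_prompt.mp3",
    choices=[("the rabbit", "q1_choice_a.mp3"),
             ("the cat", "q1_choice_b.mp3")],
    answer_index=0,
)
```

Storing a clip per choice is what lets the same item test listening comprehension: the tutor can play both options aloud instead of displaying them as text.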

JackMostow commented 6 years ago

@judithodili - Questions/suggestions for your quick experiment, with apologies for any that are obvious:

  1. Text: Which story or stories to test?

  2. Modality: HEAR, because so few students have reached READ -- true?

  3. When: right after a kid encounters the test story in RoboTutor, but selected by the tester.

  4. Question types: There are various types of comprehension questions, and guides for writing them:

    a. Answer is explicitly stated in one sentence in the story, e.g. "Who ate the cheese?"

    b. Answer combines information explicit in 2 or more sentences in the story, e.g. "Who ate what?"

    c. Answer requires inference of information implicit in the story, e.g. "Why was the cat surprised?"

    d. Answer requires grasping the main idea of the story, e.g. "What is this story about?"

    e. Answer requires a value judgment, e.g. "Who was your favorite character?" or "What did you like best about this story, and why?"

  5. Existing resources: The QUESTIONS tab of Swahili translations has some generic questions and choices narrated by Leonora. Ulani uses just the questions themselves, but you could turn them into multiple choice questions by providing text-specific choices or a subset of the listed generic choices.

  6. Purpose: What do we want to learn from this quick test? a. Do kids understand the task? b. Can they do it? c. What else?