Open andrenatal opened 2 years ago
Regarding how it is presented, see deliverable 1.3: https://github.com/browsermt/coordination/blob/master/docs/D1.3-Bergamot_User_interface_with_quality_estimation.pdf
Regarding how numbers map to what confidence to show, @mfomicheva ?
I've already asked this over email on Thursday but haven't received any response.
@andrenatal I have answered by email.
Mateo & co. suggested that, to be effective, QE presentation should use word-level estimates with either a single critical criterion (flagging only major or critically low QE values) or "fine-grained" markup with up to 3 criterion values (Minor, Major & Critical), using coloured text to indicate these.
There are examples of text-colouring visual effects on pages 21 and 25 of the PDF at https://github.com/browsermt/coordination/blob/master/docs/D1.3-Bergamot_User_interface_with_quality_estimation.pdf
It might be useful to be able to toggle this on or off as desired; if there's a lot of QE markup onscreen, the display will get visually "busy" quickly. I tried out that experiment and barely noticed the markup. Fast readers can read through short, simple text, apprehend the meaning & quality of the translation, and make a decision about the text themselves before actually consulting the markup.
You're going to need to be more specific about which colors to use and what to do if they clash with the background.
The examples listed on pages 21 & 25 show white backgrounds working well with blue, green and red. Currently there are no palette options in the app anyway.
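On the question of colours clashing with the background, one option would be to check the contrast ratio defined in WCAG 2.x before applying a markup colour. The sketch below assumes colours arrive as `#rrggbb` hex strings; the helper names (`parseHex`, `luminance`, `contrastRatio`, `isReadable`) are hypothetical, and using 4.5:1 (the WCAG AA level for normal text) as the cut-off is an assumption, not something decided in this thread.

```typescript
// Parse "#rrggbb" into [r, g, b] with each channel in 0..255.
function parseHex(hex: string): [number, number, number] {
  const n = parseInt(hex.slice(1), 16);
  return [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff];
}

// Linearise one sRGB channel as specified by WCAG 2.x.
function channel(c: number): number {
  const s = c / 255;
  return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
}

// WCAG 2.x relative luminance of a colour (0 = black, 1 = white).
function luminance(hex: string): number {
  const [r, g, b] = parseHex(hex);
  return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b);
}

// Contrast ratio between two colours, from 1 (identical) up to 21.
function contrastRatio(a: string, b: string): number {
  const la = luminance(a);
  const lb = luminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Assumed rule: a markup colour is usable if it reaches 4.5:1 on the background.
function isReadable(text: string, background: string, minRatio = 4.5): boolean {
  return contrastRatio(text, background) >= minRatio;
}
```

If a chosen colour fails the check against the page's actual background, the app could fall back to another way of marking the word, such as an underline, instead of coloured text.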
> Currently there are no palette options in the app anyway.
The purpose of this issue is to discuss what the app should do.
I suggest:
- Allow the user to disable QE markup altogether.
- Use coloured text for any QE markup in use.
- Allow the user to choose between either 1 level (critically low; coloured red) or 3 levels (minor/green; major/blue or orange & critical/red) of QE markup.
- Allow the user to change these colours according to personal preference using e.g. a 256-colour palette if desired.
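The options suggested above could be modelled roughly as follows. This is only a sketch: the names (`Severity`, `QeMarkupPrefs`, `defaultPrefs`, `colourFor`), the hex values, and the choice of orange rather than blue for "major" are assumptions for illustration, not anything that exists in the app.

```typescript
// Hypothetical preference model for QE markup; none of these names exist in the app.
type Severity = "minor" | "major" | "critical";

interface QeMarkupPrefs {
  enabled: boolean;                  // lets the user disable QE markup altogether
  levels: 1 | 3;                     // 1 = critical only, 3 = minor/major/critical
  colours: Record<Severity, string>; // user-adjustable text colours
}

// Assumed defaults following the suggestion: green, orange, red.
const defaultPrefs: QeMarkupPrefs = {
  enabled: true,
  levels: 1,
  colours: { minor: "#008000", major: "#ff8c00", critical: "#ff0000" },
};

// Returns the text colour for a word of the given severity,
// or null when the word should be left unmarked.
function colourFor(severity: Severity, prefs: QeMarkupPrefs): string | null {
  if (!prefs.enabled) return null;
  if (prefs.levels === 1 && severity !== "critical") return null;
  return prefs.colours[severity];
}
```

Keeping the level count and the colours in one preferences object means the renderer only has to ask `colourFor` and never needs to know which mode the user picked.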
I don't think there can be a sensible mapping to 3 levels with the current version of QE. So it's better to use binary markup: error vs. no-error.
The linked study mentioned it didn't always help in user acceptance of the translation anyway, so I don't see having only error vs. no-error as a major problem. If it can also be disabled entirely, this simplifies things either way. From what I recall about the experiment in D1.3, these levels were arbitrary values as QE markup didn't exist at all when it was conducted.
We now have the bindings for quality estimation exposed by the wasm module to the JavaScript frontend, as mentioned here: https://github.com/browsermt/bergamot-translator/pull/239, but there's no documentation on how the QE UI should behave with regard to, for example, highlighting low-confidence words, or on which scores determine such words.
In order to proceed with the development of this functionality, we need proper documentation on how the UI should behave.
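To make the open question concrete, a binary (error vs. no-error) markup could look roughly like the sketch below. Everything in it is an assumption for illustration: the `WordScore` shape, the `markLowConfidence` helper, the `qe-low` class name, the idea that higher scores mean higher confidence, and especially the threshold of -0.5. None of it is taken from the bindings in the PR linked above, whose actual output format and score range are exactly what needs documenting.

```typescript
// Hypothetical shape: one score per translated word. The real bindings may differ.
interface WordScore {
  word: string;
  score: number; // assumed: higher means more confident
}

// Escape text so words can be placed safely inside HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap every word scoring below `threshold` in a span; leave the rest untouched.
function markLowConfidence(words: WordScore[], threshold: number): string {
  return words
    .map(({ word, score }) => {
      const safe = escapeHtml(word);
      return score < threshold ? `<span class="qe-low">${safe}</span>` : safe;
    })
    .join(" ");
}

// Example with made-up scores and an arbitrary threshold.
const exampleMarkup = markLowConfidence(
  [
    { word: "The", score: -0.1 },
    { word: "cat", score: -1.2 },
    { word: "sat", score: -0.3 },
  ],
  -0.5,
);
console.log(exampleMarkup); // The <span class="qe-low">cat</span> sat
```

With the threshold passed in as a parameter, the value can be tuned or replaced once the mapping from scores to "low confidence" has been documented.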