Quality estimation scores

browsermt / bergamot-translator

Cross platform C++ library focusing on optimized machine translation on the consumer-grade device.

http://browser.mt

Mozilla Public License 2.0

Quality estimation scores #242

Open andrenatal opened 2 years ago

andrenatal commented 2 years ago

The bindings for quality estimation (QE) are now exposed by the wasm module to the JavaScript frontend, as mentioned in https://github.com/browsermt/bergamot-translator/pull/239, but there is no documentation on how the QE UI should behave, for example how it should highlight low-confidence words and which scores identify such words.

In order to proceed with the development of functionality, we need proper documentation on how the UI should behave.

kpu commented 2 years ago

Regarding how it is presented, see deliverable 1.3: https://github.com/browsermt/coordination/blob/master/docs/D1.3-Bergamot_User_interface_with_quality_estimation.pdf

Regarding how numbers map to what confidence to show, @mfomicheva ?

andrenatal commented 2 years ago

I've already asked this over email on Thursday but haven't received any response.

mfomicheva commented 2 years ago

@andrenatal I have answered by email.

ChrisBurnsOneOne commented 2 years ago

Mateo & co. suggested that, to be effective, QE presentation should use word-level estimates shown as coloured text, with either a single critical criterion (marking only major or critically low QE) or "fine-grained" markup with up to 3 criterion values (Minor, Major & Critical).

There are examples of text-colouring visual effects on pages 21 and 25 of the PDF at https://github.com/browsermt/coordination/blob/master/docs/D1.3-Bergamot_User_interface_with_quality_estimation.pdf

It might be useful to be able to toggle the markup on or off as desired; if there's a lot of QE markup onscreen, the display will get visually "busy" quickly. I tried out that experiment and barely noticed the markup. Fast readers are able to read through short, simple text, apprehend the meaning & quality of the translation, and make a decision themselves about the text before actually consulting the markup.

kpu commented 2 years ago

You're going to need to be more specific about which colors to use and what to do if they clash with the background.
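One way to detect a clash with the background, offered here as a sketch rather than anything decided in this thread, is the WCAG 2.x contrast ratio between the markup colour and the page background. The helper names below (`relativeLuminance`, `contrastRatio`, `pickReadable`) are hypothetical, and the default minimum of 4.5 is the WCAG AA level for normal-size text, not a value chosen by the project.

```typescript
// Hypothetical helpers; only the formula itself comes from WCAG 2.x.
type RGB = [number, number, number]; // sRGB channels, 0-255

// Relative luminance of an sRGB colour, as defined by WCAG 2.x.
function relativeLuminance([r, g, b]: RGB): number {
  const linearise = (channel: number): number => {
    const s = channel / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * linearise(r) + 0.7152 * linearise(g) + 0.0722 * linearise(b);
}

// Contrast ratio between two colours; ranges from 1 (identical) to 21 (black on white).
function contrastRatio(a: RGB, b: RGB): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Return the first candidate colour that is readable on the given background,
// or undefined if every candidate clashes.
function pickReadable(
  candidates: RGB[],
  background: RGB,
  minRatio: number = 4.5
): RGB | undefined {
  return candidates.find((c) => contrastRatio(c, background) >= minRatio);
}
```

With a check like this, the UI could fall back to another candidate colour, or to a non-colour cue such as underlining, when the preferred colour does not reach the minimum ratio on the current page.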

ChrisBurnsOneOne commented 2 years ago

The examples listed on pages 21 & 25 show white backgrounds working well with blue, green and red. Currently there are no palette options in the app anyway.

kpu commented 2 years ago

> Currently there are no palette options in the app anyway.

The purpose of this issue is to discuss what the app should do.

ChrisBurnsOneOne commented 2 years ago

I suggest:

- Allow the user to disable QE markup altogether.
- Use coloured text for any QE markup in use.
- Allow the user to choose between either 1 level (critically low; coloured red) or 3 levels (minor/green; major/blue or orange & critical/red) of QE markup.
- Allow the user to change these colours according to personal preference using e.g. a 256-colour palette if desired.
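As a rough illustration of these suggestions, the options could be held in a small settings object. Everything in this sketch is hypothetical (`QeMarkupSettings`, `Severity`, `colourFor` are invented names, not part of bergamot-translator), and orange is picked for "major" only because a single default is needed.

```typescript
// Hypothetical settings model; none of these names exist in bergamot-translator.
type Severity = "minor" | "major" | "critical";

interface QeMarkupSettings {
  enabled: boolean; // lets the user disable QE markup altogether
  levels: 1 | 3; // 1 = critical only, 3 = minor/major/critical
  colours: Record<Severity, string>; // user-adjustable CSS colours
}

const defaultSettings: QeMarkupSettings = {
  enabled: true,
  levels: 1,
  colours: { minor: "green", major: "orange", critical: "red" },
};

// Return the colour to apply to a word, or null when it should stay unstyled.
function colourFor(
  severity: Severity | null,
  settings: QeMarkupSettings
): string | null {
  if (!settings.enabled || severity === null) return null;
  // In single-level mode only critically low words are marked.
  if (settings.levels === 1 && severity !== "critical") return null;
  return settings.colours[severity];
}
```

For example, `colourFor("minor", defaultSettings)` returns `null` because the default is single-level markup, while the same call with `levels: 3` returns `"green"`.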

mfomicheva commented 2 years ago

I don't think there can be a sensible mapping to 3 levels with the current version of QE. So it's better to use binary markup: error vs. no-error.
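A minimal sketch of such binary markup, assuming each translated word arrives with a numeric score where lower means less confident. The `ScoredWord` shape, the `markErrors` name, and the threshold argument are all hypothetical: this thread does not state the score scale or a cut-off value, so the caller has to supply one.

```typescript
// Hypothetical shapes; the real wasm bindings may expose scores differently.
interface ScoredWord {
  text: string;
  score: number; // assumption: lower score = lower confidence
}

interface MarkedWord {
  text: string;
  isError: boolean;
}

// Binary markup: a word is flagged as an error when its score is below the threshold.
// The threshold is a placeholder to be chosen by whoever calibrates the QE model.
function markErrors(words: ScoredWord[], threshold: number): MarkedWord[] {
  return words.map((w) => ({ text: w.text, isError: w.score < threshold }));
}
```

The frontend would then style only the words with `isError` set, leaving everything else untouched.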

ChrisBurnsOneOne commented 2 years ago

The linked study mentioned it didn't always help in user acceptance of the translation anyway, so I don't see having only error vs. no-error as a major problem. If it can also be disabled entirely, this simplifies things either way. From what I recall about the experiment in D1.3, these levels were arbitrary values as QE markup didn't exist at all when it was conducted.