Closed abhi-agg closed 2 years ago
Looks like, QE ignores tags completely (treating them as if they are non-existent) in the translated sentences and compute the byte ranges of the words in the translated sentences.
Because I translated How are you doing?
after removing all the tags from html sentence and I get the same byte ranges for both html and non-html versions.
Attached image for reference.
Ah, I totally missed that. I was under the impression that QualityEstimation scores uses those same offsets that are stored in the AnnotatedText
class that HTML updates. It does… but it aggregates them into words and then stores its own byte offsets all before the HTML is inserted into the translation.
I'll see whether I can make the QualityEstimator
class work on token offsets instead. Then it automatically stays up to date with the byte offset updates that HTML makes. (And it would also benefit from changes like #298.)
@jelmervdl
I'll see whether I can make the QualityEstimator class work on token offsets instead.
QE uses words (not continuous subwords), I don't expect the above approach to fare well.
I have always wanted QE to use Annotation after relaxing the continuity constraints. Then we should be able to uniformly update QE "annotation" (based on words, no continuity) and plaintext vocab "annotation" (based on subwords, sentence piece guaranteed continuity).
This will break the negotiated API between QE and Mozilla, but I strongly believe is the proper way to go. Especially since #298 also points towards this. In effect, this would mean pushing down some of the "insert" tag (opening or closing) before token t
as a method on to Annotation, which automatically fixes the rest of the Annotation. While I do not know the full detail, I understand that in the current state of HTML you are editing Annotation
outside the class to accomplish what is currently happening in HTML - some of this would be moving in and contained within Annotation. This means you can fix up the QE "Annotation" the same way you fix up the plain-text annotations.
I do not expect the above to be trivial.
QE is slated to be implemented in W5 (March 7-11). HTML is expected to be implemented in W3 (February 21-25), which is next week. I guess it's reasonable to keep this soft at W4 here and prioritize based on how HTML completes or not by W3.
QE uses words (not continuous subwords), I don't expect the above approach to fare well.
My idea was to remove the subwordToWord
call here: https://github.com/browsermt/bergamot-translator/blob/2844cedb0dc8b67df5b656e61b69e194a6658501/src/translator/quality_estimator.cpp#L30
That call just turns the token ranges that QualityEstimator already uses internally into byte ranges:
That step is something a client can easily do itself, it has access to the AnnotatedText.
Hell, I could even integrate it in a way similar to 83e869c288758ba2915862c90db384748065cc42 at that point, and have quality estimation scores be part of the output HTML. Then the client wouldn't need to do any work at all. But that all depends on the scenario in which we need both HTML and quality scores at the same time.
QE is slated to be implemented in W5 (March 7-11). HTML is expected to be implemented in W3 (February 21-25), which is next week. I guess it's reasonable to keep this soft at W4 here and prioritize based on how HTML completes or not by W3.
The whole idea of filing this issue was based on @kpu's suggestion (btw @kpu thanks a lot for this suggestion 👍🏾) in last plenary to start integrating and find bugs quickly so that technical support from involving partners could be solicited asap.
QE feature is definitely going to be a part of the final delivery and so is the HTML. In my opinion, this fix doesn't have to wait till W4. We need to test QE again after this bug is fixed and that might resurface some new issues which further might take some time to fix.
All of this goes in general for anything that is becoming a part of the final delivery. cc @andrenatal @lonnen
@jelmervdl is working on it.
My idea was to remove the
subwordToWord
call here:That call just turns the token ranges that QualityEstimator already uses internally into byte ranges:
That step is something a client can easily do itself, it has access to the AnnotatedText.
@jelmervdl Thanks for working on it. I would add one thing. If I remember correctly, the translation models (and perhaps supervised QE models as well) provide scores of the "subwords" and then some post-processing is done to convert the "subword" scores to "word" scores in the engine. We discussed in the beginning with the QE team and @jerinphilip that it is better for the client to receive the quality scores at "word" level so that client doesn't need to do this post-processing. It is more efficient to do it in the engine as it has greater control and access to more information, plus avoiding replicating the same logic for every client. The client was kind of made oblivious to the concept of "subword" when the html translation came inside the engine.
Please correct me if I am wrong somewhere. Open for discussion here if keeping it the same way for html text poses a technical challenge 👍🏾
@abhi-agg @andrenatal Please take a look at #357 and #358 to express an opinion about API.
Of course QE words and HTML tags are unsynchronized i.e. a QE word can span over an HTML open or closing tag. So #358 repeats the QE score tag as necessary to cover the QE word.
We discussed in the beginning with the QE team and @jerinphilip that it is better for the client to receive the quality scores at "word" level so that client doesn't need to do this post-processing.
I agree with the reasoning here. In both #357 I effectively moved the subwordToWord
call to the WASM bindings part. In Javascript you'll still receive scores per "word" (where the byterange for that word can include some HTML tags as well, but never more text)
In #358 you get this behaviour for free since I assign the same <font x-bergamot-word-score="">
tag to all subwords in a word, and the HTML insertion part is smart enough to then group all subwords into the same font tag.
Open for discussion here if keeping it the same way for html text poses a technical challenge
QE has only delivered three models and they're all out of English, so there is no bidirectional QE. Which is sad because QE might have been useful in that context. However, text/plain translation remains a requirement.
There is a mockup. See https://github.com/browsermt/coordination/blob/master/docs/D1.3-Bergamot_User_interface_with_quality_estimation.pdf in the private repo.
There is a mockup. See https://github.com/browsermt/coordination/blob/master/docs/D1.3-Bergamot_User_interface_with_quality_estimation.pdf in the private repo.
I found the following underline based proposal, and a few variants in Section 5 relating to QE.
From meetings, I think it's important to discuss the following. I think @jelmervdl's render might have to do the thresholding (in C++) that was earlier meant for @abhi-agg to do to get the probability values to classes (Section 5: Major/Critical, Major/Minor/Critical, Poor/Ok). Right now we're exporting real values as data. I'm not sure if the decision rule can be applied in CSS (I've lost track of how capable it has become), @jelmervdl's currently using continuous colouring. There is the alternate option of picking the data from attributes and doing the attaching the class to the element in JS.
All mock-ups seem to assume plain text (e.g. a form input), not HTML. In that case, what is currently in main will be sufficient. #357 won't hurt, but the offsets weren't broken to begin with in this scenario.
To clarify a bit further: #357 just fixes the byte offsets in case the input is marked as HTML. If the input was text, there was no issue. So for form translation (which I assume are always plain text unless you're going to attempt auto translation in contenteditable
elements…) it is not necessary.
I think we should get #357 in to close this issue as that addresses the original post and meets our original agreement. @abhi-agg can mull the HTML offer.
Written to @abhi-agg and @andrenatal seeking comment on #357 and #358 and am coming to the notion #358 is a better option. Blocked awaiting their feedback.
I have few queries here:
Am I right?
@abhi-agg Plain text remains as it was already working. However, what I would recommend as the easiest thing for you is to follow the suggestion in: https://github.com/browsermt/bergamot-translator/issues/355#issuecomment-1044425813 specifically:
HTML: pick #358 and you get the tags back.
text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)
Then you don't have to handle byte ranges in either case.
HTML: pick #358 and you get the tags back. text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)
Using this approach means always setting ResponseOptions::HTML flag to true
. Right? And in that case wouldn't the engine crash for sentences like Hello < world
(e.g this issue).
HTML: pick #358 and you get the tags back. text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)
Using this approach means always setting HTML flag to
true
. Right? And in that case wouldn't the engine crash for sentences likeHello < world
?HTML: pick #358 and you get the tags back. text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)
Using this approach means always setting ResponseOptions::HTML flag to
true
. Right? And in that case wouldn't the engine crash for sentences likeHello < world
(e.g this issue).
Hello < world
is a valid HTML snippet. Hello < world
is not. That's what encode refers to.
I suppose one could always have HTML mode on then implement text/plain translation by encode -> pass as HTML -> decode. That would be less efficient though.
For QE, to render the colors, you are presumably doing HTML rendering in all cases anyway, hence the suggestion of having us generate the HTML output even if the input is (escaped) text/plain.
Little Javascript example of how to encode plain text to HTML for when you want to submit plain text to the engine but want HTML as output:
function encodePlainTextAsHTML(text) {
const div = document.createElement('div');
div.appendChild(document.createTextNode(text));
return div.innerHTML;
}
And then also set responseOptions.html = true.
All mock-ups seem to assume plain text (e.g. a form input), not HTML.
If plaintext (textarea, document) is being translated, I don't think there's a way to annotate it with reds or colors (https://stackoverflow.com/a/12831155/4565794) without converting to HTML first.
Suggest taking a look at Grammarly for an example. It's essentially the same problem, highlight poor confidence tokens and leave the others be. The textarea or input field is kept hidden, a contentEditable
div takes over. The rich text annotations go on the div which support HTML. The textarea or form field can be populated with just the text content.
@jerinphilip I assume that people type in the translation input field, and the quality is shown in a separate translation output field (which isn't editable when the UI is in this state…). Only once the translation is accepted you'd need to convert the translation output field to plain text to be inserted in the actual form field.
Hello < world
is a valid HTML snippet.Hello < world
is not. That's what encode refers to.
Oh. Thanks for correcting 👍🏾
Little Javascript example of how to encode plain text to HTML for when you want to submit plain text to the engine but want HTML as output:
function encodePlainTextAsHTML(text) { const div = document.createElement('div'); div.appendChild(document.createTextNode(text)); return div.innerHTML; }
And then also set responseOptions.html = true.
@jelmervdl With responseOptions.html = true
, the sentence Hello < world
if passed as it is will break the engine but the encoded version of it will not. Right?
@abhi-agg Yep! With responseOptions.html = true
:
Okay: Hello < world
Not okay: Hello < world
@jelmervdl Thanks for clarifying. Please correct me if I am wrong but with the approach https://github.com/browsermt/bergamot-translator/pull/358#issue-1142083849, extension will still have to parse the html translation result to show color-coded words based on thresholds.
@abhi-agg How exactly are you showing the color-coded words? Wouldn't the natural way to do that be rendering the HTML with some CSS?
I've added an example of how the output of #358 can be used to the demo page in that branch. That example treats the input field as plain text, escapes it, and renders the translation output as HTML. Then with a little bit of Javascript for thresholding and CSS I render sentence and word level quality indicators.
(Based on thresholds that have no meaning, I just wanted some thresholds that are met often enough for screenshot purposes.)
A quick experiment with wasm test page for html text translation shows weird byte ranges for words.
An example:
Looks like, QE ignores tags completely (treating them as if they are non-existent) in the translated sentences and compute the byte ranges of the words in the translated sentences. Is it something that needs to be fixed or am I doing something wrong at my end?
Attaching the image for detailed results.
cc @kpu @jerinphilip @abarbosa94 @felipesantosk @mfomicheva