kpu opened this issue 3 years ago
Aren't high-OOV-rate pages a tiny fraction of the pages intended for the target audience, or even of the web? Moreover, isn't this best handled by the browser's language detection? The browser sends us fragments of text where a user has configured translation from language x to language y, so our inputs are (almost) guaranteed to be in language x. Do we need to complicate bergamot-translator for this?
In general there can be an OOV like this in the middle of a sentence, and langid would be as correct as it can be. If there is an OOV token in the input, we should seek to copy its contents to the output.
All this repository does is add a layer on top of raw Marian with sentence splitting. This is the right place to do it.
Reading this as: simply implement "replace an unknown in the target with the max-matching piece from the source text, using alignments". Still unsure where the "only if high OOV rate" switch comes from. Is it supplied from outside? Do we compute it with SentencePiece and switch internally? Can't we just do this for all OOVs?
If there's an unknown in the source, chances are there's an unknown in the target, and the alignments between source and target unknowns are bijective (is this a correct assumption? There could be corner cases).
So: build the translated text and alignments together, replace the existing decoded string with a new decoded string, and update the decoded string's ByteRanges accordingly. In other words, inject a transform somewhere in the code, where `transform(vanillaDecoded, vanillaDecodedByteRanges, alignments) = (decoded, decodedByteRanges)`.
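A minimal Python sketch of what such a transform could do, assuming character-offset ByteRanges and a target-to-source hard alignment map (all names here are illustrative, not bergamot-translator's actual C++ API):

```python
# Illustrative sketch only; bergamot-translator's real API differs.
# decoded_ranges / source_ranges are (begin, end) character offsets per token;
# alignment maps a target token index to its aligned source token index.
def transform(decoded, decoded_ranges, source, source_ranges, alignment):
    out, new_ranges, cursor = [], [], 0
    for t, (begin, end) in enumerate(decoded_ranges):
        piece = decoded[begin:end]
        if piece.strip() == "<unk>" and t in alignment:
            s_begin, s_end = source_ranges[alignment[t]]
            # Copy the raw source surface over the target <unk>.
            piece = piece.replace("<unk>", source[s_begin:s_end])
        out.append(piece)
        new_ranges.append((cursor, cursor + len(piece)))
        cursor += len(piece)
    return "".join(out), new_ranges

text, ranges = transform(
    "Salut <unk> là", [(0, 6), (6, 12), (12, 14)],
    "Hi FOO there", [(0, 3), (3, 6), (6, 12)],
    {1: 1},
)
print(text)  # Salut FOO là
```

Note the ByteRanges are recomputed as the string is rebuilt, which is exactly the bookkeeping the proposal above asks for.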
From the following output I'm unsure where to begin; any pointers from Marian experts? Are the '.' tokens `<unk>`? Here's a sample output from feeding OOVs to the de-en model, with alignment information printed:

A. A replace-`<unk>`-from-source capability, irrespective of the target being `<unk>` or not.
`encodeWithByteRanges` has access to the surface text, which means it can distinguish between different `<unk>`s in a source line. For a first implementation, we can just keep the existing handling of `<unk>`.

`decodeWithByteRanges` is perhaps the best place to put this. Alignments are available after beam-search decoding, where the data is still `marian::Words`. The decoded words will eventually contain `<unk>` once we sort out the data cleaning. The words are supplied to `decodeWithByteRanges`, which we will extend to take two (optional) additional arguments: source raw string-views corresponding to the words, and the `HardAlignment` extracted after beam search. With these, we can use the `HardAlignment` to i) resolve which source `<unk>` aligns to a decoded unit, and ii) replace that unit (`<unk>` or not) in the decoded surface with the unnormalized raw surface from the source text. Implementation-wise, we are simply accepting source text and alignments in this function and using them internally to accomplish the desideratum.

The above should work even when the NMT systems are not trained with many `<unk>`s in the training data (hopefully, to some extent), which means it should work when punctuation is output instead of `<unk>`, as in the earlier examples in this thread. Errors will occur when the decoder LM fills in the blank for some `<unk>` with a true translation learned from the surrounding context, which we would then overwrite with (probably matching) source text. I think this is a reasonable trade-off and the first part of solving this.
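To make the `HardAlignment` step concrete: one common way to obtain hard alignments is an argmax over each target token's soft attention row. A toy sketch (an assumption for illustration, not Marian's actual extraction code):

```python
# soft[t][s] is the attention weight of target token t on source token s.
def hard_align(soft):
    # One source position per target token: the argmax of its attention row.
    return [max(range(len(row)), key=row.__getitem__) for row in soft]

soft = [
    [0.7, 0.2, 0.1],  # target token 0 attends mostly to source token 0
    [0.1, 0.8, 0.1],  # target token 1 (e.g. an <unk>) -> source token 1
    [0.2, 0.1, 0.7],  # target token 2 -> source token 2
]
print(hard_align(soft))  # [0, 1, 2]
```

With a per-target-token source index, step i) above (resolving which source `<unk>` a decoded unit aligns to) becomes a direct lookup.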
As far as I understand, the `spm_encode` used in student training with guided alignments is the text-based one, which means `fast_align` is run on the corpus with the raw-text representation of what Marian would see as `<unk>`. Alignments are therefore already learnt the way we want (distinguishing between different `<unk>`s through alignment, assuming `fast_align` works). Hence `fast_align` and guided alignment already prepare the network to learn the copy-task for `<unk>` (if the data is not super clean).
B. Emojis

There was some discussion around solving this as an emoji pass-through problem, and around distinguishing multiple emojis, dealing with class imbalance, etc. I think we can introduce a placeholder mechanism for a copy-task, extending A but still contained in the same two functions. Let there be N control symbols denoting "placeholders" marked for copying. We then only need a mechanism within `encodeWithByteRanges` which assigns different placeholders to different surface text. The assignment can uniformly sample (with replacement) k "placeholders" per line from the N, which deals with class imbalance among emojis (or any unknowns, for that matter) and gives the network a richer notion of what and how to copy. Then put the same placeholder in the target where it appears in the training data. Plain `<unk>` is an instance of this case with N=1, the control symbol being `<unk>`. There are then potentially two sources of truth for OOV replacement: alignment, and which "placeholder" maps to which "placeholder".
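A sketch of the placeholder assignment described above; the `<ph_i>` symbol names and the helper function are hypothetical, invented for illustration:

```python
import random

N = 8  # number of reserved control symbols; <unk> is the special case N = 1
PLACEHOLDERS = [f"<ph_{i}>" for i in range(N)]

def assign_placeholders(oov_pieces, rng=random):
    # Uniformly sample (with replacement) a placeholder for each OOV surface
    # piece in a line; spreading assignments across symbols counters class
    # imbalance among emojis / unknowns.
    return {piece: rng.choice(PLACEHOLDERS) for piece in oov_pieces}

mapping = assign_placeholders(["😘", "🍻"])
```

During training, the sampled placeholder would be written at both the source position and its aligned target position, so the network learns a pure copy behaviour for those symbols.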
Implementation Sketch

1. Extend `decodeWithByteRanges` (as in A), probably in `ResponseBuilder`: construct Alignments, then construct the translated text using the corresponding source sentence lines and `HardAlignment`s.
2. Keep `<unk>` in the training data, going easy on the cleaning?
3. For `<unk>` detection I need SentencePiece, which brings in libmarian, so this is not standalone?
4. Reserve N symbols in SentencePiece training for OOV placeholders indicating copy? Would it be apt to generalize this further and just reserve some N symbols in all training here, which get assigned meaning later (WMT de-en used a bunch of misc tags standard in the vocabulary)?

@kpu I will go ahead and try to implement 1 and 2, bringing them into the respective repositories by Monday, so there's something concrete to take forward. I'll wait for inputs/comments on 3 and 4.
This is the en -> de student model.
```py
config = {
    "models": [os.path.join(BERGAMOT_ARCHIVE, "model.intgemm.alphas.bin")],
    "shortlist": [os.path.join(BERGAMOT_ARCHIVE, "lex.s2t.bin"), True, 50, 50],
    "vocabs": [
        os.path.join(BERGAMOT_ARCHIVE, "vocab.deen.spm"),
        os.path.join(BERGAMOT_ARCHIVE, "vocab.deen.spm"),
    ],
    "ssplit-prefix-file": os.path.join(BERGAMOT_ARCHIVE, "nonbreaking_prefix.en"),
    "max-length-break": 128,
    "mini-batch-words": 1024,
    "workspace": 128,
    "skip-cost": True,
    "cpu-threads": 40,
    "quiet": True,
    "quiet-translation": True,
    "gemm-precision": "int8shiftAlphaAll",
    "alignment": True,
    "allow-unk": True,
    "log": "unk-analyis.log",
    "log-level": "debug",
}
```
What unknowns in the source map to in the target, analyzed across the MTNT dataset on the `en` source data in `train.en-fr.tsv`. Top 50 occurrences in target:
Conclusions: include `<unk>` (
on source and target side) in training. This however is compute-heavy and all models will need retraining and updating. 😘 😂 ” 🤣 😋 ’ “ ^ > 🍻 👌🏻 😍 👍 😊 😃 Ó ¯ ツ ~~ 💣💣💣 } 😓 😁 🕍🐣 👌 😂👌 Á 😉 😍😍😍 😂😂 😍💋💋 🔥🔥 ❤❤❤ ~ 🎶 👋 ༽つ ༼ つ 🧠 🙋🏻 ❤️ 😞 💖 £ 💪👍 👎 💋 ´ 🌚 😎 🎵 👿 🌊 ‘ 💰 ⭐️ 😘😘 👍🏽 ś ❤️💙 🎈 📈 📞 ͡ ͜ʖ 🙄 👀 🙋🏻♂️ 😭😭😭😂💔 ️ ♥️ 😑 — 😭 🖕 💕 🔥 🤦🏾♀️ 😅 🐰🐰🐰 😀 😳 😄 😰 ✠ ✓✓ 😎😎 ^^^ 🇧🇷 č 😂😂🤣🤣 🅱 😏 🤤 🤔 🌈🌈🌈 🍝 😬 🅱🅱 😊😉 É 🤓 💩 💀🎺 ⚾️ 💜 🗿 🔥🔥🔥 Λ Ξ ★ ` 😭🙃 😍😍 🙂 † 🇺🇸 😢 😤😤😤😫😫 😳😳😳 💔💔 ∞ 🙌 💋💋 ÷ 😲😲 😜 😷 🤗 😭😩 😂😭 😛 >> 💌 😭💜 ⚠️ ❤ 😔 🍖 ⚽️ 😭😭 🌀 Ñ 😪 ☦☦☦ 😘🍆💦 😂😂😂😂😂😂 🅱️ 😂😂😂 🧐 ¿ 😂😁 💪 ✨ „ ê 🤔🤔 😟😟 😕 ☠ 😐 😩 ✍🏾 🌸 🙄🙄 ➕ ⃣ 👌👌👌 ^^^^ 😒 ñ 🚨 🧚🏼♀️💗 🍆 🙃 – 😍🧜🏻♀️ >>> 😋🐰🌿 😈💦 😣 🔥😍😜 🤮🤮🤮🤮🤮🤮🤮 ಠ 🍔 😂😂😂😂😂😂🤣🤣🤣🤣🤣🤣 😌 😆 😂” « » 👌🏼 👌🙄 كيس × 🇱🇷 🖤 🤷 😝😈💧 😫💦 ❓ 💹 😍🍬 😫 😤 🚊 ♥ 😂😂😂😂 ^^ 😆🤣 💋💋💋 🤔🤔🤔 😶 🎵“ ”🎵 🍵💕 👌👌 🤪🤪 😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😩 👌😎 🔆 }} 😡 ł 😂💯 😂😂🤣🤣🤣😂💯 😁😁😁 😇 🚩🚩🚩🚩🚩 ï ♡ 😩😩 👍🏻😊 😋😋 🖕🏿 ͛ ⬁ 🤐🤐 🤔😳 💙🤓 ♬ 😈😤 ¢ 🤰🐽🍣 >” 🎁 🦄 🤷♀️ 🙁 `~ 🔊 🤢 皮 牛革 😂🤣 ”” 🤔😍 😩😩😩😩 😩😩😩 🕵️♂️ ~~~~~ 🐸 ⚾ 💥💥💥💥 😝🤮 ^^^^^ 👌🏽 ^^^^^^^^ 🍴 😂🔥 >` 😂🖕 🙈 🎩 【ブレフロ】【 】 【 😅😅😅 💚 😊🤘☠️ 😉😈😈 ✔️ ð 😯 θ ≈ ř 終劇 ✊ ∫ π ≠ ⏭️ ⏮️😜😝😜 💙 `` 🍹🍸🥃🍷🥂🍻🍺 🍺🍻🥂🍷🥃🍸🍹 ☹ 🙄😂 ☀️ ✅ ♪ ♫ 🙆🏽♂️🙅🏽♂️🤷🏽♂️😎 ― ⚡ ç “😂” ☺ 🙏🏼 ☹️ 🤷🏻♀️ 😈 😴 😘😊 😂😂😥 🚯🚫👌💰😄👌🎠 👌👌👌👌👌 ø ✔ 😎🐃✌🏾 ☺️ Õ Áìú 🕟 🕑 🤣🤣 😅😐😞 ”— 김세연 세 새 게 개 ㅔ ㅐ ‽ Ż 🤣🤣🤣 ``` ú 🙄🙄🙄 Θ ’” Ω 😭😘 💕💕💕 😔🤦🏾♂️ 🤷🏼♀️ ️♀️ 😅😂 ♥♥ î ô œ Ê
On the larger MTNT monolingual en data
@jelmervdl Can you take a look, when you have some time, at this issue and https://gist.github.com/jerinphilip/439ba3b25cdd0d8727b0c80956340024? That was a crude attempt to check whether an `<unk>` can be replaced by a single token in the target text by finding where it maps. I believe your insights from the HTML tag-transfer work would be of great value here.
The question I'm trying to put forth is: with a refined HTML tag-transfer API, this problem should be the same as "insert pseudo-tags around an emoji, find the matching range in the target, and copy the contents over from the source text". Are there existing functions/primitives that can be used here? If so, could you point to them? Specifically, is there a library function that, given source, target, and the alignment matrix, aggregates tokens and provides a max-overlap span, or something similar? If you look at the naive replacement gist, what's happening is that multiple (punctuation) tokens and text are getting mapped to it.
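I am not aware of an existing library function for this, but the max-overlap-span primitive asked about can be sketched in a few lines: given hard alignment pairs and a source token range, return the smallest contiguous target span covering everything aligned into that range (names are hypothetical):

```python
def target_span(alignment_pairs, src_begin, src_end):
    # alignment_pairs: (source index, target index) hard alignment links.
    tgts = [t for s, t in alignment_pairs if src_begin <= s < src_end]
    if not tgts:
        return None  # nothing aligned into the source range
    return (min(tgts), max(tgts) + 1)  # half-open target token span

pairs = [(0, 0), (1, 2), (1, 3), (2, 1)]
print(target_span(pairs, 1, 2))  # (2, 4): source token 1 maps to targets 2..3
```

A real implementation would additionally have to decide what to do with non-contiguous alignments, which is where the punctuation-token noise seen in the gist comes in.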
Wikipedia has names of languages in their own language on the left navbar https://en.wikipedia.org/wiki/Machine_translation like these: العربية Español हिन्दी Bahasa Indonesia Bahasa Melayu Português Русский اردو 中文
The problem is that the German model, say, has no clue what to do with Arabic text input, so it translates it as "-".
More generally, we could treat OOVs like tags/emojis and pass them through on an alignment basis. SentencePiece does tell us about OOVs.
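A self-contained sketch of that detection step; a plain vocabulary set stands in for a real SentencePiece model here, where one would instead compare encoded ids against the model's unknown id:

```python
# Toy stand-in vocabulary; with SentencePiece, pieces the model cannot
# represent come back as the unknown id instead of membership failing.
VOCAB = {"▁Hello", "▁Bahasa", "▁Indonesia"}

def oov_indices(pieces):
    # Token positions the model cannot represent: candidates for tag-style
    # pass-through via alignments rather than translation.
    return [i for i, p in enumerate(pieces) if p not in VOCAB]

print(oov_indices(["▁Hello", "▁العربية", "▁Bahasa"]))  # [1]
```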
There is the more general problem of text that is in the vocabulary but in the wrong language or doesn't make sense.