kpu opened this issue 3 years ago
Aren't high-OOV-rate pages a tiny fraction of the pages intended for the target audience, or even of the web? Moreover, isn't this best handled by the browser's language detection? The browser sends us fragments of text where a user has configured translation from language x to language y, so our inputs are (almost) guaranteed to be in language x. Do we need to complicate bergamot-translator for this?
In general there can be an OOV like this in the middle of a sentence, and langid would be as correct as it can be. If there is an OOV token in the input, we should seek to copy its contents to the output.
All this repository does is add a layer on top of raw Marian with sentence splitting. This is the right place to do it.
Reading this as: simply implement "replace an unknown in the target with the max-matching piece from the source text, using alignments". Still unsure where the "only if high OOV rate" switch comes from. Is it supplied from outside? Do we compute it with SentencePiece and switch internally? Can't we just do this for all OOVs?
If there's an unknown in the source, chances are there's an unknown in the target, and the alignments between source and target unknowns are bijective (is this a correct assumption? There could be corner cases).
So: build the translated text and alignments together, replace the existing decoded string with a new decoded string, and update the decoded string's ByteRanges accordingly. In other words, inject a transform somewhere in the code, where `transform(vanillaDecoded, vanillaDecodedByteRanges, alignments) = (decoded, decodedByteRanges)`.
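A minimal Python sketch of what such a transform could do, assuming character-offset ByteRanges and a target-to-source hard alignment map (all names here are illustrative, not bergamot-translator's actual C++ API):

```python
# Illustrative sketch only; bergamot-translator's real API differs.
# decoded_ranges / source_ranges are (begin, end) character offsets per token;
# alignment maps a target token index to its aligned source token index.
def transform(decoded, decoded_ranges, source, source_ranges, alignment):
    out, new_ranges, cursor = [], [], 0
    for t, (begin, end) in enumerate(decoded_ranges):
        piece = decoded[begin:end]
        if piece.strip() == "<unk>" and t in alignment:
            s_begin, s_end = source_ranges[alignment[t]]
            # Copy the raw source surface over the target <unk>.
            piece = piece.replace("<unk>", source[s_begin:s_end])
        out.append(piece)
        new_ranges.append((cursor, cursor + len(piece)))
        cursor += len(piece)
    return "".join(out), new_ranges

text, ranges = transform(
    "Salut <unk> là", [(0, 6), (6, 12), (12, 14)],
    "Hi FOO there", [(0, 3), (3, 6), (6, 12)],
    {1: 1},
)
print(text)  # Salut FOO là
```

Note the ByteRanges are recomputed as the string is rebuilt, which is exactly the bookkeeping the proposal above asks for.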
From the following output I'm unsure where to begin; any pointers from Marian experts? Are the '.' tokens `<unk>`? Here's a sample output from feeding OOVs to the de-en model, with alignment information printed:

A. A replace-`<unk>`-from-source capability, irrespective of the target being `<unk>` or not.
`encodeWithByteRanges` has access to the surface text, which means it can distinguish between different `<unk>`s in a source line. For a first implementation, we can just keep the existing handling of `<unk>`.

`decodeWithByteRanges` is perhaps the best place to put this. Alignments are available after beam-search decoding, where the data is still `marian::Words`. The decoded words will eventually contain `<unk>` once we sort out the data cleaning. The words are supplied to `decodeWithByteRanges`, which we will extend to take two (optional) additional arguments: source raw string-views corresponding to the words, and the `HardAlignment` extracted after beam search. With these, we can use the `HardAlignment` to i) resolve which source `<unk>` aligns to a decoded unit, and ii) replace that unit (`<unk>` or not) in the decoded surface with the unnormalized raw surface from the source text. Implementation-wise, we are simply accepting source text and alignments in this function and using them internally to accomplish the desideratum.

The above should work even when the NMT systems are not trained with many `<unk>`s in the training data (hopefully, to some extent), which means it should work when punctuation is output instead of `<unk>`, as in the earlier examples in this thread. Errors will occur when the decoder LM fills in the blank for some `<unk>` with a true translation learned from the surrounding context, which we would then overwrite with (probably matching) source text. I think this is a reasonable trade-off and the first part of solving this.
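To make the `HardAlignment` step concrete: one common way to obtain hard alignments is an argmax over each target token's soft attention row. A toy sketch (an assumption for illustration, not Marian's actual extraction code):

```python
# soft[t][s] is the attention weight of target token t on source token s.
def hard_align(soft):
    # One source position per target token: the argmax of its attention row.
    return [max(range(len(row)), key=row.__getitem__) for row in soft]

soft = [
    [0.7, 0.2, 0.1],  # target token 0 attends mostly to source token 0
    [0.1, 0.8, 0.1],  # target token 1 (e.g. an <unk>) -> source token 1
    [0.2, 0.1, 0.7],  # target token 2 -> source token 2
]
print(hard_align(soft))  # [0, 1, 2]
```

With a per-target-token source index, step i) above (resolving which source `<unk>` a decoded unit aligns to) becomes a direct lookup.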
As far as I understand, the `spm_encode` used in student training with guided alignments is the text-based one, which means `fast_align` is run on the corpus with the raw-text representation of what Marian would see as `<unk>`. Alignments are therefore already learnt the way we want (distinguishing between different `<unk>`s through alignment, assuming `fast_align` works). Hence `fast_align` and guided alignment already prepare the network to learn the copy-task for `<unk>` (if the data is not super clean).
B. Emojis

There was some discussion around solving this as an emoji pass-through problem, and around distinguishing multiple emojis, dealing with class imbalance, etc. I think we can introduce a placeholder mechanism for a copy-task, extending A but still contained in the same two functions. Let there be N control symbols denoting "placeholders" marked for copying. We then only need a mechanism within `encodeWithByteRanges` which assigns different placeholders to different surface text. The assignment can uniformly sample (with replacement) k "placeholders" per line from the N, which deals with class imbalance among emojis (or any unknowns, for that matter) and gives the network a richer notion of what and how to copy. Then put the same placeholder in the target where it appears in the training data. Plain `<unk>` is an instance of this case with N=1, the control symbol being `<unk>`. There are then potentially two sources of truth for OOV replacement: alignment, and which "placeholder" maps to which "placeholder".
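A sketch of the placeholder assignment described above; the `<ph_i>` symbol names and the helper function are hypothetical, invented for illustration:

```python
import random

N = 8  # number of reserved control symbols; <unk> is the special case N = 1
PLACEHOLDERS = [f"<ph_{i}>" for i in range(N)]

def assign_placeholders(oov_pieces, rng=random):
    # Uniformly sample (with replacement) a placeholder for each OOV surface
    # piece in a line; spreading assignments across symbols counters class
    # imbalance among emojis / unknowns.
    return {piece: rng.choice(PLACEHOLDERS) for piece in oov_pieces}

mapping = assign_placeholders(["😘", "🍻"])
```

During training, the sampled placeholder would be written at both the source position and its aligned target position, so the network learns a pure copy behaviour for those symbols.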
Implementation Sketch

1. Extend `decodeWithByteRanges` (as in A), probably in `ResponseBuilder`: construct Alignments, then construct the translated text using the corresponding source sentence lines and `HardAlignment`s.
2. Keep `<unk>` in the training data, going easy on the cleaning?
3. For `<unk>` detection I need SentencePiece, which brings in libmarian, so this is not standalone?
4. Reserve N symbols in SentencePiece training for OOV placeholders indicating copy? Would it be apt to generalize this further and just reserve some N symbols in all training here, which get assigned meaning later (WMT de-en used a bunch of misc tags standard in the vocabulary)?

@kpu I will go ahead and try to implement 1 and 2, bringing them into the respective repositories by Monday, so there's something concrete to take forward. I'll wait for inputs/comments on 3 and 4.
This is the en -> de student model.
```py
config = {
    "models": [os.path.join(BERGAMOT_ARCHIVE, "model.intgemm.alphas.bin")],
    "shortlist": [os.path.join(BERGAMOT_ARCHIVE, "lex.s2t.bin"), True, 50, 50],
    "vocabs": [
        os.path.join(BERGAMOT_ARCHIVE, "vocab.deen.spm"),
        os.path.join(BERGAMOT_ARCHIVE, "vocab.deen.spm"),
    ],
    "ssplit-prefix-file": os.path.join(BERGAMOT_ARCHIVE, "nonbreaking_prefix.en"),
    "max-length-break": 128,
    "mini-batch-words": 1024,
    "workspace": 128,
    "skip-cost": True,
    "cpu-threads": 40,
    "quiet": True,
    "quiet-translation": True,
    "gemm-precision": "int8shiftAlphaAll",
    "alignment": True,
    "allow-unk": True,
    "log": "unk-analyis.log",
    "log-level": "debug",
}
```
What unknowns in the source map to in the target, analyzed across the MTNT dataset on the `en` source data in `train.en-fr.tsv`. Top 50 occurrences in target:
Conclusions: include `<unk>` (
on source and target side) in training. This however is compute-heavy and all models will need retraining and updating. 😘 😂 ” 🤣 😋 ’ “ ^ > 🍻 👌🏻 😍 👍 😊 😃 Ó ¯ ツ ~~ 💣💣💣 } 😓 😁 🕍🐣 👌 😂👌 Á 😉 😍😍😍 😂😂 😍💋💋 🔥🔥 ❤❤❤ ~ 🎶 👋 ༽つ ༼ つ 🧠 🙋🏻 ❤️ 😞 💖 £ 💪👍 👎 💋 ´ 🌚 😎 🎵 👿 🌊 ‘ 💰 ⭐️ 😘😘 👍🏽 ś ❤️💙 🎈 📈 📞 ͡ ͜ʖ 🙄 👀 🙋🏻♂️ 😭😭😭😂💔 ️ ♥️ 😑 — 😭 🖕 💕 🔥 🤦🏾♀️ 😅 🐰🐰🐰 😀 😳 😄 😰 ✠ ✓✓ 😎😎 ^^^ 🇧🇷 č 😂😂🤣🤣 🅱 😏 🤤 🤔 🌈🌈🌈 🍝 😬 🅱🅱 😊😉 É 🤓 💩 💀🎺 ⚾️ 💜 🗿 🔥🔥🔥 Λ Ξ ★ ` 😭🙃 😍😍 🙂 † 🇺🇸 😢 😤😤😤😫😫 😳😳😳 💔💔 ∞ 🙌 💋💋 ÷ 😲😲 😜 😷 🤗 😭😩 😂😭 😛 >> 💌 😭💜 ⚠️ ❤ 😔 🍖 ⚽️ 😭😭 🌀 Ñ 😪 ☦☦☦ 😘🍆💦 😂😂😂😂😂😂 🅱️ 😂😂😂 🧐 ¿ 😂😁 💪 ✨ „ ê 🤔🤔 😟😟 😕 ☠ 😐 😩 ✍🏾 🌸 🙄🙄 ➕ ⃣ 👌👌👌 ^^^^ 😒 ñ 🚨 🧚🏼♀️💗 🍆 🙃 – 😍🧜🏻♀️ >>> 😋🐰🌿 😈💦 😣 🔥😍😜 🤮🤮🤮🤮🤮🤮🤮 ಠ 🍔 😂😂😂😂😂😂🤣🤣🤣🤣🤣🤣 😌 😆 😂” « » 👌🏼 👌🙄 كيس × 🇱🇷 🖤 🤷 😝😈💧 😫💦 ❓ 💹 😍🍬 😫 😤 🚊 ♥ 😂😂😂😂 ^^ 😆🤣 💋💋💋 🤔🤔🤔 😶 🎵“ ”🎵 🍵💕 👌👌 🤪🤪 😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😩 👌😎 🔆 }} 😡 ł 😂💯 😂😂🤣🤣🤣😂💯 😁😁😁 😇 🚩🚩🚩🚩🚩 ï ♡ 😩😩 👍🏻😊 😋😋 🖕🏿 ͛ ⬁ 🤐🤐 🤔😳 💙🤓 ♬ 😈😤 ¢ 🤰🐽🍣 >” 🎁 🦄 🤷♀️ 🙁 `~ 🔊 🤢 皮 牛革 😂🤣 ”” 🤔😍 😩😩😩😩 😩😩😩 🕵️♂️ ~~~~~ 🐸 ⚾ 💥💥💥💥 😝🤮 ^^^^^ 👌🏽 ^^^^^^^^ 🍴 😂🔥 >` 😂🖕 🙈 🎩 【ブレフロ】【 】 【 😅😅😅 💚 😊🤘☠️ 😉😈😈 ✔️ ð 😯 θ ≈ ř 終劇 ✊ ∫ π ≠ ⏭️ ⏮️😜😝😜 💙 `` 🍹🍸🥃🍷🥂🍻🍺 🍺🍻🥂🍷🥃🍸🍹 ☹ 🙄😂 ☀️ ✅ ♪ ♫ 🙆🏽♂️🙅🏽♂️🤷🏽♂️😎 ― ⚡ ç “😂” ☺ 🙏🏼 ☹️ 🤷🏻♀️ 😈 😴 😘😊 😂😂😥 🚯🚫👌💰😄👌🎠 👌👌👌👌👌 ø ✔ 😎🐃✌🏾 ☺️ Õ Áìú 🕟 🕑 🤣🤣 😅😐😞 ”— 김세연 세 새 게 개 ㅔ ㅐ ‽ Ż 🤣🤣🤣 ``` ú 🙄🙄🙄 Θ ’” Ω 😭😘 💕💕💕 😔🤦🏾♂️ 🤷🏼♀️ ️♀️ 😅😂 ♥♥ î ô œ Ê
On the larger MTNT monolingual en data
@jelmervdl Can you take a look, when you have some time, at this issue and https://gist.github.com/jerinphilip/439ba3b25cdd0d8727b0c80956340024? That was a crude attempt to check whether an `<unk>` can be replaced by a single token in the target text by finding where it maps. I believe your insights from the HTML tag-transfer work would be of great value here.
The question I'm trying to put forth is: with a refined HTML tag-transfer API, this problem should be the same as "insert pseudo-tags around an emoji, find the matching range in the target, and copy the contents over from the source text". Are there existing functions/primitives that can be used here? If so, could you point to them? Specifically, is there a library function that, given source, target, and the alignment matrix, aggregates tokens and provides a max-overlap span, or something similar? If you look at the naive replacement gist, what's happening is that multiple (punctuation) tokens and text are getting mapped to it.
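I am not aware of an existing library function for this, but the max-overlap-span primitive asked about can be sketched in a few lines: given hard alignment pairs and a source token range, return the smallest contiguous target span covering everything aligned into that range (names are hypothetical):

```python
def target_span(alignment_pairs, src_begin, src_end):
    # alignment_pairs: (source index, target index) hard alignment links.
    tgts = [t for s, t in alignment_pairs if src_begin <= s < src_end]
    if not tgts:
        return None  # nothing aligned into the source range
    return (min(tgts), max(tgts) + 1)  # half-open target token span

pairs = [(0, 0), (1, 2), (1, 3), (2, 1)]
print(target_span(pairs, 1, 2))  # (2, 4): source token 1 maps to targets 2..3
```

A real implementation would additionally have to decide what to do with non-contiguous alignments, which is where the punctuation-token noise seen in the gist comes in.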
Wikipedia has names of languages in their own language on the left navbar https://en.wikipedia.org/wiki/Machine_translation like these: العربية Español हिन्दी Bahasa Indonesia Bahasa Melayu Português Русский اردو 中文
The problem is that the German model, say, has no clue what to do with Arabic text input, so it translates it as "-".
More generally, we could treat OOVs like tags/emojis and pass them through on an alignment basis. SentencePiece does tell us about OOVs.
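A self-contained sketch of that detection step; a plain vocabulary set stands in for a real SentencePiece model here, where one would instead compare encoded ids against the model's unknown id:

```python
# Toy stand-in vocabulary; with SentencePiece, pieces the model cannot
# represent come back as the unknown id instead of membership failing.
VOCAB = {"▁Hello", "▁Bahasa", "▁Indonesia"}

def oov_indices(pieces):
    # Token positions the model cannot represent: candidates for tag-style
    # pass-through via alignments rather than translation.
    return [i for i, p in enumerate(pieces) if p not in VOCAB]

print(oov_indices(["▁Hello", "▁العربية", "▁Bahasa"]))  # [1]
```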
There is the more general problem of text that is in the vocabulary but in the wrong language or doesn't make sense.