browsermt / bergamot-translator

Cross-platform C++ library focused on optimized machine translation on consumer-grade devices.
http://browser.mt
Mozilla Public License 2.0

Passthroughs if OOV rate is very high #185

Open kpu opened 3 years ago

kpu commented 3 years ago

Wikipedia has names of languages in their own language on the left navbar https://en.wikipedia.org/wiki/Machine_translation like these: العربية Español हिन्दी Bahasa Indonesia Bahasa Melayu Português Русский اردو 中文

The problem is that, say, the German model has no clue what to do with Arabic text input, so it translates it as "-".

[image]

More generally, we could treat OOVs like tags/emojis and pass them through on an alignment basis. SentencePiece does tell us about OOVs.

There is the more general problem of text that is in the vocabulary but wrong language / doesn't make sense.

jerinphilip commented 3 years ago

Aren't high-OOV-rate pages a tiny fraction of the pages intended for the target audience, or even of the web? Moreover, isn't this best handled by the browser's language detection? The browser sends us fragments of text where a user has configured translation from language x to language y, so our inputs are (almost) guaranteed to be in language x. Do we need to complicate bergamot-translator for this?

kpu commented 3 years ago

In general, there can be an OOV like this in the middle of a sentence, where langid would be as correct as it can be. If there is an OOV token in the input, we should seek to copy its contents to the output.

All this repository does is add a layer on top of raw Marian with sentence splitting. This is the right place to do it.

jerinphilip commented 3 years ago

I'm reading this as: simply implement "replace an unknown in the target with the max-matching piece (or something) from the source text, using alignments". I'm still unsure where the "only if the OOV rate is high" switch would come from. Is it supplied from outside? Do we compute it with SentencePiece and switch internally? Can't we just do this for all OOVs?

If there's an unknown in the source, chances are there's an unknown in the target, and the alignments between source and target unknowns are bijective (is this a correct assumption? There should be corner cases?).

So: build the translated text and alignments together, replace the existing decoded string with a new decoded string, and update the decoded string's ByteRanges accordingly.

That is, inject a transform somewhere in the code, where transform(vanillaDecoded, vanillaDecodedByteRanges, alignments) = (decoded, decodedByteRanges).
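A minimal sketch of what such a transform could look like, in Python for illustration. Everything here is hypothetical (the names, hard alignments as (source index, target index, probability) triples, pieces carrying their own whitespace); it is not the bergamot-translator API:

```py
def transform(target_pieces, source_surfaces, source_unk_indices, alignments):
    """Copy raw source surfaces over the target pieces they align to,
    then rebuild the decoded string and its ByteRanges.

    alignments: (src_idx, tgt_idx, prob) hard-alignment triples (assumed).
    """
    # For each source <unk>, pick the target position it aligns to most strongly.
    best = {}
    for s, t, p in alignments:
        if s in source_unk_indices and p > best.get(s, (-1, 0.0))[1]:
            best[s] = (t, p)

    # Replace the aligned target piece with the unnormalized source surface.
    pieces = list(target_pieces)
    for s, (t, _) in best.items():
        pieces[t] = source_surfaces[s]

    # Rebuild the decoded string and ranges (character offsets here for
    # simplicity; the real ByteRanges are byte offsets).
    decoded, byte_ranges, pos = "", [], 0
    for piece in pieces:
        byte_ranges.append((pos, pos + len(piece)))
        decoded += piece
        pos += len(piece)
    return decoded, byte_ranges
```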

From the following output I'm unsure where to begin; any pointers from Marian experts? Are the '.' tokens <unk>s? Here's a sample output from feeding OOV text to the de-en model, with alignment information printed:

Hindi fed through en->de ``` [original]: सोशल मीडिया पर अक्सर ऐसे वीडियोज देखने को मिल जाते हैं, जिन्हें देखने के बाद हमें अपनी आखों पर ही विश्वास नहीं होता और कुछ वीडियोज तो ऐसे भी होते हैं, जिन्हें देखकर हम हैरान रह जाते हैं. इन दिनों सोशल मीडिया पर एक ऐसा ही वीडियो वायरल हो रहा है, जिसे देखकर आप हैरान रह जाएंगे. क्या आपने कभी ऐसा सुना या देखा है कि किसी कुत्ते या किसी भी जानवर को चोट लगी हो और वह खुद ही अपना इलाज कराने डॉक्टर के पास पहुंच गया हो ? आप सोच रहे होंगे कि भला की कुत्ता खुद डॉक्टर के पास इलाज के लिए कैसे जा सकता है. लेकिन, अब सोशल मीडिया पर जो वीडियो वायरल हो रहा है, इसमें ऐसा ही कुछ देखने को मिला, जिसे देखने के बाद हर कोई हैरान है. [translated]: " - . . . . . . . . . . . . . . . - - - . . . . . . . . . . . . . . . . "-", . . . . . . . . . . . . . . . . . . - - - - - - - - - - - . . . . . . . . . . . . . - - . . . . . . . ., . . . . . . . . . " - - . . . . . [src Sentence]: सोशल मीडिया पर अक्सर ऐसे वीडियोज देखने को मिल जाते हैं, जिन्हें देखने के बाद हमें अपनी आखों पर ही विश्वास नहीं होता [tgt Sentence]: " - . . . . . . . . . . . . . . Alignments : "(0.956004) सोशल: -(0.583375) : -(0.228428) (0.342378) मीडिया: (0.215835) .(0.353554) : (0.271081) (0.397723) पर: .(0.37035) .(0.286085) : (0.237772) (0.22907) अक्सर: .(0.214375) : (0.366927) (0.20181) ऐसे: .(0.211806) : (0.291449) वीडियोज: .(0.213809) .(0.272131) : (0.244371) (0.326145) देखने: .(0.234805) .(0.222403) : (0.263701) (0.307984) को: : (0.235474) (0.230634) मिल: .(0.282827) .(0.298355) : (0.326475) (0.268782) जाते: .(0.217554) .(0.287406) .(0.239472) : (0.251882) (0.47677) (0.560619) (0.591466) (0.38006) हैं: .(0.248821) .(0.34244) .(0.376725) ,: .(0.237691) (0.323229) .(0.29143) (0.244308) : जिन्हें: : (0.287278) (0.326648) देखने: .(0.248956) : के: .(0.38824) .(0.486206) .(0.238963) : बाद: .(0.228308) : (0.218769) हमें: .(0.259913) : अपनी: : आखों: : पर: : ही: : विश्वास: : नहीं: : होता: : Quality: whole(630.287), tokens below: "(16.4854) -(14.8093) (14.9548) .(20.146) (17.9222) .(22.6871) (18.5127) .(22.6934) (19.2381) .(22.9869) (19.7148) .(22.6242) (19.8084) .(22.2846) (19.5233) .(22.6654) (19.5366) .(22.8989) (18.8636) .(22.395) (19.1529) .(23.0785) (17.9962) .(23.3846) (18.0284) .(23.1868) (19.1584) .(23.0204) (19.5491) .(23.3447) (19.6364) [src Sentence]: और कुछ वीडियोज तो ऐसे भी होते हैं, जिन्हें देखकर हम हैरान रह जाते हैं. इन दिनों सोशल मीडिया पर एक ऐसा ही वीडियो [tgt Sentence]: . - - - . . . . . . . . . . . . . . . . 
Alignments : (0.818588) और: .(0.357205) : -(0.442216) -(0.2517) कुछ: .(0.50503) : -(0.21589) -(0.31923) वीडियोज: : -(0.426235) (0.49468) (0.208258) तो: .(0.349761) : (0.245715) (0.234698) ऐसे: .(0.329014) .(0.263438) : (0.279558) (0.350182) भी: .(0.248584) .(0.226412) : (0.263694) (0.303624) होते: .(0.240984) .(0.277225) : (0.32832) (0.315904) हैं: .(0.305526) .(0.356893) ,: (0.32313) .(0.213539) (0.308022) .(0.216299) : जिन्हें: : (0.295019) (0.202798) देखकर: : (0.283703) (0.404502) (0.302214) हम: .(0.216913) : हैरान: .(0.206387) : (0.204113) (0.272782) रह: .(0.207866) : जाते: : हैं: .(0.22433) .(0.246221) .: .(0.314097) .(0.262107) : (0.279113) (0.456464) (0.332618) इन: : (0.213202) (0.225647) दिनों: .(0.251294) .(0.326989) .(0.255369) .(0.206421) : (0.24916) (0.254673) (0.21884) सोशल: .(0.276874) .(0.301296) .(0.302771) .(0.201276) : मीडिया: .(0.261903) .(0.342812) : (0.206704) पर: : एक: : ऐसा: : ही: : वीडियो: : Quality: whole(787.333), tokens below: (16.1394) .(20.1153) -(16.6228) -(15.2335) -(15.2523) (15.3559) .(21.3285) (18.3464) .(23.2595) (18.4764) .(23.516) (18.7783) .(24.0205) (17.8907) .(23.9786) (17.6468) .(24.4881) (18.5288) .(23.7913) (19.1997) .(23.1533) (19.8503) .(23.3992) (19.6372) .(24.0604) (19.4911) .(25.7733) (19.1718) .(25.4937) (18.9104) .(24.5971) (19.6162) .(24.7096) (19.2933) .(24.2421) (19.5093) .(24.3504) (20.1053) [src Sentence]: वायरल हो रहा है, जिसे देखकर आप हैरान रह जाएंगे. क्या आपने कभी ऐसा सुना या देखा है कि किसी कुत्ते या किसी भी जानवर [tgt Sentence]: "-", . . . . . . . . . . . . . . . . . Alignments : "(0.652248) वायरल: "(0.336628) -(0.43808) : -(0.371416) "(0.40017) हो: "(0.434341) : ,(0.384477) रहा: : ,(0.266007) है: ,: : (0.322394) जिसे: (0.221961) .(0.305329) .(0.254102) : (0.387028) (0.423057) (0.212407) देखकर: .(0.218747) .(0.217672) : (0.304233) (0.303549) आप: .(0.218633) .(0.240343) : (0.298503) (0.325131) हैरान: .(0.214112) : (0.235449) रह: : जाएंगे: .(0.244301) .(0.313342) .: .(0.223318) .(0.214501) : (0.436911) (0.546866) (0.531091) (0.30733) क्या: : आपने: .(0.22263) .(0.372788) .(0.34158) .(0.228975) : (0.315708) (0.439662) (0.468507) (0.26392) कभी: .(0.283864) .(0.312371) .(0.221311) : (0.24852) ऐसा: .(0.234662) .(0.350711) .(0.295482) : (0.259843) (0.339373) (0.213037) सुना: .(0.218174) : (0.280702) (0.248068) या: .(0.305921) .(0.346372) .(0.298072) : (0.22789) (0.241198) (0.220623) देखा: .(0.20034) .(0.262666) .(0.241539) : (0.204948) (0.213071) है: .(0.231243) .(0.246419) : (0.200413) (0.23977) कि: : किसी: : कुत्ते: : या: : किसी: : भी: : जानवर: : Quality: whole(815.831), tokens below: "(15.8473) -(15.0194) "(14.9181) ,(15.3991) (14.4125) .(20.5192) (18.1332) .(23.7773) (19.1696) .(23.7487) (19.8035) .(24.1198) (19.3437) .(24.3266) (18.9474) .(24.8438) (18.6672) .(25.1848) (19.4212) .(24.9913) (18.9076) .(23.9891) (19.5547) .(23.8871) (19.3716) .(23.5935) (19.8447) .(23.7333) (20.2047) .(23.6844) (20.4975) .(23.8046) (20.3857) .(24.0869) (20.1558) .(24.126) (20.6953) .(23.5984) (21.1165) [src Sentence]: को चोट लगी हो और वह खुद ही अपना इलाज कराने डॉक्टर के पास पहुंच गया हो ? आप सोच रहे होंगे कि भला की कुत्ता खुद [tgt Sentence]: . - - - - - - - - - - - . . . . . . . . . . . . 
Alignments : (0.692652) को: (0.283539) .(0.644318) : -(0.469442) -(0.272727) चोट: .(0.268899) : -(0.243767) -(0.426045) -(0.469727) लगी: : -(0.239669) हो: : -(0.315251) -(0.442719) -(0.449794) -(0.306429) और: : -(0.223187) -(0.240647) वह: : -(0.272017) -(0.378221) -(0.261925) खुद: : -(0.231067) ही: : -(0.278199) -(0.390146) -(0.380319) (0.256857) अपना: : (0.203615) इलाज: : (0.302454) (0.32013) कराने: : (0.211737) (0.235591) डॉक्टर: : के: .(0.208412) : (0.282954) (0.281797) (0.206605) पास: : (0.210017) (0.20882) पहुंच: .(0.212868) .(0.264379) .(0.240117) : (0.246689) (0.321604) (0.30002) (0.233927) गया: .(0.21623) .(0.282752) .(0.248528) : (0.276151) (0.312417) (0.231658) हो: .(0.214711) .(0.362975) .(0.478412) .(0.481597) .(0.259972) ?: : आप: : (0.233691) (0.230496) सोच: .(0.235191) : (0.259413) (0.254464) रहे: .(0.330843) .(0.432209) : होंगे: : कि: : भला: : की: : कुत्ता: : खुद: : Quality: whole(735.547), tokens below: (16.4182) .(20.4976) -(17.6592) -(15.3145) -(15.6185) -(15.6116) -(15.7283) -(16.1119) -(16.3419) -(15.7527) -(15.6933) -(15.4746) -(15.3921) (15.6502) .(20.7708) (19.1773) .(22.4069) (19.2745) .(22.3041) (19.3171) .(22.6934) (19.4807) .(22.6654) (19.9419) .(22.9492) (20.3549) .(23.0673) (20.1292) .(23.4202) (19.5708) .(23.7459) (18.8412) .(23.8752) (18.8049) .(23.2098) (18.8887) .(23.3874) (20.0054) [src Sentence]: डॉक्टर के पास इलाज के लिए कैसे जा सकता है. लेकिन, अब सोशल मीडिया पर जो वीडियो वायरल हो रहा है, इसमें ऐसा ही [tgt Sentence]: . - - . . . . . . . ., . . . . . . . . . Alignments : (0.783059) डॉक्टर: .(0.555438) : -(0.424855) -(0.229973) के: .(0.385884) -(0.304172) : -(0.308641) (0.218816) पास: .(0.204042) : (0.276151) (0.204691) इलाज: .(0.354434) : (0.226653) (0.4197) के: .(0.252946) : (0.259685) लिए: .(0.223496) .(0.298742) : (0.25554) (0.226203) कैसे: .(0.294862) : (0.455423) (0.350981) जा: : (0.260478) सकता: .(0.201588) : (0.200678) (0.244826) है: .(0.314572) .(0.436055) .(0.378573) .(0.26414) .: .(0.230587) .(0.290977) .(0.258668) : (0.316975) (0.545832) (0.576576) ,(0.454504) लेकिन: .(0.234922) ,: (0.200762) ,(0.338401) : (0.559449) अब: : (0.428416) (0.388812) सोशल: .(0.521143) .(0.490008) .(0.205166) : मीडिया: : (0.310617) (0.306523) पर: .(0.377871) .(0.46679) .(0.31784) : (0.232435) (0.268699) जो: .(0.251077) .(0.25011) : (0.280583) (0.324845) (0.303605) (0.209351) वीडियो: .(0.271323) .(0.274174) .(0.215008) : वायरल: .(0.236871) .(0.294938) .(0.244644) : (0.246203) (0.21972) हो: .(0.206106) : (0.275871) रहा: : है: ,: : इसमें: : ऐसा: : ही: : Quality: whole(822.332), tokens below: (16.6698) .(20.5038) -(17.0603) -(15.1664) (15.7788) .(21.3082) (18.6155) .(23.0372) (19.1312) .(23.2895) (19.0998) .(23.6159) (18.8656) .(24.3972) (17.8369) .(24.7697) (17.4309) .(25.2638) (17.5783) .(25.6286) ,(17.0016) (16.4469) .(21.346) (18.5714) .(23.2511) (19.243) .(23.1847) (19.7784) .(23.0456) (19.879) .(22.9492) (19.8035) .(22.6116) (20.0328) .(22.8723) (20.0649) .(23.2406) (20.032) .(23.3874) (20.5428) [src Sentence]: कुछ देखने को मिला, जिसे देखने के बाद हर कोई हैरान है. [tgt Sentence]: " - - . . . . . 
Alignments : "(0.512904) कुछ: "(0.461612) -(0.524698) : -(0.267748) देखने: -(0.209325) -(0.265829) : (0.291005) को: .(0.459104) .(0.251464) : (0.255328) (0.435136) (0.204319) मिला: .(0.340024) ,: (0.320138) (0.398973) (0.243207) : जिसे: : देखने: : के: : बाद: : हर: : कोई: : हैरान: : है: .: : (0.243823) (0.230377) Quality: whole(254.623), tokens below: "(15.8096) -(14.1649) -(14.1768) (14.1623) .(19.8099) (16.1799) .(22.7752) (16.0835) .(23.1637) (16.8683) .(23.2497) (17.8369) .(22.7032) (17.6397) -------------------------- ```
jerinphilip commented 3 years ago

A. A replace-from-source capability for <unk>, irrespective of whether the target is <unk> or not.

  1. encodeWithByteRanges has access to the surface text, which means it can distinguish between different <unk>s in a source line. For a first implementation, we can keep the existing handling of <unk>.
  2. decodeWithByteRanges is perhaps the best place to put this. Alignments are available after beam-search decoding, while the data is still marian::Words. The decoded words will eventually contain <unk> once we sort out the data cleaning. The words are supplied to decodeWithByteRanges, which we will extend with two (optional) arguments: source-raw string-views corresponding to the words, and the HardAlignment extracted after beam-search decoding. With these, we can use the HardAlignment to (i) resolve which source <unk> aligns to a decoded unit, and (ii) replace that unit (<unk> or not) in the decoded surface with the unnormalized raw surface from the source text. Implementation-wise, this function now additionally accepts source text and alignments and uses them internally to accomplish the desideratum.

The above should work (hopefully, to some extent) even when the NMT system was not trained with many <unk>s in the training data, which means it should also work when punctuation is output instead of <unk>, as in the earlier examples in this thread. Errors can happen when the decoder LM fills in the blanks for some <unk> with a true translation learned from the surrounding context, which we would then overwrite with probably-matching source text. I think this is a reasonable trade-off and the first part of solving this.
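To make the two alignment steps concrete, a small sketch (Python, hypothetical names): collapse the soft alignment matrix from beam search into a hard alignment by taking the argmax source position per decoded unit, then resolve which decoded units should copy a source <unk>'s surface:

```py
import numpy as np

def extract_hard_alignment(soft):
    """Argmax source position per target position.

    soft: (num_target, num_source) soft alignment matrix (assumed shape).
    """
    return {t: int(np.argmax(soft[t])) for t in range(soft.shape[0])}

def resolve_copies(hard, source_unk_indices):
    """Step (i): map each decoded position to the source <unk> it aligns
    to, if any. Step (ii) then overwrites those positions with the raw
    source surface, as in the transform sketched earlier in the thread."""
    return {t: s for t, s in hard.items() if s in source_unk_indices}
```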

As far as I understand, the spm_encode used in student training with guided alignments is the text-based one, which means fast_align is run on the corpus with the raw-text representation of what marian would see as <unk>. Alignments are therefore already learnt the way we want them (distinguishing between different <unk>s through alignment, assuming fast_align works). Hence fast_align and guided alignment already prepare the network to learn the copy-task for <unk> (provided the data is not super clean).

B. Emojis

There was some discussion around solving this as an emoji pass-through problem, and around distinguishing multiple emojis, dealing with class imbalance, etc. I think we can introduce a placeholder mechanism for a copy-task, extending A, but still contained in the two functions. Let there be N control symbols denoting "placeholders" marked for copying. We then only need a mechanism within encodeWithByteRanges that assigns different placeholders to different surface texts. The assignment could uniformly sample (with replacement) k "placeholders" per line from the N, which would deal with class imbalance among emojis (or any unknown, for that matter) and give the network a richer notion of what and how to copy. During training, the same placeholder is put in the target where it appears. Plain <unk> is an instance of this case with N=1 and the control symbol being <unk>. There are now potentially two sources of truth for OOV replacement: alignment, and which "placeholder" maps to which "placeholder".
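A sketch of the assignment side of B (all names and the choice of N are assumptions; the OOV test and the target-side replay during training are left out):

```py
import random

N = 16                                         # number of reserved control symbols (assumed)
PLACEHOLDERS = [f"<ph{i}>" for i in range(N)]  # hypothetical reserved symbols

def assign_placeholders(pieces, is_oov):
    """Replace each OOV surface with a uniformly sampled placeholder,
    remembering the mapping so the same symbol can be emitted on the
    target side during training and restored at inference."""
    mapping, out = {}, []
    for piece, oov in zip(pieces, is_oov):
        if oov:
            ph = random.choice(PLACEHOLDERS)  # uniform, with replacement
            mapping.setdefault(ph, []).append(piece)
            out.append(ph)
        else:
            out.append(piece)
    return out, mapping
```

Since sampling is with replacement, one placeholder can stand for several surfaces in a line; that ambiguity is exactly where the second source of truth (alignment) comes in.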

Implementation Sketch

  1. browsermt/marian-dev: Implement A, contained in marian-dev (related: https://github.com/marian-nmt/marian/issues/249?)
  2. browsermt/bergamot-translator: Integrate the updated API (for decodeWithByteRanges) here, probably in ResponseBuilder (construct Alignments, then construct the translated text using the corresponding source sentence lines and HardAlignments).
  3. browsermt/students: Update documentation indicating that student models should be trained with <unk>, going easy on the cleaning?
  4. browsermt/marian-dev: Implement B on top of the above? Do we really need B? 1, 2, 3 should suffice for most cases, yes?
     a. I assume that at the previous meeting we were looking at some standalone preprocessing/postprocessing tool that accomplishes B in the corpus (text) preprocessing step (working during both inference and training)? I'm unsure how this fits in naturally now. To know <unk> I need SentencePiece, which brings in libmarian, so this is not standalone?
     b. Reserve N symbols in SentencePiece training as OOV placeholders indicating copy? Would it be apt to generalize this further and just reserve some N symbols at training time which get assigned meaning later (WMT de-en kept a bunch of misc tags as standard in the vocabulary)?

@kpu I will go ahead and try to implement 1 and 2, bringing them into the respective repositories by Monday so there's something concrete to take forward. I'll wait for inputs/comments on 3 and 4.

jerinphilip commented 3 years ago

This is the en -> de student model.

Config

```py
config = {
    "models": [os.path.join(BERGAMOT_ARCHIVE, "model.intgemm.alphas.bin")],
    "shortlist": [os.path.join(BERGAMOT_ARCHIVE, "lex.s2t.bin"), True, 50, 50],
    "vocabs": [
        os.path.join(BERGAMOT_ARCHIVE, "vocab.deen.spm"),
        os.path.join(BERGAMOT_ARCHIVE, "vocab.deen.spm"),
    ],
    "ssplit-prefix-file": os.path.join(BERGAMOT_ARCHIVE, "nonbreaking_prefix.en"),
    "max-length-break": 128,
    "mini-batch-words": 1024,
    "workspace": 128,
    "skip-cost": True,
    "cpu-threads": 40,
    "quiet": True,
    "quiet-translation": True,
    "gemm-precision": "int8shiftAlphaAll",
    "alignment": True,
    "allow-unk": True,
    "log": "unk-analyis.log",
    "log-level": "debug",
}
```

What unknowns in the source map to in the target, analyzed across the MTNT dataset using the en source data in train.en-fr.tsv. Top 50 occurrences in the target.
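A simplified sketch of the counting involved (my representation here, not the actual analysis script; it assumes per-sentence (source pieces, target pieces, links) with links as (s, t, prob) triples):

```py
from collections import Counter

def unk_target_counts(sentences, threshold):
    """Count which target pieces the source <unk>s align to, keeping
    only links with probability >= threshold."""
    counts = Counter()
    for src_pieces, tgt_pieces, links in sentences:
        for s, t, p in links:
            if src_pieces[s] == "<unk>" and p >= threshold:
                counts[tgt_pieces[t]] += 1
    return counts.most_common(50)
```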

With alignmentThreshold=1.0 (hardAlignment)
names occurrences
ist 2663
" 1777
. 1572
" 1569
978
942
, 851
' 591
- 398
e 319
s 281
- 217
nicht 152
en 116
". 108
zu 99
es 92
Ich 89
bin 88
die 81
sind 76
st 70
? 69
: 68
das 54
t 52
des 51
Ich 50
der 50
] 48
und 37
sich 37
n 33
be 31
Und 31
wird 31
den 30
ver 28
Es 27
hat 26
geht 26
") 26
gibt 25
) 24
mich 24
er 23
& 23
bis 23
Wir 22
sch 22
With alignmentThreshold = 0.2
names occurrences
ist 2965
" 1904
. 1823
" 1709
1517
, 1281
nicht 916
' 605
bin 560
s 513
475
- 463
e 457
sind 276
es 274
- 251
". 215
en 212
Ich 180
zu 151
be 129
gibt 122
die 121
st 113
sich 107
? 92
das 88
ich 85
t 85
: 82
ver 74
und 74
Sie 73
habe 68
n 67
der 66
Ich 66
von 65
des 61
mich 60
geht 56
haben 53
wird 53
hat 50
] 48
Es 44
ge 44
mit 40
in 40
würde 40

Conclusions:

Unique unknowns found in the source data
😘
😂
”
🤣
😋
’
“
^
>
🍻
👌🏻
😍
👍
😊
😃
Ó
¯
ツ
~~
💣💣💣
}
😓
😁
🕍🐣
👌
😂👌
Á
😉
😍😍😍
😂😂
😍💋💋
🔥🔥
❤❤❤
~
🎶
👋
༽つ
༼
つ
🧠
🙋🏻
❤️
😞
💖
£
💪👍
👎
💋
´
🌚
😎
🎵
👿
🌊
‘
💰
⭐️
😘😘
👍🏽
ś
❤️💙
🎈
📈
📞
͡
͜ʖ
🙄
👀
🙋🏻‍♂️
😭😭😭😂💔
️
♥️
😑
—
😭
🖕
💕
🔥
🤦🏾‍♀️
😅
🐰🐰🐰
😀
😳
😄
😰
✠
✓✓
😎😎
^^^
🇧🇷
č
😂😂🤣🤣
🅱
😏
🤤
🤔
🌈🌈🌈
🍝
😬
🅱🅱
😊😉
É
🤓
💩
💀🎺
⚾️
💜
🗿
🔥🔥🔥
Λ
Ξ
★
`
😭🙃
😍😍
🙂
†
🇺🇸
😢
😤😤😤😫😫
😳😳😳
💔💔
∞
🙌
💋💋
÷
😲😲
😜
😷
🤗
😭😩
😂😭
😛
>>
💌
😭💜
⚠️
❤
😔
🍖
⚽️
😭😭
🌀
Ñ
😪
☦☦☦
😘🍆💦
😂😂😂😂😂😂
🅱️
😂😂😂
🧐
¿
😂😁
💪
✨
„
ê
🤔🤔
😟😟
😕
☠
😐
😩
✍🏾
🌸
🙄🙄
➕
⃣
👌👌👌
^^^^
😒
ñ
🚨
🧚🏼‍♀️💗
🍆
🙃
–
😍🧜🏻‍♀️
>>>
😋🐰🌿
😈💦
😣
🔥😍😜
🤮🤮🤮🤮🤮🤮🤮
ಠ
🍔
😂😂😂😂😂😂🤣🤣🤣🤣🤣🤣
😌
😆
😂”
«
»
👌🏼
👌🙄
كيس
×
🇱🇷
🖤
🤷
😝😈💧
😫💦
❓
💹
😍🍬
😫
😤
🚊
♥
😂😂😂😂
^^
😆🤣
💋💋💋
🤔🤔🤔
😶
🎵“
”🎵
🍵💕
👌👌
🤪🤪
😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😩
👌😎
🔆
}}
😡
ł
😂💯
😂😂🤣🤣🤣😂💯
😁😁😁
😇
🚩🚩🚩🚩🚩
ï
♡
😩😩
👍🏻😊
‪
‬
😋😋
🖕🏿
͛
⬁
🤐🤐
🤔😳
💙🤓
♬
😈😤
¢
🤰🐽🍣
>”
🎁
🦄
🤷‍♀️
🙁
`~
🔊
🤢
皮
牛革
😂🤣
””
🤔😍
😩😩😩😩
😩😩😩
🕵️‍♂️
~~~~~
🐸
⚾
💥💥💥💥
😝🤮
^^^^^
👌🏽
^^^^^^^^
🍴
😂🔥
>`
😂🖕
🙈
🎩
【ブレフロ】【
】
【
😅😅😅
💚
😊🤘☠️
😉😈😈
✔️
ð
😯
θ
≈
ř
終劇
✊
∫
π
≠
⏭️
⏮️😜😝😜
💙
``
🍹🍸🥃🍷🥂🍻🍺
🍺🍻🥂🍷🥃🍸🍹
☹
🙄😂
☀️
✅
♪
♫
🙆🏽‍♂️🙅🏽‍♂️🤷🏽‍♂️😎
―
⚡
ç
“😂”
☺
🙏🏼
☹️
🤷🏻‍♀️
😈
😴
😘😊
😂😂😥
🚯🚫👌💰😄👌🎠
👌👌👌👌👌
ø
✔
😎🐃✌🏾
☺️
Õ
Áìú
🕟
🕑
🤣🤣
😅😐😞
”—
김세연
세
새
게
개
ㅔ
ㅐ
‽
Ż
🤣🤣🤣
```
ú
🙄🙄🙄
Θ
’”
Ω
😭😘
💕💕💕
😔🤦🏾‍♂️
🤷🏼‍♀️
️♀️
😅😂
♥♥
î
ô
œ
Ê

On the larger MTNT monolingual en data

Top 50 target mappings for source <unk>s over the dataset
names occurrences
ist 10029
. 7426
5812
" 5125
" 4108
, 3437
nicht 2803
2796
bin 1853
e 1747
s 1434
' 1394
- 1018
sind 794
Ich 772
en 751
es 662
- 632
". 568
gibt 419
st 412
be 371
die 351
Es 346
zu 324
t 321
das 315
ver 296
wird 275
: 274
sich 272
n 230
Sie 215
hat 214
habe 210
der 205
ich 201
mich 201
? 197
mir 184
haben 184
Die 156
würde 148
keine 146
Sie 142
tut 142
Ich 141
sch 140
in 135
des 129
jerinphilip commented 2 years ago

@jelmervdl Can you take a look at this issue and https://gist.github.com/jerinphilip/439ba3b25cdd0d8727b0c80956340024 when you have some time? The gist is something crude I put together to check whether an <unk> can be replaced by a single token in the target text by finding where it maps to. I believe your insights from the experience of doing HTML tag transfer should be of great value here.

The query I'm trying to put forth here: with a refined HTML tag-transfer API, this problem should be the same as "insert pseudo-tags around an emoji, find the matching range in the target, and copy the contents over from the source text". Are there existing functions/primitives that can be used here, and if so, could you point to them? Specifically, is there a library function that, given source and target, aggregates tokens from the alignment matrix and provides a max-overlap span, or something similar? If you look at the naive-replacement gist, what's happening is that there are multiple (punctuation) tokens and the source text is getting mapped onto all of them.
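Absent an existing primitive, one possible shape for such a function (a sketch, not anything from the tag-transfer code): aggregate the alignment matrix over the source span and take the contiguous target span with the most mass, with a mean-mass penalty so that weakly aligned tokens break the span:

```py
import numpy as np

def max_overlap_span(soft, src_begin, src_end):
    """Return the contiguous target span [begin, end) with the highest
    total alignment mass toward the source span [src_begin, src_end).

    soft: (num_target, num_source) soft alignment matrix (assumed shape).
    """
    mass = soft[:, src_begin:src_end].sum(axis=1)
    # Kadane-style scan over (mass - penalty); using the mean mass as the
    # penalty is a heuristic, not a principled choice.
    penalty = float(mass.mean())
    best, best_span = 0.0, (0, 0)
    cur, cur_start = 0.0, 0
    for t, m in enumerate(mass):
        gain = float(m) - penalty
        if cur <= 0.0:
            cur, cur_start = gain, t
        else:
            cur += gain
        if cur > best:
            best, best_span = cur, (cur_start, t + 1)
    return best_span
```

This would directly address the failure mode in the gist: the span, rather than individual argmax tokens, becomes the unit of replacement, so a run of punctuation tokens is treated as one target range.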