Open kindziora opened 2 years ago
I think I already commented on this somewhere else: could you try without the all-capitals text? That might work better. I believe the training data contains too few examples written entirely in capital letters ...
Thank you for your reply (it's a re-post from https://github.com/Helsinki-NLP/OPUS-MT-train/issues/62). Now it removes all the text and outputs just line breaks. :( It might be because of special characters; I will investigate this further.
Kind regards
Alex
echo "30.4 c\nyaoundé\n \nlundi, 4 octobre 2021 11:46\nafrique centrale\n \nafrique de l’ouest\n \ntélécoms\n \ninnovation\n \ninternet\n \nentretiens\n \nfrançais\nmore\nafrique centrale\n \nafrique de l’ouest\n \ntélécoms\n \ninnovation\n" | ./opusMT-client.py -H localhost -s fr -t en
{
"alignment": [
"0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-7 8-8 9-9 10-10 11-11 12-12 13-13 14-14 15-15 16-16 17-17 18-18 19-19 20-20",
"0-0 1-1 2-2 5-3 7-4 7-6 7-8 7-10 7-12 7-14 8-5 8-7 8-9 8-11 8-13 8-15 16-17 16-19 17-16 17-18 17-20 31-22 31-24 31-26 31-28 36-30 40-32 40-34 40-36 41-33 41-35 41-37 41-39 44-41 45-38 45-40 45-42 45-44 45-46 45-48 45-50 45-52 47-43 47-45 47-47 47-49 47-51 47-53 47-55 47-57 47-59 47-61 47-63 47-65 47-67 47-87 47-89 47-91 47-93 47-95 47-97 47-99 47-101 47-103 47-105 47-107 47-109 53-54 53-56 53-58 53-60 53-62 53-64 53-66 53-68 53-70 53-72 53-74 53-76 53-78 53-80 53-82 53-84 53-86 53-88 53-90 53-92 53-94 53-96 53-98 53-100 53-102 53-104 53-106 53-108 53-110 53-112 53-114 53-116 53-118 53-120 60-21 60-23 60-25 60-27 60-29 60-31 60-147 62-111 62-113 62-115 62-117 62-119 62-121 62-123 62-143 62-145 62-159 62-179 62-181 62-183 62-185 62-187 62-189 70-140 70-142 70-144 70-146 70-148 70-150 70-152 70-156 70-158 70-160 70-162 70-164 70-166 70-168 70-170 70-172 70-174 70-176 70-178 70-180 70-182 70-184 70-186 70-188 70-190 70-192 70-194 70-196 70-198 70-200 70-212 70-214 70-216 70-220 70-222 70-224 70-226 70-230 71-139 71-141 74-69 74-71 74-73 74-75 74-77 74-79 74-81 74-83 74-85 74-125 74-127 74-129 74-131 74-133 74-135 74-137 74-149 74-151 74-153 74-155 74-157 74-161 74-163 74-165 74-167 74-169 74-171 74-173 74-175 74-177 74-191 74-193 74-195 74-197 74-199 74-201 74-203 74-205 74-207 74-209 74-211 74-213 74-215 74-217 74-219 74-221 74-223 74-225 74-227 74-229 75-122 75-124 75-126 75-128 75-130 75-132 75-134 75-136 75-138 75-154 75-202 75-204 75-206 75-208 75-210 75-218 75-228"
],
"result": "30.4 c\\nyaound\u00e9\\n \\nlundi, 4 October 2021 11: 46\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n",
"segmentation": "spm",
"server": "localhost:20012",
"source": "fr",
"source-segments": [
"\u258130 .4 \u2581c \\ ny a ound \u00e9 \\ n \u2581\\ n lu ndi , \u25814 \u2581octobre \u258120 21 \u258111 :",
"\u258146 \\ na f rique \u2581centrale \\ n \u2581\\ na f rique \u2581de \u2581l ' ouest \\ n \u2581\\ nt \u00e9l\u00e9 com s \\ n \u2581\\ n innovation \\ n \u2581\\ n internet \\ n \u2581\\ n entretien s \\ n \u2581\\ n fran\u00e7ais \\ n more \\ na f rique \u2581centrale \\ n \u2581\\ na f rique \u2581de \u2581l ' ouest \\ n \u2581\\ nt \u00e9l\u00e9 com s \\ n \u2581\\ n innovation \\ n"
],
"target": "en",
"target-segments": [
"\u258130 .4 \u2581c \\ ny a ound \u00e9 \\ n \u2581\\ n l undi , \u25814 \u2581October \u258120 21 \u258111 :",
"\u258146 \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n"
]
}
I guess that the input text is just too different from what the model has seen in training. It is trained on full sentences, but the input is heavily fragmented into short terms and phrases. Does this also happen with full sentences, one per line, as input?
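One way to test the two suggestions in this thread (no all-capitals text, one unit of text per line) is to clean the input before sending it to the client. The `source-segments` in the output above contain tokens such as `\ n`, which suggests the `\n` sequences may have reached the tokenizer as literal characters rather than as line breaks; that is only a reading of the output, not a confirmed diagnosis. The sketch below is a minimal illustration under those assumptions. `prepare_lines` is a hypothetical helper and not part of OPUS-MT or its client.

```python
def prepare_lines(raw: str, lowercase_all_caps: bool = True) -> list[str]:
    """Split raw text into clean, non-empty lines.

    Hypothetical pre-processing helper, not part of OPUS-MT.
    """
    # Assumption: a literal backslash followed by "n" (two characters)
    # was meant to be a line break, so treat it like a real newline.
    text = raw.replace("\\n", "\n")

    lines = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # drop blank or whitespace-only lines
        # Lowercase lines whose cased characters are all upper case,
        # following the suggestion to avoid all-capitals input.
        if lowercase_all_caps and line.isupper():
            line = line.lower()
        lines.append(line)
    return lines


if __name__ == "__main__":
    sample = "30.4 C\\nYaoundé\\n \\nLUNDI, 4 OCTOBRE 2021 11:46\\nAFRIQUE CENTRALE\\n"
    for line in prepare_lines(sample):
        print(line)
```

Each returned line could then be sent to the translation client separately, which makes it easier to see which individual fragments trigger the empty or repeated output.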
Hi Guys,
Problem: regardless of the model I use, there are situations where the translation is broken and contains many repetitions.
One example:
echo "30.4 C\nYaoundé\n \nLUNDI, 4 OCTOBRE 2021 11:46\nAFRIQUE CENTRALE\n \nAFRIQUE DE L’OUEST\n \nTÉLÉCOMS\n \nINNOVATION\n \nINTERNET\n \nENTRETIENS\n \nFRANÇAIS\nMORE\nAFRIQUE CENTRALE\n \nAFRIQUE DE L’OUEST\n \nTÉLÉCOMS\n \nINNOVATION\n" | ./opusMT-client.py -H localhost -s fr -t en
marian-opus-fr-en arguments
--alignment -p 11002 -b2 -n1 -m /usr/local/share/opusMT/models/fr-en/opus.npz -v /usr/local/share/opusMT/models/fr-en/opus.vocab.yml /usr/local/share/opusMT/models/fr-en/opus.vocab.yml
opusMT-opus-fr-en arguments
-p 20012 -c /var/cache/opusMT/opus.fr-en.cache.db --spm /usr/local/share/opusMT/models/fr-en/opus.fr.spm --mtport 11002 -s fr -t en
Result:
As you can see, the result contains "WEST AFRICA" many times.
Question: Does anybody have an idea why this happens? Could it be related to marian-decoder or sentencepiece?
Kind Regards
Alex