Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License
323 stars 40 forks

broken translation, many repetitions & duplicates #62

Open kindziora opened 2 years ago

kindziora commented 2 years ago

Hi Guys,

Problem: regardless of the model I use, there are situations where the translation is broken and contains many repetitions.

One Example:

echo "30.4 C\nYaoundé\n \nLUNDI, 4 OCTOBRE 2021 11:46\nAFRIQUE CENTRALE\n \nAFRIQUE DE L’OUEST\n \nTÉLÉCOMS\n \nINNOVATION\n \nINTERNET\n \nENTRETIENS\n \nFRANÇAIS\nMORE\nAFRIQUE CENTRALE\n \nAFRIQUE DE L’OUEST\n \nTÉLÉCOMS\n \nINNOVATION\n" | ./opusMT-client.py -H localhost -s fr -t en

marian-opus-fr-en arguments

--alignment -p 11002 -b2 -n1 -m /usr/local/share/opusMT/models/fr-en/opus.npz -v /usr/local/share/opusMT/models/fr-en/opus.vocab.yml /usr/local/share/opusMT/models/fr-en/opus.vocab.yml

opusMT-opus-fr-en arguments

-p 20012 -c /var/cache/opusMT/opus.fr-en.cache.db --spm /usr/local/share/opusMT/models/fr-en/opus.fr.spm --mtport 11002 -s fr -t en

Result:

 { 
    "alignment": [
        "0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-7 8-8 9-9 10-10 11-11 12-12 13-13 14-14 15-15 16-16 17-17 19-18 20-19 21-20 22-21 23-22 24-23 25-24",
        "0-0 2-105 4-3 7-1 8-2 10-4 10-184 10-194 11-5 11-45 11-55 11-60 11-65 11-110 11-115 11-120 11-125 11-130 11-135 11-140 11-145 12-6 12-81 12-86 12-91 12-141 12-151 12-156 12-161 12-166 12-231 12-236 12-241 13-50 13-70 13-75 13-80 13-85 13-90 13-95 13-100 13-150 13-155 13-160 13-165 13-170 13-175 13-180 13-185 13-190 13-195 13-200 13-205 13-210 13-215 13-220 13-225 13-230 13-235 13-240 13-245 13-250 13-255 13-260 13-265 15-8 15-13 15-83 15-88 15-138 15-143 15-148 15-153 15-158 15-163 15-168 15-173 15-178 15-203 15-208 15-213 15-218 15-223 15-228 15-233 15-238 15-243 15-248 15-253 15-258 15-263 19-259 20-261 21-7 21-72 21-77 21-82 21-227 21-232 21-237 21-242 21-247 22-9 22-14 22-84 22-89 22-94 22-99 22-139 22-144 22-149 22-154 22-159 22-164 22-169 22-174 22-179 22-219 22-224 22-229 22-234 22-239 22-244 22-249 22-254 22-264 23-10 24-11 24-16 24-251 24-256 32-15 47-96 47-171 47-176 57-47 57-52 57-62 57-67 67-18 67-43 67-48 67-53 67-58 67-63 67-68 67-73 67-78 67-93 67-98 67-103 67-108 67-113 67-118 67-123 67-128 67-133 67-183 67-188 67-193 67-198 70-12 70-252 70-257 70-262 73-19 74-20 76-25 78-23 78-28 83-21 83-266 84-17 84-22 84-27 84-57 84-87 84-92 84-97 84-102 84-107 84-112 84-117 84-122 84-127 84-132 84-137 84-142 84-147 84-152 84-157 84-162 84-167 84-172 84-177 84-182 84-187 84-192 84-197 84-202 84-207 84-212 84-217 84-222 85-24 85-104 85-109 85-114 85-119 85-189 85-199 86-30 87-26 87-31 87-36 87-41 87-46 87-51 87-56 87-61 87-66 87-71 87-101 87-106 87-111 87-116 87-121 87-126 87-131 87-136 87-146 87-181 87-186 87-191 87-196 87-201 87-206 87-211 87-216 89-33 89-38 93-32 93-37 93-42 94-29 94-34 94-39 94-44 94-49 94-54 94-59 94-64 95-35 95-40 96-76 96-221 96-226 96-246 102-69 102-74 102-79 102-124 102-129 102-134 102-204 102-209 102-214"
    ],
    "result": "30.4 C\\nYaound\u00e9\\n \\nLUNDI, 4 OCTOBER 2021 11: CENTRAL AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\NLAND",
    "segmentation": "spm",
    "server": "localhost:20012",
    "source": "fr",
    "source-segments": [
        "\u258130 .4 \u2581C \\ n Y a ound \u00e9 \\ n \u2581\\ n L UND I , \u25814 \u2581 OC TO BRE \u258120 21 \u258111 :",
        "\u258146 \\ n A FR IQUE \u2581C ENT R ALE \\ n \u2581\\ n A FR IQUE \u2581DE \u2581L ' OU EST \\ n \u2581\\ n T\u00c9 L \u00c9 COM S \\ n \u2581\\ n IN NO V ATION \\ n \u2581\\ n INTER NET \\ n \u2581\\ n ENT RET IENS \\ n \u2581\\ n FR AN \u00c7 AIS \\ n M ORE \\ n A FR IQUE \u2581C ENT R ALE \\ n \u2581\\ n A FR IQUE \u2581DE \u2581L ' OU EST \\ n \u2581\\ n T\u00c9 L \u00c9 COM S \\ n \u2581\\ n IN NO V ATION \\ n"
    ],
    "target": "en",
    "target-segments": [
        "\u258130 .4 \u2581C \\ n Y a ound \u00e9 \\ n \u2581\\ n LU ND I , \u25814 \u2581OCT O BER \u258120 21 \u258111 :",
        "\u2581C ENT RAL \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ N LAND"
    ]
}

As you can see, the result repeats "WEST AFRICA" many times.
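To spot this kind of output automatically, one option is to measure how much of a translation is taken up by a single repeated word n-gram. The following is only a minimal sketch: the function names and the 0.3 threshold are illustrative assumptions and not part of OPUS-MT or Marian.

```python
from collections import Counter


def most_repeated_ngram(text, n=2):
    """Return (ngram, count) for the most frequent word n-gram in text."""
    words = text.split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return None, 0
    return Counter(ngrams).most_common(1)[0]


def looks_degenerate(text, n=2, threshold=0.3):
    """Flag text in which one n-gram makes up at least `threshold` of all n-grams.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    _, count = most_repeated_ngram(text, n)
    total = max(len(text.split()) - n + 1, 1)
    return count / total >= threshold
```

A check like this could be run on the "result" field of each response to log inputs that trigger the repetition, which may help narrow down what they have in common.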

Question: Does anybody have an idea why this happens? Could it be related to marian-decoder or sentencepiece?
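One detail visible in the "source-segments" above is the `\ n` pieces, which suggests that the `\n` sequences reached the model as a literal backslash followed by `n` rather than as line breaks. Whether this is related to the repetition is not established here. As a hedged sketch for ruling it out, the input could be split into separate short segments before translation; `split_segments` is a hypothetical helper, not part of the opusMT client.

```python
import re


def split_segments(raw):
    """Split on real newlines and on literal backslash-n, dropping empty pieces."""
    # r"\\n" matches a literal backslash followed by "n";
    # the second alternative matches an actual newline character.
    parts = re.split(r"\\n|\n", raw)
    return [p.strip() for p in parts if p.strip()]
```

Each returned segment could then be sent to the translation service on its own, so that every menu entry is translated as a separate short input.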

Kind Regards

Alex

jorgtied commented 2 years ago

Could it be because of all the capital letters? Did you try with normal text that is not in all capitals?
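A quick way to test this hypothesis is to convert the all-caps lines to sentence case before sending them, then compare the output with the original. The sketch below is only an illustration; `decapitalize_all_caps` is a hypothetical helper, and plain sentence casing will lower-case proper nouns, so it is meant for an experiment rather than as a fix.

```python
def decapitalize_all_caps(line):
    """Convert a line written entirely in capitals to sentence case.

    Lines with mixed case, or with fewer than two letters (such as "30.4 C"),
    are returned unchanged.
    """
    letters = [c for c in line if c.isalpha()]
    if len(letters) > 1 and all(c.isupper() for c in letters):
        return line.capitalize()
    return line
```

For example, "AFRIQUE CENTRALE" becomes "Afrique centrale", while "Yaoundé" is left as it is.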