AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Weird Translation issues in Malayalam #50

Closed kurianbenoy closed 7 months ago

kurianbenoy commented 7 months ago
  1. The model mistranslates words written with dots, such as U.D.F or B.J.P, while the undotted form BJP works fine. The same thing happens with names like V.D Satheeshan: in the English translation it comes out as Pinarayi Vijayan D.

Input Text

കേരളത്തിലെ പ്രമുഖ UDF നേതാക്കൾ നാളെ ബിജെപിയിൽ ചേരും, വരുംദിവസങ്ങളിൽ LDF നേതാക്കളും- സുരേന്ദ്രൻ
തിരുവനന്തപുരം: പ്രധാനമന്ത്രി നരേന്ദ്രമോദിയുടെ സന്ദർശനത്തിന് മുന്നോടിയായി കേരളത്തിലെ പ്രമുഖ എൽ.ഡി.എഫ്., യു.ഡി.എഫ്. നേതാക്കൾ ബിജെപിയിൽ അംഗത്വമെടുക്കുമെന്ന് ബി.ജെ.പി. സംസ്ഥാന അധ്യക്ഷൻ കെ. സുരേന്ദ്രൻ. പൗരത്വ നിയമം കേരളത്തിലും നടപ്പാക്കുമെന്നും പിണറായി വിജയന്റെയും വി.ഡി. സതീശന്റെയും വാക്കുകേട്ട് തുള്ളാൻ നിന്നാൽ നിങ്ങൾ വെള്ളത്തിലാകുമെന്നും കെ. സുരേന്ദ്രൻ പറഞ്ഞു.

Output Text:

Prominent UDF leaders in Kerala to join BJP tomorrow, LDF leaders in coming days: Surendran.Thiruvananthapuram: Ahead of Prime Minister Narendra Modi's visit to Kerala, a prominent lawyer from Kerala has come forward..D.F., U.D.F.BJP leaders to join party soon.J.P.State president K.S..Surendran..Citizenship Act will be implemented in Kerala too, says Pinarayi Vijayan.D.If you stop to listen to Satheesan's words, you will be in the water..Surendran said..
  2. Sometimes a person referred to as "he" in the source is rendered as "she" in the English translation. I have noticed this a few times. If you need samples, let me know.

PranjalChitale commented 7 months ago

The IndicTrans2 model is trained on a general-domain corpus (BPCC), which may lack adequate representation of such abbreviations. You could consider fine-tuning to improve performance on such cases.

One additional point to note is that sentence segmentation tools may inadvertently split sentences at the periods inside these abbreviations, so incomplete sentences get passed to the model and yield suboptimal translations.
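As a rough illustration of one way to work around the segmentation problem, the sketch below collapses dotted abbreviations (for example "U.D.F." to "UDF") before the text reaches the sentence splitter and the model. This is not part of IndicTrans2 or its toolkit: the function name `collapse_dotted_abbreviations` and both regular expressions are assumptions made for this example. The Malayalam pattern assumes each abbreviated syllable is at most three code points from the Malayalam Unicode block (U+0D00 to U+0D7F), so older chillu encodings that use a zero-width joiner would not match.

```python
import re

# Hypothetical preprocessing helper, not an IndicTrans2 API.

# Latin initials such as "U.D.F", "B.J.P." or "V.D" (two or more capitals).
_LATIN_ABBR = re.compile(r"(?<![A-Za-z])(?:[A-Z]\.)+[A-Z](?![A-Za-z])\.?")

# Malayalam initials such as "ബി.ജെ.പി.": two or more short chunks
# (1 to 3 code points, an assumption) each followed by a dot.
_ML = r"\u0D00-\u0D7F"
_MALAYALAM_ABBR = re.compile(rf"(?<![{_ML}])(?:[{_ML}]{{1,3}}\.){{2,}}")


def _strip_dots(match: re.Match) -> str:
    """Return the matched abbreviation with its dots removed."""
    return match.group(0).replace(".", "")


def collapse_dotted_abbreviations(text: str) -> str:
    """Remove dots inside abbreviations so a splitter does not cut there.

    Single initials such as "K." are left untouched, because they cannot
    be told apart from a short word at the end of a sentence.
    """
    text = _LATIN_ABBR.sub(_strip_dots, text)
    return _MALAYALAM_ABBR.sub(_strip_dots, text)


if __name__ == "__main__":
    print(collapse_dotted_abbreviations("U.D.F and B.J.P. leaders met V.D Satheeshan"))
    print(collapse_dotted_abbreviations("ബി.ജെ.പി. സംസ്ഥാന അധ്യക്ഷൻ കെ. സുരേന്ദ്രൻ."))
```

One caveat of this design: the trailing dot of an abbreviation is removed too, so when an abbreviation is the last word of a sentence, that sentence boundary is lost. Whether the undotted forms translate better would still need to be checked on real inputs; the issue only reports that plain BJP works fine.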

Regarding (2), this is probably due to biases in the training data, which are hard to control directly. Fine-tuning the model may help.

kurianbenoy commented 7 months ago

Thank you @PranjalChitale for suggesting what to do next. Are there any updates planned for the IndicTrans2 models anytime soon?

jaygala24 commented 7 months ago

Hi @kurianbenoy

We do not plan to update IndicTrans2 models anytime soon. Thanks!