brave / brave-browser

Brave browser for Android, iOS, Linux, macOS, Windows.
https://brave.com
Mozilla Public License 2.0

translate detects Coptic as Marathi #35788

Open rillian opened 9 months ago

rillian commented 9 months ago

Description

The page translation feature mis-detects Coptic as Marathi, which is distracting and not helpful.

Steps to Reproduce

  1. Visit a page with mostly Coptic text
  2. Click 'English' in the translation controls

Actual result:

The translation control dropdown opens, offering to translate Marathi to English:

image

Clicking on 'English' replaces the text with question marks:

image

Expected result:

Translation drop-down should not open automatically when translation is not possible. Requesting a translation shouldn't mangle text in other languages.

Reproduces how often:

Easily

Brave version (brave://version info)

1.63.141 Chromium: 121.0.6167.139 (Official Build) beta (64-bit)

Version/Channel Information:

Miscellaneous Information:

Coptic and Marathi use distinct Unicode blocks, so this sort of mis-detection could probably be avoided by a simple pre-filter based on character distribution if the models aren't making correct determinations.
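A minimal sketch of such a pre-filter, assuming a small hand-written table of Unicode block ranges. This is not how Brave or CLD3 works; the names `SCRIPT_RANGES`, `EXPECTED_SCRIPT`, `dominant_script` and `detection_is_plausible` are hypothetical and only illustrate the idea of rejecting a detected language whose script does not match the characters actually on the page.

```python
from collections import Counter

# Code point ranges from the Unicode block charts. Coptic letters live in the
# Coptic block plus a few letters inside the Greek and Coptic block.
SCRIPT_RANGES = {
    "coptic": [(0x2C80, 0x2CFF), (0x03E2, 0x03EF)],
    "devanagari": [(0x0900, 0x097F)],
}

# Hypothetical table: the script each detected language code is written in.
EXPECTED_SCRIPT = {"mr": "devanagari", "hi": "devanagari", "cop": "coptic"}


def script_of(ch):
    """Return the known script containing this character, or None."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def dominant_script(text, threshold=0.5):
    """Return the known script covering more than `threshold` of the letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None
    counts = Counter(script_of(ch) for ch in letters)
    counts.pop(None, None)  # ignore letters outside the table
    if not counts:
        return None
    script, n = counts.most_common(1)[0]
    return script if n / len(letters) > threshold else None


def detection_is_plausible(lang_code, text):
    """Reject a detection only when the page's dominant script contradicts it."""
    expected = EXPECTED_SCRIPT.get(lang_code)
    found = dominant_script(text)
    if expected is None or found is None:
        return True  # no opinion: leave the model's answer alone
    return expected == found
```

With a check like this, a page of mostly Coptic letters reported as `mr` would be rejected before the translation prompt is shown, while text in scripts missing from the table passes through unchanged.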

rebron commented 9 months ago

cc: @atuchin-m

atuchin-m commented 9 months ago

The page doesn't provide any language tags, so the only way we can guess the language is the third-party CLD3 engine. It is only an approximation and can't get the correct result on 100% of sites. Chrome uses another, more advanced model and still detects the page as CN.

The only way is to select the language manually. You could also send a bug report to the page owners.

atuchin-m commented 9 months ago

brave://translate-internals/#detection-logs can help to understand the logic.

atuchin-m commented 9 months ago

Also, translation can be disabled for a specific site:

image

rillian commented 9 months ago

Thanks, those are good workarounds. Perhaps Coptic is such a minority language that we don't want to bother, but I was reporting it as an annoyance I'd encountered.

Adding a lang attribute does suppress the translation controls entirely, so that's a good solution for page authors.