Method to differentiate languages with Alphabets

anaclumos commented 4 years ago

Looking only for certain character sets cannot distinguish languages that share the same alphabet. For example, "Je m'applle Sunghyun" and "I am Sunghyun" cannot be told apart by their character sets alone.
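A minimal sketch of the problem, using only Python's standard `unicodedata` module. The `scripts_used` helper is a hypothetical name for illustration and is not code from the extension; it approximates a letter's script by the first word of its Unicode character name. The Korean sentence is an added example for contrast.

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Return the scripts of the letters in text, e.g. {"LATIN"}.

    The script is approximated by the first word of each letter's
    Unicode character name, which is enough for this illustration.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


french = "Je m'applle Sunghyun"
english = "I am Sunghyun"
korean = "저는 성현입니다"  # illustrative sentence, not from the thread

print(scripts_used(french))   # {'LATIN'}
print(scripts_used(english))  # {'LATIN'}
print(scripts_used(korean))   # {'HANGUL'}
```

Both Latin-alphabet sentences produce the same set, so a filter based on character sets can separate Korean from English but not French from English.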

qgustavor commented 3 years ago

There are some libraries that can detect languages, such as cld3-asm and Whatlang.

Whatlang has an online demo that works using WASM. It seems unreliable with short texts: "I am Sunghyun" is detected as Latvian and "Je m'applle Sunghyun" as German. It returns a confidence score, but sometimes that score is not useful: it detects "Karaoke: What song do u wanna sing? This guy: PA" (I took this comment from a video) as Javanese with 100% confidence. Well, at least it's better than detecting by character sets, and it's more reliable with longer texts.
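One way to act on these observations is to use a detector's guess only when the text is long enough, and to ignore it otherwise. The sketch below assumes the detector is passed in as a plain function returning a language code and a confidence. `Detector`, `guess_language`, `fake_detector`, and the `min_letters` threshold of 40 are all hypothetical and are not part of Whatlang or cld3-asm.

```python
from typing import Callable, Optional, Tuple

# A detector takes text and returns (language_code, confidence).
# This shape is an assumption, not the API of any real library.
Detector = Callable[[str], Tuple[str, float]]


def guess_language(text: str, detect: Detector, min_letters: int = 40) -> Optional[str]:
    """Return the detector's guess, or None when the text is too short to trust."""
    letters = sum(1 for ch in text if ch.isalpha())
    if letters < min_letters:
        return None
    # The confidence is deliberately ignored: it can be high for wrong guesses.
    language, _confidence = detect(text)
    return language


def fake_detector(text: str) -> Tuple[str, float]:
    # Stand-in for a real library call; always answers "eng" with full confidence.
    return ("eng", 1.0)


print(guess_language("I am Sunghyun", fake_detector))  # None
```

The threshold is arbitrary and would need tuning against real comments. Comments that return `None` could be left unfiltered or handled by the character-set check.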

anaclumos commented 3 years ago

Hi @qgustavor, thank you for your attention. I have tested several libraries to make this extension more adaptable for a global audience (I was working on related projects until very recently). I wasn't able to spare more time to work on this, and thus it has remained the same so far. This was my first (and therefore very primitive) public product; various other parts need improvements or fixes. Implementing the language analyzer and addressing those issues would be tough enough that it would be better to scrap the entire thing and start over.

At the same time, YouTube recently announced a comment translation feature that works with Google Translate. I've been using it for the past month and it works very reliably. I have therefore concluded that there is no point in continuing the product; I plan to freeze the repository and announce the official end of development (including bug fixes; refer to /updates) once the YouTube comment translation feature settles in as the norm.

qgustavor commented 3 years ago

I don't like that: Google favors comments based on IP location over more relevant ones, and that's why I want to filter those comments out. I don't want Google to guess that I can't understand what people are writing and push comments from locals so that I engage more in the conversation and on the website. In fact, that turns me off from reading comments: I don't want to read "First!" or "That's good" comments just because they're from locals; I want comments that are actually relevant and good! If they start translating comments, this issue will only get worse, as it will be harder to filter them out (I'm already filtering them manually by skipping them).

anaclumos / youtube-comment-language-filter

Method to differentiate languages with Alphabets #5