citadel-ai / langcheck

Simple, Pythonic building blocks to evaluate LLM applications.
https://langcheck.readthedocs.io/en/latest/index.html
MIT License

Create langcheck.utils.detect_language() #67

Open syamaco opened 10 months ago

syamaco commented 10 months ago

Hello, I have a question about the following test code.

import langcheck

# Japanese and English variants of the same four sentences
generated_outputs = [
    '適度な運動は健康に良いとされています。',
    '適度な運動は健康に悪いとされています。',
    '過度の運動は健康に良いとされています。',
    '過度の運動は健康に悪いとされています。',
    'Moderate exercise is good for your health.',
    'Moderate exercise is bad for your health.',
    'Excessive exercise is good for your health.',
    'Excessive exercise is bad for your health.',
]

# Toxicity
display(langcheck.metrics.ja.toxicity(generated_outputs) < 0.2)
display(langcheck.metrics.en.toxicity(generated_outputs) < 0.2)

Thank you in advance.

(Screenshot attached: LangCheck_Toxicity)
kennysong commented 10 months ago

Hi @syamaco, thanks for the question!

Automatically detect languages (e.g., EN & JA).

I think it makes sense to include a langcheck.utils.detect_language() function in LangCheck. There should be some well-known heuristics we can use to implement this.
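For illustration, here is a minimal character-range heuristic (not LangCheck's actual implementation, and limited to distinguishing English from Japanese) that returns per-language confidence fractions:

```python
def detect_language(text: str) -> dict[str, float]:
    """Rough EN/JA detection via Unicode character ranges.

    Returns a dict mapping language codes to the fraction of
    alphabetic characters attributed to each language (values
    sum to 1.0). Empty dict if no alphabetic characters.
    """
    counts = {"en": 0, "ja": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip punctuation, digits, whitespace
        code = ord(ch)
        # Hiragana, katakana (U+3040-30FF) and CJK ideographs
        # (U+4E00-9FFF) all occur in Japanese text.
        if 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            counts["ja"] += 1
        elif ch.isascii():
            counts["en"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items() if n}
```

A real implementation would want a proper library (such as lingua-py) to cover more languages, but this shows the shape of the confidence-dict output.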

Unify the toxicity threshold value across languages (e.g., 0.2 for both).

Currently, English toxicity (detoxify) and Japanese toxicity (fine-tuned line-distilbert-base-japanese) are completely different models, so it may not be optimal to set a single threshold for both models.

If you use OpenAI to compute toxicity, I believe it's using the exact same model and prompt for both languages, so you might be able to set a single threshold in that case.

@liwii if you have any suggestions, feel free to chime in!

syamaco commented 10 months ago

Hi @kennysong, thanks for your response!

It would be helpful to have a language detection function like that. Additionally, it would be even better if it could detect when multiple languages are mixed within a sentence.

I now understand that the toxicity threshold can vary depending on the language model. Since the threshold could also shift between versions of the same model, I feel it's important to choose it carefully.

I will also try the OpenAI-based metrics, referring to the sample code.

Thank you.

kennysong commented 10 months ago

Additionally, it would be even better if it could detect when multiple languages are mixed within a sentence.

Got it! What would be a useful output of langcheck.utils.detect_language() if there are multiple languages? The simplest idea is a list like ['en', 'ja'], but I think there are many other options that could be useful.

Also, considering that the threshold might change even with the same language model due to different versions, I feel it's important to exercise caution in determining the threshold.

This is a good point. I think it's a good idea to pin a specific version of a HuggingFace model in LangCheck where possible, so that model upgrades are controlled by LangCheck releases. We can track this as a separate feature request.

syamaco commented 10 months ago

@kennysong san, thank you for the suggestion.

Is it possible to set the output of langcheck.utils.detect_language() to the detected language and its probability, like {'en': 0.7, 'ja': 0.3} ?

kennysong commented 10 months ago

Yes, I think we can use https://github.com/pemistahl/lingua-py to output confidence scores for language detection.

I'm not quite sure how they handle confidence scores for input with multiple languages, though. We'll need to dig into that later.

From your perspective, what do you expect the probabilities to be when a sentence contains both English and Japanese? {'en': 1.0, 'ja': 1.0} or {'en': 0.5, 'ja': 0.5}?


syamaco commented 9 months ago

@kennysong san,

I felt that output similar to detector.compute_language_confidence_values() in Lingua would be natural: https://github.com/pemistahl/lingua-py#113-confidence-values

ENGLISH: 0.93
FRENCH: 0.04
GERMAN: 0.02
SPANISH: 0.01

If langcheck.utils.detect_language() can identify the main language of the text along with its probability, that could serve as a basis for deciding whether to process it with langcheck.metrics.ja.toxicity() or langcheck.metrics.en.toxicity(). And when the confidence values are close, we might choose not to process the text at all.

Thank you.
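As a sketch of that dispatch logic (the function name and the 0.6 threshold are hypothetical, not part of LangCheck), one could route on the dominant language's confidence and skip ambiguous inputs:

```python
from typing import Optional

def choose_metric_language(confidences: dict[str, float],
                           min_confidence: float = 0.6) -> Optional[str]:
    """Pick which language-specific metric to run, or None to skip.

    `confidences` is a detect_language()-style result, e.g.
    {'en': 0.7, 'ja': 0.3}. When no language is dominant enough,
    return None so the caller can skip scoring instead of running
    the wrong model.
    """
    if not confidences:
        return None
    lang, score = max(confidences.items(), key=lambda kv: kv[1])
    return lang if score >= min_confidence else None

# Hypothetical usage with LangCheck's per-language metrics:
# lang = choose_metric_language(detect_language(text))
# if lang == 'ja':
#     result = langcheck.metrics.ja.toxicity([text])
# elif lang == 'en':
#     result = langcheck.metrics.en.toxicity([text])
# else:
#     pass  # confidence too low; skip this text
```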

kennysong commented 9 months ago

Sounds good, we can try the default compute_language_confidence_values() first.

I'm not sure that it'll actually return {"en": 0.5, "ja": 0.5} for a sentence with equal amounts of English and Japanese, so we should test it later.

Other options are to use compute_language_confidence_values() on a single language at a time or detect_multiple_languages_of().