CodeWithKyrian / transformers-php

Transformers PHP is a toolkit for PHP developers to add machine learning magic to their projects easily.
https://codewithkyrian.github.io/transformers-php/
Apache License 2.0

Regex for detecting language codes incorrect #43

Closed Thorry84 closed 1 week ago

Thorry84 commented 1 month ago

System Info

Ubuntu, PHP 8.1.2

PHP Version

8.1.2

Environment/Platform

Description

In the Codewithkyrian\Transformers\PretrainedTokenizers\NllbTokenizer class, this regex is used to detect language codes: /^[a-z]{3}_[A-Z]{3}$/. However, some models, such as Xenova/nllb-200-distilled-600M, use a format like eng_Latn (full list), which this regex does not match because the part after the underscore is four letters in mixed case.

I would suggest something like /^[a-z]{3}_[a-zA-Z]{3,4}$/
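
To illustrate the difference, here is a minimal stand-alone sketch that runs both patterns over a few sample codes. It uses only PHP's built-in preg_match and does not call the library; the variable names and the sample list are assumptions made for illustration, and eng_USA is an invented code in the old three-uppercase-letter shape.

```php
<?php
// Hypothetical stand-alone check; this is not the library's actual code path.
$current  = '/^[a-z]{3}_[A-Z]{3}$/';       // pattern quoted in this issue
$proposed = '/^[a-z]{3}_[a-zA-Z]{3,4}$/';  // suggested replacement

// eng_Latn, deu_Latn, zho_Hans follow the NLLB style; eng_USA is invented.
$samples = ['eng_Latn', 'deu_Latn', 'zho_Hans', 'eng_USA'];

foreach ($samples as $code) {
    printf(
        "%-9s current=%s proposed=%s\n",
        $code,
        preg_match($current, $code) === 1 ? 'yes' : 'no',
        preg_match($proposed, $code) === 1 ? 'yes' : 'no'
    );
}
```

The current pattern rejects the NLLB-style codes and accepts only the three-uppercase-letter shape, while the suggested pattern accepts all four samples.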

Is there a big penalty to false positives here? Is this check required?

Reproduction

use function Codewithkyrian\Transformers\Pipelines\pipeline;

$trans = pipeline('translation', 'Xenova/nllb-200-distilled-600M');
$trans('Translation test', srcLang: 'eng_Latn', tgtLang: 'deu_Latn');

CodeWithKyrian commented 1 week ago

You're right, the current regex does not accommodate the format used by this particular model. I didn't get to test with it, so thank you for bringing this to my attention.

In this context, I don't see any significant penalty for false positives, so sure, your suggested regex /^[a-z]{3}_[a-zA-Z]{3,4}$/ would be more inclusive for different language code formats.
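
To make the false-positive trade-off concrete, here is a small sketch of strings the widened pattern accepts even though they are not language codes. The candidate strings are invented for illustration, and whether such strings ever reach the tokenizer depends on how NllbTokenizer uses the check, which this thread does not show.

```php
<?php
// The widened pattern from this thread.
$proposed = '/^[a-z]{3}_[a-zA-Z]{3,4}$/';

// Invented strings for illustration; none is a real language code.
$candidates = ['foo_bar', 'abc_XyZ', 'the_end', 'en_US', 'hello_world'];

// Keep only the strings the pattern accepts.
$accepted = array_values(array_filter(
    $candidates,
    fn (string $s): bool => preg_match($proposed, $s) === 1
));

print_r($accepted);
```

The first three strings are accepted because they have three lowercase letters, an underscore, and three or four letters; en_US and hello_world are rejected because their first part is not exactly three lowercase letters.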

I appreciate your contribution and will incorporate this improvement. Thank you!

Thorry84 commented 1 week ago

Thanks so much for your work! <3