LlmKira / fast-langdetect

⚡️ 80x faster language detection with Fasttext | Split text by language for TTS
MIT License

Bug: FastText Incompatibility with NumPy >= 2.0.0 #3

Status: Closed (myhloli closed 2 months ago)

myhloli commented 2 months ago
______________________________________________________________________ test_detect_totally _______________________________________________________________________

    def test_detect_totally():
        from fast_langdetect import detect_language
>       assert detect_language("hello world") == "EN", "ft_detect error"

tests/test_detect.py:25: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
venv/lib/python3.10/site-packages/fast_langdetect/ft_detect/__init__.py:23: in detect_language
    lang_code = detect(sentence, low_memory=low_memory).get("lang").upper()
venv/lib/python3.10/site-packages/fast_langdetect/ft_detect/infer.py:81: in detect
    labels, scores = model.predict(text)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <fasttext.FastText._FastText object at 0x10ecdca90>, text = 'hello world\n', k = 1, threshold = 0.0, on_unicode_error = 'strict'

    def predict(self, text, k=1, threshold=0.0, on_unicode_error='strict'):
        """
        Given a string, get a list of labels and a list of
        corresponding probabilities. k controls the number
        of returned labels. A choice of 5, will return the 5
        most probable labels. By default this returns only
        the most likely label and probability. threshold filters
        the returned labels by a threshold on probability. A
        choice of 0.5 will return labels with at least 0.5
        probability. k and threshold will be applied together to
        determine the returned labels.

        This function assumes to be given
        a single line of text. We split words on whitespace (space,
        newline, tab, vertical tab) and the control characters carriage
        return, formfeed and the null character.

        If the model is not supervised, this function will throw a ValueError.

        If given a list of strings, it will return a list of results as usually
        received for a single line of text.
        """

        def check(entry):
            if entry.find('\n') != -1:
                raise ValueError(
                    "predict processes one line at a time (remove \'\\n\')"
                )
            entry += "\n"
            return entry

        if type(text) == list:
            text = [check(entry) for entry in text]
            all_labels, all_probs = self.f.multilinePredict(
                text, k, threshold, on_unicode_error)

            return all_labels, all_probs
        else:
            text = check(text)
            predictions = self.f.predict(text, k, threshold, on_unicode_error)
            if predictions:
                probs, labels = zip(*predictions)
            else:
                probs, labels = ([], ())

>           return labels, np.array(probs, copy=False)
E           ValueError: Unable to avoid copy while creating an array as requested.
E           If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
E           For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

venv/lib/python3.10/site-packages/fasttext/FastText.py:232: ValueError

https://github.com/facebookresearch/fastText has been archived, so this cannot be fixed upstream. As a workaround, I just added "numpy<2.0.0" to my requirements.txt.
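
A pin in requirements.txt only helps when the environment is actually installed from that file. As a rough sketch, a startup guard could compare the installed NumPy version against the pinned range and fail early with a clear message. The helper name `within_pin` is hypothetical, and the parsing is deliberately simplified: it looks only at the numeric major/minor/patch parts and ignores pre-release suffixes.

```python
def within_pin(version: str) -> bool:
    """Return True if `version` falls in the range >=1.26.4,<2.0.0.

    Simplified on purpose: only the leading digits of the first three
    dot-separated parts are compared; suffixes such as "rc1" are ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    # Pad short versions such as "2.0" to three components.
    while len(parts) < 3:
        parts.append(0)
    return (1, 26, 4) <= tuple(parts) < (2, 0, 0)


# Example checks on plain version strings.
ok_version = within_pin("1.26.4")
bad_version = within_pin("2.0.0")
```

In an application this could be called as `within_pin(numpy.__version__)` before loading the FastText model.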

sudoskys commented 2 months ago

location: https://github.com/LlmKira/fast-langdetect/commit/17b159ae7ee2eaa33dcf4014810644a12a0cb6b4#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711

"numpy>=1.26.4,<2.0.0"

neutron-nerve[bot] commented 2 months ago

Issue Report: Bug - FastText Incompatibility with NumPy >= 2.0.0

Issue Summary

An issue was identified in the fast_langdetect library where the FastText model was incompatible with NumPy versions greater than or equal to 2.0.0. The specific error encountered was:

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).

The incompatibility caused the unit test test_detect_totally to fail at the model.predict call inside detect_language, because NumPy 2.0.0 changed the meaning of the copy keyword in np.array.

Root Cause

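The error occurred because FastText calls np.array(probs, copy=False) on a tuple of probabilities. In NumPy 1.x, copy=False meant "copy only if needed". In NumPy 2.0.0 it means "never copy", and the call raises a ValueError when a copy cannot be avoided, as described in the migration guide. Building an array from a tuple always requires a copy, so the call fails on NumPy >= 2.0.0.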
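
The behaviour change can be reproduced without FastText. The following is a minimal sketch, assuming only that NumPy is installed; the tuple values are made up for illustration. It shows that np.asarray works on either major version, while np.array(..., copy=False) on a tuple raises only on NumPy 2.x.

```python
import numpy as np

# A plain tuple, like the `probs` value in the traceback (values are illustrative).
probs = (0.98, 0.01)

# np.asarray copies only when it has to, on both NumPy 1.x and 2.x.
arr = np.asarray(probs)

# On NumPy 2.x, copy=False means "never copy". A tuple has no array buffer
# to reuse, so the conversion needs a copy and ValueError is raised.
# On NumPy 1.x the same call silently copies.
try:
    np.array(probs, copy=False)
    strict_copy_false = False
except ValueError:
    strict_copy_false = True

numpy_major = int(np.__version__.split(".")[0])
print(f"NumPy {np.__version__}: copy=False raised -> {strict_copy_false}")
```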

Resolution

To resolve the incompatibility, the project's requirements were updated to restrict the version of NumPy to less than 2.0.0. Specifically, the following change was made to the requirements.txt file:

numpy>=1.26.4,<2.0.0

This adjustment ensures that the project remains compatible with NumPy versions that do not introduce the breaking change.

Final Outcome

The issue was successfully resolved by the contributor @sudoskys. The project's requirements now specify an appropriate range for the NumPy version, avoiding the incompatibility with NumPy 2.0.0 and ensuring stable functionality for fast_langdetect.

Appreciations

We extend our gratitude to @sudoskys for promptly addressing this issue and providing a solution. The community's swift action ensures the continued reliability and performance of the fast_langdetect library.


Report Prepared By:
LlmKira Contributors