facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Question: how to use model.predict with a specific threshold #1275

Open miguelusque opened 2 years ago

miguelusque commented 2 years ago

Hi,

I am trying to detect the language of each sentence in a file containing the following sentences:

Esto es español
Esto es una prueba
This is English
Esto es English
This is una prueba
Hola, ¿qué tal estás?

I am using the lid.176.bin model, available here.

When I invoke .predict() with k=1 and threshold=0.0, I receive the following output:

([['__label__es'], ['__label__es'], ['__label__en'], ['__label__es'], ['__label__es'], ['__label__es']], [array([0.99675655], dtype=float32), array([0.9974708], dtype=float32), array([0.9877303], dtype=float32), array([0.9194267], dtype=float32), array([0.78018343], dtype=float32), array([0.99880266], dtype=float32)])

Please note that the fifth prediction has a probability of 0.78018343.

To keep only the results with a probability greater than or equal to 0.8, I invoke the .predict() method with threshold=0.8 and obtain the following results:

([['__label__es'], ['__label__es'], ['__label__en'], ['__label__es'], [], ['__label__es']], [array([0.99675655], dtype=float32), array([0.9974708], dtype=float32), array([0.9877303], dtype=float32), array([0.9194267], dtype=float32), array([], dtype=float32), array([0.99880266], dtype=float32)])

My use case is to keep only the sentences whose prediction meets a specific threshold. When .predict() is called with a threshold, it is not easy to link the input sentences to the results it returns (or at least, I do not know how to do it). It might be useful to add a parameter that makes .predict() return not only the labels and probabilities, but also the input associated with each result.
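For reference, a minimal sketch of one way to link inputs to thresholded results. It assumes that .predict() returns one entry per input sentence, in input order, which the output above suggests (the empty list sits in fifth position). The helper name `keep_predicted` is hypothetical and not part of fastText.

```python
from typing import List, Sequence, Tuple


def keep_predicted(
    sentences: Sequence[str],
    labels: Sequence[Sequence[str]],
    probs: Sequence[Sequence[float]],
) -> List[Tuple[str, str, float]]:
    """Pair each sentence with its top prediction, skipping empty results.

    Assumption: labels[i] and probs[i] belong to sentences[i], and an
    entry is empty when no label reached the threshold.
    """
    kept = []
    for sentence, sent_labels, sent_probs in zip(sentences, labels, probs):
        # len() works for both plain lists and NumPy arrays.
        if len(sent_labels) > 0:
            kept.append((sentence, sent_labels[0], float(sent_probs[0])))
    return kept


# Possible usage with fastText (not executed here):
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   labels, probs = model.predict(sentences, k=1, threshold=0.8)
#   for sentence, label, prob in keep_predicted(sentences, labels, probs):
#       print(sentence, label, prob)
```

If the positional assumption does not hold for a given fastText version, this pairing would be wrong, so it is worth checking that the number of results equals the number of inputs first.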

I am aware that I could use threshold=0.0, loop over the results, and filter them manually. But, at least in my case, the results that .predict() returns for multiple sentences when the threshold parameter is set are not very useful (or, more likely, I am missing something).
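The manual filtering mentioned above could look roughly like the following sketch. It assumes .predict() was called with k=1 and threshold=0.0, so every sentence has exactly one label and one probability. The helper name `split_by_probability` is hypothetical.

```python
from typing import List, Sequence, Tuple

Prediction = Tuple[str, str, float]  # (sentence, label, probability)


def split_by_probability(
    sentences: Sequence[str],
    labels: Sequence[Sequence[str]],
    probs: Sequence[Sequence[float]],
    min_prob: float,
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split top-1 predictions into (kept, rejected) around min_prob.

    Assumption: every entry of labels and probs holds exactly one item,
    as is the case for k=1 with threshold=0.0.
    """
    kept: List[Prediction] = []
    rejected: List[Prediction] = []
    for sentence, sent_labels, sent_probs in zip(sentences, labels, probs):
        item = (sentence, sent_labels[0], float(sent_probs[0]))
        # Sentences at or above the cutoff are kept, the rest rejected.
        (kept if item[2] >= min_prob else rejected).append(item)
    return kept, rejected


# Possible usage with fastText (not executed here):
#   labels, probs = model.predict(sentences, k=1, threshold=0.0)
#   kept, rejected = split_by_probability(sentences, labels, probs, 0.8)
```

Unlike passing a threshold to .predict(), this keeps the rejected sentences together with their scores, which can help when tuning the cutoff.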

Any help would be welcome. Thanks!

Miguel