davidberenstein1957 / classy-classification

This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.
MIT License

Multilabel returns scientific notation with big dataset #19

Closed andremacola closed 1 year ago

andremacola commented 1 year ago

The dataset has over 3000 sentences with labels in each category.

Getting scores back in scientific notation causes inconsistencies when I split the text into sentences and perform calculations on the final score.

I also still doubt whether the numbers in scientific notation are represented correctly. I need to run some additional tests.

import pickle

# Load the pickled multilabel classifier.
# (The few-shot classifier lives at "../dumps/classifier.few.pkl".)
with open("../dumps/classifier.multilabel.pkl", "rb") as f:
    classifier = pickle.load(f)

prediction = classifier("Em jogo decisivo na Colômbia, América-MG enfrenta Tolima e busca primeira vitória na fase de grupos da Libertado. Saiba onde assistir a Flamengo x Goiás pelo Brasileirão 2022")
print(prediction)
{'Arte e Entretenimento': 2.9263447e-06, 'Economia': 0.0006199917, 'Esporte': 0.99947697, 'Games': 8.631342e-06, 'Moda': 3.58305e-08, 'Politica': 0.0005317643, 'Pornografia': 1.0400859e-14, 'Saude': 7.639193e-09, 'Sexualidade': 0.00030366945, 'Tecnologia': 0.00012137198, 'Violencia e Crime': 1.5593606e-07}

tomaarsen commented 1 year ago

Hello!

The scientific notation is simply how Python will print a small number:

print(0.0000029263447)
2.9263447e-06
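
To make this concrete, here is a minimal sketch in plain Python, independent of classy-classification. It shows where the default float representation switches to scientific notation, and that the notation does not change the stored value. The variable names are only illustrative.

```python
# Python's default float repr switches to scientific notation
# once the value drops below 1e-4.
values = [0.001, 0.0001, 0.00001, 2.9263447e-06]
reprs = [repr(v) for v in values]
for text in reprs:
    print(text)

# Both literals parse to the same float; only the printed form differs.
same_value = 2.9263447e-06 == 0.0000029263447
print(same_value)

# Fixed-point formatting is a display choice and leaves the value untouched.
fixed = f"{2.9263447e-06:.13f}"
print(fixed)
```

So any arithmetic on these scores behaves the same regardless of how they are printed.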

classy-classification isn't responsible for this scientific notation; it simply returned a very small number for some of the entries. If you want to print these values in "normal" notation, then f-string formatting may come in handy:

prediction = {
    "Arte e Entretenimento": 2.9263447e-06,
    "Economia": 0.0006199917,
    "Esporte": 0.99947697,
    "Games": 8.631342e-06,
    "Moda": 3.58305e-08,
    "Politica": 0.0005317643,
    "Pornografia": 1.0400859e-14,
    "Saude": 7.639193e-09,
    "Sexualidade": 0.00030366945,
    "Tecnologia": 0.00012137198,
    "Violencia e Crime": 1.5593606e-07,
}
for category, likelihood in prediction.items():
    # Prefix a string with `f`, and then you can use braces: {}
    # Inside the braces you can put variables or computations (like `likelihood * 100`).
    # After a `:` you can tell Python how to print the value to the left of the colon.
    #
    # `<25` means "add padding on the right until the string is at least 25 characters long".
    # You could use `>25` if you want padding on the left instead.
    #
    # `.8f` means that we want to print this value in fixed-point notation (hence `f`),
    # with exactly 8 digits after the decimal point.
    print(f"{category:<25}: {likelihood * 100:.8f}%")

This outputs:

Arte e Entretenimento    : 0.00029263%
Economia                 : 0.06199917%
Esporte                  : 99.94769700%
Games                    : 0.00086313%
Moda                     : 0.00000358%
Politica                 : 0.05317643%
Pornografia              : 0.00000000%
Saude                    : 0.00000076%
Sexualidade              : 0.03036694%
Tecnologia               : 0.01213720%
Violencia e Crime        : 0.00001559%

Perhaps this output is a bit easier to read and understand.
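
On the concern about doing calculations after splitting the text into sentences: the scores are ordinary floats, so they can be combined directly. Below is a rough sketch that averages per-sentence predictions. The helper `average_scores` is hypothetical (it is not part of classy-classification), and the second prediction uses made-up numbers purely for illustration.

```python
from statistics import fmean


def average_scores(predictions):
    """Average the score of each label over a list of prediction dicts.

    Hypothetical helper; assumes every dict has the same labels.
    """
    labels = predictions[0].keys()
    return {label: fmean(p[label] for p in predictions) for label in labels}


sentence_predictions = [
    # Scores taken from the prediction above (subset of labels).
    {"Esporte": 0.99947697, "Economia": 0.0006199917, "Moda": 3.58305e-08},
    # Made-up scores for a second sentence, for illustration only.
    {"Esporte": 0.98, "Economia": 0.02, "Moda": 1e-06},
]

averaged = average_scores(sentence_predictions)
top_label = max(averaged, key=averaged.get)

for label, score in averaged.items():
    print(f"{label:<25}: {score * 100:.8f}%")
print(top_label)
```

The tiny values take part in the average like any other float; the scientific notation only appears when they are printed without a format specifier.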