davidberenstein1957 / classy-classification

This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.
MIT License
211 stars 15 forks

Inconsistent Result while using a fix random seed #20

Closed: swageeth closed this issue 1 year ago

swageeth commented 1 year ago

I have been using spaCy with classy-classification to classify text messages, on Python 3.10.

Below is the training code. Run directly, it gives the Unknown category the highest score for this specific message:

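# Imports used by the snippets in this post
import pandas as pd
import spacy
import classy_classification  # needed so spaCy can find the "text_categorizer" factory

# Import training data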
with open ('SID - Commercial.txt', "r", encoding="utf8") as a:
    Commercial = a.read().splitlines()

with open ('SID - Crypto.txt', "r", encoding="utf8") as b:
    Crypto = b.read().splitlines()

with open ('SID - Extortion.txt', "r", encoding="utf8") as c:
    Extortion = c.read().splitlines()

with open ('SID - Financial.txt', "r", encoding="utf8") as d:
    Financial = d.read().splitlines()

with open ('SID - Gambling.txt', "r", encoding="utf8") as e:
    Gambling = e.read().splitlines()

with open ('SID - Gift.txt', "r", encoding="utf8") as f:
    Gift = f.read().splitlines()

with open ('SID - Investment.txt', "r", encoding="utf8") as g:
    Investment = g.read().splitlines()    

with open ('SID - Invoice.txt', "r", encoding="utf8") as h:
    Invoice = h.read().splitlines()  

with open ('SID - Phishing.txt', "r", encoding="utf8") as i:
    Phishing = i.read().splitlines() 

with open ('SID - Romance.txt', "r", encoding="utf8") as j:
    Romance = j.read().splitlines() 

with open ('SID - Unknown.txt', "r", encoding="utf8") as k:
    Unknown = k.read().splitlines() 

data = {}
data["Commercial"] = Commercial
data["Crypto"] = Crypto
data["Extortion"] = Extortion
data["Financial"] = Financial
data["Gambling"] = Gambling
data["Gift"] = Gift
data["Investment"] = Investment
data["Invoice"] = Invoice
data["Phishing"] = Phishing
data["Romance"] = Romance
data["Unknown"] = Unknown

# NLP model
spacy.util.fix_random_seed(0)
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("text_categorizer", 
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "cat_type": "multi-label",
        "device": "gpu"
    }
)

print(nlp("FW: 𝚁𝙴: 𝚈𝚘𝚞 𝚑𝚊𝚟𝚎 𝚘𝚗𝚎 (𝟷) 𝚘𝚛𝚍𝚎𝚛 𝚙𝚎𝚗𝚍𝚒𝚗𝚐 𝚍𝚎𝚕𝚒𝚟𝚎𝚛𝚢. #622460835")._.cats)

Result (which is correct, as Unknown has the highest score): {'Commercial': 0.13948287736862833, 'Crypto': 0.015437351941468657, 'Extortion': 0.0860014895963152, 'Financial': 0.01987490991768424, 'Gambling': 0.029074990906618126, 'Gift': 0.06850244399154756, 'Investment': 0.012729882351053419, 'Invoice': 0.0718818617408037, 'Phishing': 0.046637490542787444, 'Romance': 0.05515818363916855, 'Unknown': 0.45521851800392493}
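As a side note, the eleven near-identical `with` blocks in the training code above could be collapsed into a loop. The following is only a sketch: the helper name `load_training_data` and its parameters are hypothetical, and it assumes every file follows the `SID - <Category>.txt` naming pattern used in the post.

```python
from pathlib import Path

# Category names taken from the file names in the post.
CATEGORIES = [
    "Commercial", "Crypto", "Extortion", "Financial", "Gambling", "Gift",
    "Investment", "Invoice", "Phishing", "Romance", "Unknown",
]


def load_training_data(folder=".", categories=CATEGORIES, prefix="SID - "):
    """Hypothetical helper: read one text file per category, one example per line."""
    data = {}
    for name in categories:
        path = Path(folder) / f"{prefix}{name}.txt"
        data[name] = path.read_text(encoding="utf8").splitlines()
    return data
```

Calling `data = load_training_data()` in the directory that holds the files would produce the same `data` dictionary that is passed to `nlp.add_pipe` in the post.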

Importing test dataset:

# Import the dataset and assign scores
Messages = pd.read_csv('November SID2.csv', encoding='utf8')

Messages['Body'] = Messages['Body'].astype(str)
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)

#Find the category based on highest score and join back to the original dataset

Scores = pd.json_normalize(Messages.NLP_Result)
Scores['Category'] = Scores.idxmax(axis=1)
Scores['Category'] = Scores['Category'].replace('_', ' ', regex=True)

Messages_Final = pd.concat([Messages, Scores], axis=1)
Messages_Final.to_csv('out.csv', index=False)

The result in csv file for that same statement:

| Body | NLP_Result | Commercial | Crypto | Extortion | Financial | Gambling | Gift | Investment | Invoice | Phishing | Romance | Unknown | Category |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| FW: 𝚁𝙴: 𝚈𝚘𝚞 𝚑𝚊𝚟𝚎 𝚘𝚗𝚎 (𝟷) 𝚘𝚛𝚍𝚎𝚛 𝚙𝚎𝚗𝚍𝚒𝚗𝚐 𝚍𝚎𝚕𝚒𝚟𝚎𝚛𝚢. #622460835 | {'Commercial': 0.03343028275903707, 'Crypto': 0.012076486026176284, 'Extortion': 0.08983918751534335, 'Financial': 0.07360790896376578, 'Gambling': 0.014564933067751274, 'Gift': 0.08460245841797985, 'Investment': 0.017324353297565327, 'Invoice': 0.1522007262418396, 'Phishing': 0.4507937431127887, 'Romance': 0.010566873139864728, 'Unknown': 0.060993047457888194} | 0.03343 | 0.012076 | 0.089839 | 0.073608 | 0.014565 | 0.084602 | 0.017324 | 0.152201 | 0.450794 | 0.010567 | 0.060993 | Phishing |
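To make the mismatch concrete, the two score dictionaries for this message can be compared label by label. This is a small illustrative sketch: the helper names `top_label` and `compare_runs` are made up for this example, and the scores are the values reported in this post rounded to four decimals.

```python
def top_label(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


def compare_runs(first, second):
    """Return both top labels and the label whose score moved the most."""
    deltas = {label: second[label] - first[label] for label in first}
    biggest_shift = max(deltas, key=lambda label: abs(deltas[label]))
    return top_label(first), top_label(second), biggest_shift


# Scores from the direct nlp(...) call (rounded).
direct = {
    "Commercial": 0.1395, "Crypto": 0.0154, "Extortion": 0.0860,
    "Financial": 0.0199, "Gambling": 0.0291, "Gift": 0.0685,
    "Investment": 0.0127, "Invoice": 0.0719, "Phishing": 0.0466,
    "Romance": 0.0552, "Unknown": 0.4552,
}
# Scores for the same message in the CSV output (rounded).
from_csv = {
    "Commercial": 0.0334, "Crypto": 0.0121, "Extortion": 0.0898,
    "Financial": 0.0736, "Gambling": 0.0146, "Gift": 0.0846,
    "Investment": 0.0173, "Invoice": 0.1522, "Phishing": 0.4508,
    "Romance": 0.0106, "Unknown": 0.0610,
}

print(compare_runs(direct, from_csv))
```

The top label flips from Unknown to Phishing, and Phishing is also the label whose score changes the most between the two runs.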

Why are the results inconsistent even though my training code calls `spacy.util.fix_random_seed(0)`?

Thank you

davidberenstein1957 commented 1 year ago

This is the case because the classifiers are not spaCy-specific. They use torch or scikit-learn.

swageeth commented 1 year ago

> This is the case because the classifiers are not spaCy-specific. They use torch or scikit-learn.

Thank you. Did you mean that the underlying spaCy library uses torch and scikit-learn? I haven't imported either of them for this project. Is there any way I can make the results consistent?

davidberenstein1957 commented 1 year ago

Yes, indeed. It uses them internally and assigns the predictions to a spacy.Doc under doc._.cats.
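For readers wondering what seeding those other libraries usually involves, here is a generic sketch. It is not part of classy-classification's API, the function name `seed_everything` is hypothetical, and whether it is sufficient depends on how the library builds its classifier internally. It seeds Python's `random`, and NumPy and torch only if they are installed; scikit-learn estimators created without an explicit `random_state` draw from NumPy's global generator.

```python
import random


def seed_everything(seed: int) -> list[str]:
    """Seed the common global RNGs; return the names of those that were seeded."""
    seeded = ["random"]
    random.seed(seed)

    try:
        import numpy as np  # optional: scikit-learn falls back to this global RNG
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass

    try:
        import torch  # optional: only seeded when torch is installed
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        seeded.append("torch")
    except ImportError:
        pass

    return seeded
```

Such a call would have to happen before the pipeline component is added, since that is when the classifier is fitted on the training data. Some GPU operations can remain nondeterministic even with fixed seeds.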

davidberenstein1957 commented 1 year ago

@swageeth does it make sense to you to align the seed with the spacy seed?

swageeth commented 1 year ago

> @swageeth does it make sense to you to align the seed with the spacy seed?

How can I align it with the spaCy seed in my code?

davidberenstein1957 commented 1 year ago

As of now, that isn't possible, but I can include it in an upcoming release.

swageeth commented 1 year ago

> As of now, that isn't possible, but I can include it in an upcoming release.

Yes, please.

davidberenstein1957 commented 1 year ago

@swageeth I will take a look at it the week after Christmas.

davidberenstein1957 commented 1 year ago

@swageeth this was handled in the new 0.5.4 release. Happy Holidays 🎄

swageeth commented 1 year ago

Thanks @davidberenstein1957. Happy holidays to you too! :)