OWASP-BLT / BLT

OWASP BLT is a bug logging tool for reporting issues and earning points; companies are held accountable.
https://blt.owasp.org
GNU Affero General Public License v3.0

Select a pre-trained model or fine-tune a sentiment analysis model. #2329

Open DonnieBLT opened 2 weeks ago

Uttkarsh-raj commented 2 weeks ago

I think we can go with DistilBERT, which is maintained by Hugging Face and has several advantages over the models generally used for sentiment analysis.

DistilBERT

The DistilBERT model was proposed in the blog post "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT" and the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than google-bert/bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
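For reference, here is a minimal sketch of running sentiment analysis with a DistilBERT checkpoint through the Hugging Face `transformers` pipeline (the sample input is made up):

```python
# pip install transformers torch
from transformers import pipeline

# DistilBERT fine-tuned on SST-2, the stock sentiment checkpoint on the Hub.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

result = classifier("The login page crashes every time I submit the form.")
print(result)  # e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```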

BERT

Google developed BERT as a bidirectional transformer model that examines words in text by considering both left-to-right and right-to-left context. It helps computer systems understand text rather than generate it, which is what GPT models are designed to do. BERT excels at NLU tasks, including sentiment analysis, which makes it a good fit for search queries and customer feedback.

How is BERT different from GPT?

GPT models differ from BERT in both their objectives and their use cases. GPT models are a form of generative AI that produces original text and other content, and they're well suited to summarizing long or hard-to-interpret text. BERT and GPT differ not only in scope and applications but also in architecture: BERT is an encoder-only model built for understanding, while GPT is a decoder-only model built for generation.

Sarthak5598 commented 2 weeks ago

I have two issues with this:

First, it requires a powerful PC or a premium server to use and train the model, as free tiers can't handle the load.

[Screenshot: recommended specifications for better performance]

The second issue is: why use something complex when a simple machine learning approach could suffice? The functionality is very basic, and with just logistic regression I achieved 92% accuracy. By stacking multiple machine learning algorithms, the accuracy could easily reach 98% or even higher. (A sketch of that baseline is below.)
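A minimal sketch of that kind of baseline with scikit-learn, assuming a hypothetical `bug_reports.csv` with `text` and `label` columns (not an actual BLT dataset):

```python
# pip install scikit-learn pandas
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical labeled bug-report dataset.
df = pd.read_csv("bug_reports.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# TF-IDF features feeding a plain logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=20_000),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

print(f"accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
```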

Please share your thoughts on this @Uttkarsh-raj @DonnieBLT @arkid15r @AtmegaBuzz

Uttkarsh-raj commented 2 weeks ago

This is something we should definitely look into. As for the training part, Google Colab does provide the resources to train the model. I trained a model on this and was able to achieve around 90% accuracy, but the accuracy depends entirely on the dataset selected. DistilBERT is able to understand the context of a sentence, so if a new word or arrangement of words is encountered, it can still handle it, which is not possible with a trained regression model. Also, once trained, you don't need to retrain the model on every request. I think we can see if we can get a hosting platform for a minimal cost, since we'd have to host the model somewhere anyway, but I guess we can try hosting it on the same server where the backend is currently hosted. Would definitely like the mentors' opinion on this, though.
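On the "no retraining per request" point, the usual `transformers` pattern is to save the fine-tuned weights once and load them at backend startup. A sketch, using the stock SST-2 checkpoint as a stand-in for a model fine-tuned in Colab, with a made-up directory name:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in for a DistilBERT model fine-tuned in Colab.
base = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Save once after training...
model.save_pretrained("blt-sentiment-distilbert")
tokenizer.save_pretrained("blt-sentiment-distilbert")

# ...then the backend loads from disk at startup; no retraining per request.
model = AutoModelForSequenceClassification.from_pretrained("blt-sentiment-distilbert")
tokenizer = AutoTokenizer.from_pretrained("blt-sentiment-distilbert")
```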

AtmegaBuzz commented 2 weeks ago

True, running an LLM just for simpler tasks will consume a lot of resources. Try TensorFlow or pre-built traditional models from GitHub.
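If we tried TensorFlow, a lightweight classifier could be as small as this sketch (the toy data stands in for a real labeled dataset):

```python
# pip install tensorflow
import tensorflow as tf
from tensorflow.keras import layers

# Toy data standing in for a real dataset (0 = negative, 1 = positive).
texts = tf.constant(["app crashes on login", "love the new dashboard"])
labels = tf.constant([0.0, 1.0])

vectorize = layers.TextVectorization(max_tokens=10_000, output_sequence_length=32)
vectorize.adapt(texts)

model = tf.keras.Sequential([
    vectorize,                              # raw strings -> token ids
    layers.Embedding(10_000, 32),           # token ids -> dense vectors
    layers.GlobalAveragePooling1D(),        # average over the sequence
    layers.Dense(1, activation="sigmoid"),  # binary sentiment score
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(texts, labels, epochs=3, verbose=0)
```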

Sarthak5598 commented 2 weeks ago

So, as an update: I increased the dataset to almost 5k, and I think focusing on that is really important. Other than that, I tried stacking ensemble learning (one of the three main ways of combining multiple models into one) for this, and if we use the right models I think we can do a lot better and wouldn't need to buy servers or anything. Training also only needs to be done once here; we can use the joblib library to save the model and reuse it.
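A sketch of that idea with scikit-learn's `StackingClassifier` plus `joblib` persistence; the base estimators and toy data are assumptions, not the exact models used:

```python
# pip install scikit-learn joblib
import joblib
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data standing in for the ~5k-sample dataset.
texts = ["app crashes on login", "great fix, works well",
         "payment page broken", "love the new dashboard"]
labels = ["negative", "positive", "negative", "positive"]

model = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("rf", RandomForestClassifier(n_estimators=100)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=2,  # tiny toy data; leave the default with a real dataset
    ),
)
model.fit(texts, labels)

# Train once, save, and reload on later requests instead of retraining.
joblib.dump(model, "sentiment_stack.joblib")
model = joblib.load("sentiment_stack.joblib")
print(model.predict(["the search feature is broken"]))
```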

Sarthak5598 commented 2 weeks ago

[Screenshot] I have achieved almost 96% accuracy, but we need to work on improving the dataset.

Uttkarsh-raj commented 2 weeks ago

I was trying to create a dataset from the current issues on the server, but there are some problems with this:

So the problem of sourcing the dataset still remains. @DonnieBLT, what should we do about this?

Sarthak5598 commented 1 week ago

If you want to use that approach, the best idea would be web scraping, but not directly the way you're doing it. Instead, we can provide users all the label options from now on, and within a month or two we will have a good dataset that can be used for training. But the issue is whether it will be enough, because I created a dataset of about 5k and it's still not enough; we would need around 10k. Also, my dataset is not that good, as it was generated by GPT.

Sarthak5598 commented 1 week ago

How about we add the labeling option as I said, and after we get a dataset of about 2k we can use it? We will also store future bugs and add them to the dataset, and once we find the dataset and our model good enough, we can use earlier bugs too. This will give us a big dataset. Not sure if this is the best approach.
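Assuming the backend is Django (as BLT's is), collecting labels going forward could be a small model like this purely hypothetical sketch; none of these names reflect BLT's actual schema:

```python
# Hypothetical Django model for capturing user-chosen sentiment labels
# on reported issues; every name here is illustrative.
from django.db import models

class SentimentLabel(models.TextChoices):
    POSITIVE = "positive", "Positive"
    NEUTRAL = "neutral", "Neutral"
    NEGATIVE = "negative", "Negative"

class IssueSentiment(models.Model):
    issue_id = models.PositiveIntegerField()  # an FK to the real Issue model in practice
    label = models.CharField(max_length=16, choices=SentimentLabel.choices)
    created_at = models.DateTimeField(auto_now_add=True)
```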

Uttkarsh-raj commented 1 week ago

> If you want to use that approach, the best idea would be web scraping, but not directly the way you're doing it. Instead, we can provide users all the label options from now on, and within a month or two we will have a good dataset that can be used for training. But the issue is whether it will be enough, because I created a dataset of about 5k and it's still not enough; we would need around 10k. Also, my dataset is not that good, as it was generated by GPT.

I would have been in favor of this, but currently the bugs being reported are mostly from anonymous users, and mostly about the BLT app itself. I'm also not sure of the real traffic on the application currently. And we can't count on web scraping, because some sites have protections to prevent scraping, which can break the scraper. We can't rely only on the bugs reported on BLT for the dataset; we need other sources too. I tried looking for such a dataset on Kaggle too, but no success.

Sarthak5598 commented 1 week ago

There are no such datasets online, and if that's the case then we will have to use GPT for it, at least for now. One thing you can look into is Jira; if you can find a dataset of its issues, that would be more than enough. (Research what Jira is first.)

Uttkarsh-raj commented 1 week ago

I have worked with Jira before but didn't know that it provided a dataset too. Thanks for the info, but I could only find this: https://zenodo.org/records/5901804#:~:text=Description,using%20the%20Jira%20API%20V2.

Sarthak5598 commented 1 week ago

I don't think they share any dataset officially. It was just an idea; if we can get our hands on any dataset, it would be helpful.