Vis4Sense / student-projects

Repository for all the UG and PGT final projects

LLM trading - sentiment analysis comparison #61

Open kaidatavis opened 12 months ago

kaidatavis commented 12 months ago

Sentiment benchmark

(Please create a table to compare the different options, following the format in #59 and #60.)

kaidatavis commented 12 months ago

Sentiment performance of different LLMs

See #62 for a list of LLMs.

Please create a table to compare the different options following the format in #59 and #60.

Please fill the table with 1) information and 2) a link to the source if available.

erwin27 commented 12 months ago

LLMs Benchmark on Financial News Dataset

Test dataset: Financial PhraseBank — English sentences from financial news, each classified as positive, negative, or neutral by researchers knowledgeable in the finance domain.

This benchmark uses 1,000 sampled records (random seed = 42) from the "sentences_allagree" subset. The label distribution of the test set matches the subset population (Neutral: 61.4%, Positive: 25.2%, Negative: 13.4%).

Dataset example:

| Sentence | Label |
| --- | --- |
| The mall is part of the Baltic Pearl development project in the city of St Petersburg, where Baltic Pearl CJSC, a subsidiary of Shanghai Foreign Joint Investment Company, is developing homes for 35,000 people. | 1 (neutral) |
| In the reporting period, net sales rose by 8% year-on-year to EUR 64.3m, due to the business acquisitions realized during the first half of 2008-09, the effect of which was EUR 10.9m in the review period. | 2 (positive) |
| Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing. | 0 (negative) |
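The 1,000-record sample could be drawn along these lines — a minimal sketch with a toy, label-only population standing in for the ~2,264-sentence subset; `sample_test_set` is a hypothetical helper, not the benchmark's actual code:

```python
import random
from collections import Counter

# Toy population approximating the reported subset proportions
# (0 = negative, 1 = neutral, 2 = positive); real sentences are not reproduced here.
population = [0] * 303 + [1] * 1391 + [2] * 570

def sample_test_set(records, n=1000, seed=42):
    # Reproducible draw: the same seed yields the same 1,000 records every run.
    return random.Random(seed).sample(records, n)

sample = sample_test_set(population)
dist = {label: count / len(sample) for label, count in Counter(sample).items()}
print(dist)  # roughly {1: 0.614, 2: 0.252, 0: 0.134}
```

Because the sample is a large fraction of the subset, the drawn label distribution stays close to the population proportions even without explicit stratification.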

Models tested are from gpt4all.io (4-bit quantization).

Hardware (laptop) specs used for testing:

- OS: Windows 11
- CPU: 12th Gen Intel i7-12700H, 4.70 GHz
- RAM: 32 GB DDR5, 4800 MHz
- GPU: Nvidia RTX 3070 Ti, 8 GB
- Storage: 1 TB M.2 SSD

Testing environment:

- Python 3.11.5
- gpt4all 2.0.2
- torch 2.1.1+cu121
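The zero-shot setup can be sketched roughly as below. The prompt wording and the `classify` helper are illustrative assumptions, not the exact benchmark code; the `generate` call follows the gpt4all 2.0.2 Python bindings:

```python
# Zero-shot prompt construction and output parsing for sentiment labels.
PROMPT_TEMPLATE = (
    "Classify the sentiment of this financial news sentence as "
    "positive, negative, or neutral.\n\nSentence: {sentence}\nSentiment:"
)

LABELS = {"negative": 0, "neutral": 1, "positive": 2}

def parse_label(raw):
    # Map free-form model output back to the dataset's integer labels;
    # returns None when no label word is found (counted as an error).
    text = raw.strip().lower()
    for word, label in LABELS.items():
        if word in text:
            return label
    return None

def classify(model, sentence):
    # `model` is a gpt4all.GPT4All instance, e.g.
    #   model = GPT4All("orca-2-13b.Q4_0.gguf")  # downloads the model if missing
    out = model.generate(
        PROMPT_TEMPLATE.format(sentence=sentence),
        max_tokens=8,
        temp=0.0,  # deterministic-ish decoding for benchmarking
    )
    return parse_label(out)
```

Parsing the model's free text back to an integer label is the fragile step; a low `max_tokens` keeps per-iteration time down on CPU/GPU-constrained laptops like the one above.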

Zero-Shot Benchmark Results (in progress)

| Model | Accuracy | Run Time | Avg Time/Iter | Spec Requirement (Params / Size / RAM) |
| --- | --- | --- | --- | --- |
| gpt4all-13b-snoozy-q4_0 | 0.3670 | 5 hr 37 min | 20.28 s | 13B / 6.86 GB / 16 GB |
| nous-hermes-llama2-13b.Q4_0 | 0.6460 | 5 hr 42 min | 20.52 s | 13B / 6.86 GB / 16 GB |
| wizardlm-13b-v1.2.Q4_0 | 0.8170 | 5 hr 36 min | 20.18 s | 13B / 6.86 GB / 16 GB |
| orca-2-13b.Q4_0 | 0.8820 | 5 hr 43 min | 20.56 s | 13B / 6.86 GB / 16 GB |
| orca-2-7b.Q4_0 | 0.5050 | 2 hr 53 min | 10.37 s | 7B / 3.56 GB / 8 GB |
| mistral-7b-openorca.Q4_0 | 0.6060 | 3 hr 1 min | 10.88 s | 7B / 3.83 GB / 8 GB |
| mistral-7b-instruct-v0.1.Q4_0 | 0.5520 | 3 hr 5 min | 11.10 s | 7B / 3.83 GB / 8 GB |
| gpt4all-falcon-q4_0 | 0.1630 | 2 hr 49 min | 10.13 s | 7B / 3.92 GB / 8 GB |
| mpt-7b-chat-merges-q4_0 | 0.1340 | 2 hr 31 min | 9.07 s | 7B / 3.54 GB / 8 GB |
| orca-mini-3b-gguf2-q4_0 | 0.3990 | 1 hr 27 min | 5.25 s | 3B / 1.84 GB / 4 GB |
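Accuracy here is presumably the fraction of exact label matches over the 1,000 predictions, with unparseable outputs counted as wrong — a minimal sketch (the `accuracy` helper is hypothetical):

```python
def accuracy(predictions, gold):
    # Fraction of exact label matches; None (unparseable output) never
    # equals an integer gold label, so it counts as an error.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# e.g. 882 correct out of 1,000 gives the 0.8820 reported for orca-2-13b
print(accuracy([1] * 882 + [0] * 118, [1] * 1000))  # -> 0.882
```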
kaidatavis commented 12 months ago

@erwin27, this is a very good start. Please keep filling the table as more results become available.

Did you have a chance to discuss with Xiruo about distributing the work? Maybe she can test some of the models while you test the others?

Finally, it would be good to test the 'strength' of the sentiment beyond positive, negative, and neutral. For example, there could be 'very positive' and 'a little positive', which would translate to sentiment scores of 1.8 or 1.2 if 2 is the most positive.
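One way to represent such a scale — the two positive scores come from the example above; the neutral and negative values are illustrative assumptions, not a decided design:

```python
# Hypothetical 5-point strength scale mapped onto a numeric score,
# where 2 would be the most positive. Only 1.8 and 1.2 are from the
# suggestion above; the other values are assumed for symmetry.
STRENGTH_SCORES = {
    "very positive": 1.8,
    "a little positive": 1.2,
    "neutral": 0.0,
    "a little negative": -1.2,
    "very negative": -1.8,
}

def sentiment_score(label):
    # Normalise the model's label text before lookup.
    return STRENGTH_SCORES[label.strip().lower()]
```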

erwin27 commented 12 months ago

@kaidatavis, yes, Xiruo and I have discussed this several times since Monday. We will continue with the rest of the models and may look for another finance-related dataset if possible. We will also try your recommendation on sentiment strength, as that would have a huge impact on the project if we implement it later. Thank you @kaidatavis