current12 / Stat-222-Project


NLP Features #20

Closed seanzhou1207 closed 4 months ago

seanzhou1207 commented 6 months ago
ijyliu commented 6 months ago

For embeddings, I'd suggest FinBERT, and possibly also a transformer classifier fine-tuned specifically for this task

First link on positive v. negative words doesn't work. Sounds pretty straightforward to me though. I'd say you can go ahead and download that dictionary because that's a baseline we will definitely use.

How do you plan to use the Loughran-McDonald finance dictionary? What's its value-add over just the positive/negative words? How easy is it to download and use?
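Once a dictionary is downloaded, the counting itself is straightforward. A minimal sketch (the word sets below are tiny illustrative stand-ins, not the real Loughran-McDonald lists, which are a separate download):

```python
# Sketch: dictionary-based sentiment from positive/negative word counts.
# LM_POSITIVE / LM_NEGATIVE are illustrative stand-ins for the real lists.
import re

LM_POSITIVE = {"achieve", "gain", "improve", "strong"}   # illustrative subset
LM_NEGATIVE = {"decline", "loss", "weak", "impair"}      # illustrative subset

def lm_sentiment(text: str) -> float:
    """Return (pos - neg) / (pos + neg), or 0.0 if no dictionary words match."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in LM_POSITIVE for t in tokens)
    neg = sum(t in LM_NEGATIVE for t in tokens)
    return (pos - neg) / (pos + neg) if (pos + neg) else 0.0
```

The same counting function works for any of the word lists discussed here; only the sets change.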

What is Alexandria?

For tone, what dictionaries would you plan to use to get values of active/passive, etc for the tone PCA?

I think for analyst engagement we may be limited to just counting question marks overall because it will be hard to parse out the questions segment.
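Counting question marks over the whole transcript is a one-liner; a sketch of a per-transcript rate (normalizing by length is an assumption, not something settled here):

```python
def question_mark_rate(transcript: str) -> float:
    """Question marks per 1,000 characters, a rough proxy for analyst engagement."""
    if not transcript:
        return 0.0
    return 1000 * transcript.count("?") / len(transcript)
```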

Have you looked into the textstat package yet? I think that will help with feature 4. There are also other simple text features it can generate easily.

I'd possibly suggest trimming the number of dictionaries and simple features in favor of spending more time on neural approaches. It seems like the sentiment positivity feature is kind of subsumed in the tone feature, so I would pick just one of those. I'd bet the paper for the tone feature has a comparison with just the ratio of positive/negative words vs. throwing it plus other stuff into PCA. Is the performance gain from PCA vs. just that ratio large? PCA would be less interpretable and require more work and dictionaries.
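One quick way to check whether PCA buys anything over the ratio is to build both on the same data and compare. A sketch with random stand-in word-rate columns (the column meanings and the use of scikit-learn here are assumptions):

```python
# Sketch: simple positive/negative ratio vs. a PCA "tone" score built from
# several per-transcript word-category rates (random stand-in data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# hypothetical columns: positive, negative, active, passive word rates
X = rng.random((100, 4))

ratio = (X[:, 0] - X[:, 1]) / (X[:, 0] + X[:, 1])    # baseline feature
tone = PCA(n_components=1).fit_transform(X)[:, 0]    # first PC as "tone"
print(np.corrcoef(ratio, tone)[0, 1])                # how much do they agree?
```

If the two are highly correlated on real data, the extra dictionaries for PCA may not be worth the interpretability cost.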

ijyliu commented 6 months ago

@OwenLin2001 @current12 what do you think? We can also send this to Libor once we've refined it

ijyliu commented 6 months ago

[image]

ijyliu commented 6 months ago

@seanzhou1207

I'd get started coding these (starting with the dictionaries and simple ones) if you haven't already

You can make a new notebook and test/debug on all_data_sample.csv and then scale up to all_data.parquet

Then you can save as all_data_with_nlp.parquet or something
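That workflow might look like the sketch below (`add_nlp_features` and its single placeholder feature are hypothetical; the commented file names are the ones suggested above):

```python
# Sketch: build feature columns with a reusable function, debug on the CSV
# sample, then scale up to the full parquet file.
import pandas as pd

def add_nlp_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add NLP feature columns; extend with the dictionary counts, etc."""
    out = df.copy()
    out["n_question_marks"] = out["transcript"].str.count(r"\?")
    return out

# Debug on the sample first, then swap in the full file:
#   df = pd.read_csv("all_data_sample.csv")
#   df = pd.read_parquet("all_data.parquet")
#   add_nlp_features(df).to_parquet("all_data_with_nlp.parquet")
df = pd.DataFrame({"transcript": ["Great quarter? Yes.", "No questions."]})
print(add_nlp_features(df)["n_question_marks"].tolist())
```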

ijyliu commented 6 months ago

input file is now all_data_fixed_quarter_dates.parquet

please output all_data_fixed_quarter_dates_NLP.parquet or something. (Parquet load times and upload/download speeds are much faster, storage is less than half as much (my computer is filling up 💀 and we might be approaching Box limits), and it will also be super easy to load just the feature columns, not the call transcript itself, in the future.)

ijyliu commented 5 months ago

parallelizing feature construction:

this is an embarrassingly parallel situation: each row of the data (each transcript) is entirely independent, so we have lots of options
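The simplest of those options is probably `multiprocessing.Pool` mapped over transcripts; a sketch (`extract_features` is a placeholder):

```python
# Sketch: each transcript is independent, so feature construction can be
# farmed out across processes with no coordination.
from multiprocessing import Pool

def extract_features(transcript: str) -> dict:
    """Placeholder per-transcript feature extractor."""
    return {"n_chars": len(transcript), "n_questions": transcript.count("?")}

if __name__ == "__main__":
    transcripts = ["Q1 call?", "Q2 call.", "Q3 call??"]
    with Pool(processes=4) as pool:
        rows = pool.map(extract_features, transcripts)
    print(rows)
```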

ijyliu commented 5 months ago

hey @seanzhou1207 do you anticipate having missing values for any of these constructed NLP variables - like if the ratio has zero in the denominator?

seanzhou1207 commented 5 months ago

> hey @seanzhou1207 do you anticipate having missing values for any of these constructed NLP variables - like if the ratio has zero in the denominator?

yes, but fewer than 5 missing values out of all ~7,000
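One way to make those few missing values explicit, rather than letting a zero denominator raise an error, is a guarded ratio helper, e.g.:

```python
# Sketch: ratio features that return NaN on a zero denominator instead of
# crashing, so the handful of missing values is explicit in the output.
import math

def safe_ratio(numerator: float, denominator: float) -> float:
    return numerator / denominator if denominator else math.nan

print(safe_ratio(3, 4))  # 0.75
print(safe_ratio(3, 0))  # nan
```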

ijyliu commented 5 months ago

Use: https://pypi.org/project/finbert-embedding/

seanzhou1207 commented 5 months ago
ijyliu commented 5 months ago

Added "Store average FINBERT embeddings for each call as well, may use them" as a feature
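Averaging per-call embeddings is just mean-pooling over sentence (or chunk) vectors; a sketch with random stand-in vectors in place of real FinBERT output:

```python
# Sketch: collapse per-sentence embedding vectors into one call-level vector
# by mean-pooling. The random vectors stand in for real FinBERT output.
import numpy as np

def average_embedding(sentence_vectors: np.ndarray) -> np.ndarray:
    """Collapse (n_sentences, dim) into a single (dim,) call embedding."""
    return sentence_vectors.mean(axis=0)

vecs = np.random.default_rng(0).random((12, 768))  # 12 sentences, 768 dims
call_vec = average_embedding(vecs)
print(call_vec.shape)  # (768,)
```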

seanzhou1207 commented 5 months ago
[image]

So far I've finished all features except for "tone". Issue: all words in the Harvard dictionary are unstemmed and some have multiple meanings. It's hard to map the words in the earnings calls to the dictionary.
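One possible workaround (an assumption, not something decided in the thread) is to stem both the dictionary entries and the transcript tokens before matching, e.g. with NLTK's Porter stemmer:

```python
# Sketch: stem both the (unstemmed) dictionary words and the transcript
# tokens so inflected forms still match. The dictionary words here are
# illustrative stand-ins, and this assumes nltk is installed.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
harvard_positive = {"able", "accomplish", "achieve"}          # illustrative
stemmed_dict = {stemmer.stem(w) for w in harvard_positive}

def count_matches(text: str) -> int:
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(stemmer.stem(t) in stemmed_dict for t in tokens)

print(count_matches("We achieved and accomplished our targets"))
```

This doesn't resolve the multiple-meanings problem, but it does recover matches on inflected forms.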

ijyliu commented 5 months ago

Just do the best you can. If we do EDA and discover it's not helpful, we can just ignore it

ijyliu commented 5 months ago

@seanzhou1207 did you save finbert embeddings or average finbert embeddings for each call? not seeing them

i can go make them. they seem like they might be directly useful

ijyliu commented 5 months ago

@seanzhou1207

what is the difference between these two variables

[image]

seanzhou1207 commented 5 months ago

> @seanzhou1207 did you save finbert embeddings or average finbert embeddings for each call? not seeing them
>
> i can go make them. they seem like they might be directly useful

No, FinBERT was taking too long, so I ended up getting positive and negative word counts using the Harvard dictionary as well.

seanzhou1207 commented 5 months ago

> @seanzhou1207
>
> what is the difference between these two variables
>
> [image]

No difference, sorry

ijyliu commented 5 months ago

I think FinBERT is pretty critical. For AutoGluon at least, the sentiment features don't perform super well. And I think the profs expect us to use it, since that was pretty key to the project topic.

Just do it in a .py file and request an A100. I'll go set up an sbatch script for you to use. If you can't figure out how to run it on GPU with the package, we can get the model from Hugging Face.
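A hypothetical sbatch wrapper along those lines (the partition, GRES syntax, resource numbers, and script name are all placeholders to adapt to the actual cluster):

```shell
#!/bin/bash
# Hypothetical Slurm job script for the FinBERT embedding run.
# Partition/GRES names vary by cluster; adjust to match.
#SBATCH --job-name=finbert-embeddings
#SBATCH --partition=gpu
#SBATCH --gres=gpu:A100:1
#SBATCH --time=04:00:00
#SBATCH --mem=32G

python finbert_embeddings.py
```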


ijyliu commented 5 months ago

Here's a shell script:

https://github.com/current12/Stat-222-Project/blob/main/Code/Exploratory%20Data%20Analysis/All%20Data/EDA_NER.sh

ijyliu commented 5 months ago

> @seanzhou1207 what is the difference between these two variables [image]
>
> No difference, sorry

OK, I'm going to delete readability then

ijyliu commented 4 months ago

Continued FinBERT construction.

ijyliu commented 4 months ago

FinBERT sentiment analysis complete; average FinBERT embeddings moved to issue #81