current12 / Stat-222-Project


NLP Features #20

Closed seanzhou1207 closed 4 months ago

seanzhou1207 commented 6 months ago
ijyliu commented 6 months ago

For embeddings, I'd suggest FinBERT, and possibly also a transformer classifier fine-tuned specifically for this task

First link on positive v. negative words doesn't work. Sounds pretty straightforward to me though. I'd say you can go ahead and download that dictionary because that's a baseline we will definitely use.

How do you plan to use the Loughran-McDonald finance dictionary? What's its value-add over just the positive/negative words? How easy is it to download and use?
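Once a dictionary is downloaded, the counting itself is straightforward. A minimal sketch (the word sets below are tiny illustrative stand-ins, not the real Loughran-McDonald lists, which are a separate download):

```python
# Sketch: dictionary-based sentiment from positive/negative word counts.
# LM_POSITIVE / LM_NEGATIVE are illustrative stand-ins for the real lists.
import re

LM_POSITIVE = {"achieve", "gain", "improve", "strong"}   # illustrative subset
LM_NEGATIVE = {"decline", "loss", "weak", "impair"}      # illustrative subset

def lm_sentiment(text: str) -> float:
    """Return (pos - neg) / (pos + neg), or 0.0 if no dictionary words match."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in LM_POSITIVE for t in tokens)
    neg = sum(t in LM_NEGATIVE for t in tokens)
    return (pos - neg) / (pos + neg) if (pos + neg) else 0.0
```

The same counting function works for any of the word lists discussed here; only the sets change.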

What is Alexandria?

For tone, what dictionaries would you plan to use to get values of active/passive, etc for the tone PCA?

I think for analyst engagement we may be limited to just counting question marks overall because it will be hard to parse out the questions segment.
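Counting question marks over the whole transcript is a one-liner; a sketch of a per-transcript rate (normalizing by length is an assumption, not something settled here):

```python
def question_mark_rate(transcript: str) -> float:
    """Question marks per 1,000 characters, a rough proxy for analyst engagement."""
    if not transcript:
        return 0.0
    return 1000 * transcript.count("?") / len(transcript)
```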

Have you looked into the textstat package yet? I think that will help with feature 4. There are also other simple text features it can generate easily.

I'd possibly suggest trimming the number of dictionaries and simple features in favor of spending more time on neural approaches. It seems like the sentiment positivity feature is kind of subsumed in the tone feature, so I would pick just one of those. I'd bet the paper for the tone feature has a comparison with just the ratio of positive/negative words vs. throwing it plus other stuff into PCA. Is the performance gain from PCA vs. just that ratio large? PCA would be less interpretable and require more work and dictionaries.
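One quick way to check whether PCA buys anything over the ratio is to build both on the same data and compare. A sketch with random stand-in word-rate columns (the column meanings and the use of scikit-learn here are assumptions):

```python
# Sketch: simple positive/negative ratio vs. a PCA "tone" score built from
# several per-transcript word-category rates (random stand-in data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# hypothetical columns: positive, negative, active, passive word rates
X = rng.random((100, 4))

ratio = (X[:, 0] - X[:, 1]) / (X[:, 0] + X[:, 1])    # baseline feature
tone = PCA(n_components=1).fit_transform(X)[:, 0]    # first PC as "tone"
print(np.corrcoef(ratio, tone)[0, 1])                # how much do they agree?
```

If the two are highly correlated on real data, the extra dictionaries for PCA may not be worth the interpretability cost.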

ijyliu commented 6 months ago

@OwenLin2001 @current12 what do you think? We can also send this to Libor once we've refined it

ijyliu commented 6 months ago

[image]

ijyliu commented 6 months ago

@seanzhou1207

I'd get started coding these (starting with the dictionaries and simple ones) if you haven't already

You can make a new notebook and test/debug on all_data_sample.csv and then scale up to all_data.parquet

Then you can save as all_data_with_nlp.parquet or something
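That workflow might look like the sketch below (`add_nlp_features` and its single placeholder feature are hypothetical; the commented file names are the ones suggested above):

```python
# Sketch: build feature columns with a reusable function, debug on the CSV
# sample, then scale up to the full parquet file.
import pandas as pd

def add_nlp_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add NLP feature columns; extend with the dictionary counts, etc."""
    out = df.copy()
    out["n_question_marks"] = out["transcript"].str.count(r"\?")
    return out

# Debug on the sample first, then swap in the full file:
#   df = pd.read_csv("all_data_sample.csv")
#   df = pd.read_parquet("all_data.parquet")
#   add_nlp_features(df).to_parquet("all_data_with_nlp.parquet")
df = pd.DataFrame({"transcript": ["Great quarter? Yes.", "No questions."]})
print(add_nlp_features(df)["n_question_marks"].tolist())
```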

ijyliu commented 6 months ago

input file is now all_data_fixed_quarter_dates.parquet

please output all_data_fixed_quarter_dates_NLP.parquet or something. (Parquet load times and upload/download speeds are much faster, storage is less than half as much (my computer is filling up 💀 and we might be approaching Box limits), and it will also be super easy to load just the feature columns, not the call transcript itself, in the future.)

ijyliu commented 5 months ago

parallelizing feature construction:

this is an embarrassingly parallel situation: each row of the data (each transcript) is entirely independent, so we have lots of options
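The simplest of those options is probably `multiprocessing.Pool` mapped over transcripts; a sketch (`extract_features` is a placeholder):

```python
# Sketch: each transcript is independent, so feature construction can be
# farmed out across processes with no coordination.
from multiprocessing import Pool

def extract_features(transcript: str) -> dict:
    """Placeholder per-transcript feature extractor."""
    return {"n_chars": len(transcript), "n_questions": transcript.count("?")}

if __name__ == "__main__":
    transcripts = ["Q1 call?", "Q2 call.", "Q3 call??"]
    with Pool(processes=4) as pool:
        rows = pool.map(extract_features, transcripts)
    print(rows)
```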

ijyliu commented 5 months ago

hey @seanzhou1207 do you anticipate having missing values for any of these constructed NLP variables - like if the ratio has zero in the denominator?

seanzhou1207 commented 5 months ago

> hey @seanzhou1207 do you anticipate having missing values for any of these constructed NLP variables - like if the ratio has zero in the denominator?

yes, but fewer than 5 missing values out of all ~7,000
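One way to make those few missing values explicit, rather than letting a zero denominator raise an error, is a guarded ratio helper, e.g.:

```python
# Sketch: ratio features that return NaN on a zero denominator instead of
# crashing, so the handful of missing values is explicit in the output.
import math

def safe_ratio(numerator: float, denominator: float) -> float:
    return numerator / denominator if denominator else math.nan

print(safe_ratio(3, 4))  # 0.75
print(safe_ratio(3, 0))  # nan
```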

ijyliu commented 5 months ago

Use: https://pypi.org/project/finbert-embedding/

seanzhou1207 commented 5 months ago
ijyliu commented 5 months ago

Added "Store average FINBERT embeddings for each call as well, may use them" as a feature
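Averaging per-call embeddings is just mean-pooling over sentence (or chunk) vectors; a sketch with random stand-in vectors in place of real FinBERT output:

```python
# Sketch: collapse per-sentence embedding vectors into one call-level vector
# by mean-pooling. The random vectors stand in for real FinBERT output.
import numpy as np

def average_embedding(sentence_vectors: np.ndarray) -> np.ndarray:
    """Collapse (n_sentences, dim) into a single (dim,) call embedding."""
    return sentence_vectors.mean(axis=0)

vecs = np.random.default_rng(0).random((12, 768))  # 12 sentences, 768 dims
call_vec = average_embedding(vecs)
print(call_vec.shape)  # (768,)
```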

seanzhou1207 commented 5 months ago
[image]

So far I've finished all features except for "tone". Issue: all words in the Harvard dictionary are unstemmed and some have multiple meanings. It's hard to map the words in the earnings calls to the dictionary.
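One possible workaround (an assumption, not something decided in the thread) is to stem both the dictionary entries and the transcript tokens before matching, e.g. with NLTK's Porter stemmer:

```python
# Sketch: stem both the (unstemmed) dictionary words and the transcript
# tokens so inflected forms still match. The dictionary words here are
# illustrative stand-ins, and this assumes nltk is installed.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
harvard_positive = {"able", "accomplish", "achieve"}          # illustrative
stemmed_dict = {stemmer.stem(w) for w in harvard_positive}

def count_matches(text: str) -> int:
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(stemmer.stem(t) in stemmed_dict for t in tokens)

print(count_matches("We achieved and accomplished our targets"))
```

This doesn't resolve the multiple-meanings problem, but it does recover matches on inflected forms.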

ijyliu commented 5 months ago

Just do the best you can. If we do EDA and discover it's not helpful, we can just ignore it

ijyliu commented 5 months ago

@seanzhou1207 did you save finbert embeddings or average finbert embeddings for each call? not seeing them

i can go make them. they seem like they might be directly useful

ijyliu commented 5 months ago

@seanzhou1207

what is the difference between these two variables

[image]

seanzhou1207 commented 5 months ago

> @seanzhou1207 did you save finbert embeddings or average finbert embeddings for each call? not seeing them
>
> i can go make them. they seem like they might be directly useful

No, FinBERT was taking too long, so I ended up getting positive and negative word counts using the Harvard dictionary as well.

seanzhou1207 commented 5 months ago

> @seanzhou1207
>
> what is the difference between these two variables
>
> [image]

No difference, sorry

ijyliu commented 5 months ago

I think FinBERT is pretty critical. For AutoGluon at least, the sentiment features don't perform super well. And I think the profs expect us to use it, since that was pretty key to the project topic.

Just do it in a .py file and request an A100. I'll go set up an sbatch script for you to use. If you can't figure out how to run it on GPU with the package, we can get the model from Hugging Face.
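A hypothetical sbatch wrapper along those lines (the partition, GRES syntax, resource numbers, and script name are all placeholders to adapt to the actual cluster):

```shell
#!/bin/bash
# Hypothetical Slurm job script for the FinBERT embedding run.
# Partition/GRES names vary by cluster; adjust to match.
#SBATCH --job-name=finbert-embeddings
#SBATCH --partition=gpu
#SBATCH --gres=gpu:A100:1
#SBATCH --time=04:00:00
#SBATCH --mem=32G

python finbert_embeddings.py
```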


ijyliu commented 5 months ago

Here's a shell script:

https://github.com/current12/Stat-222-Project/blob/main/Code/Exploratory%20Data%20Analysis/All%20Data/EDA_NER.sh

ijyliu commented 5 months ago

> @seanzhou1207 what is the difference between these two variables [image]
>
> No difference, sorry

OK, I'm going to delete readability then

ijyliu commented 4 months ago

Continued FinBERT construction.

ijyliu commented 4 months ago

FinBERT sentiment analysis complete; average FinBERT embeddings moved to issue #81