current12 / Stat-222-Project

EDA on All Data (merged data file) #24

Closed ijyliu closed 5 months ago

ijyliu commented 6 months ago

Build on the work in: https://github.com/current12/Stat-222-Project/blob/main/Code/Exploratory%20Data%20Analysis/All%20Data%20EDA.ipynb

Items from #7, #8, #9

image

NLP EDA

Additional

ijyliu commented 6 months ago

send pdf of notebook run on full data

and notebook code of eda

and sample dataset

write simple description of data cleaning

cc TA Zhexiao

ijyliu commented 6 months ago

@current12 actually, can you make a separate notebook that does the NLP EDA on the transcript features? it's very slow to run them in the main notebook. we can just add it as an optional thing if they want to look at them.

Here's the code to load the dataset. You can focus on the Transcript column.

import pandas as pd

# Load in parquet file
# ~\Box\STAT 222 Capstone\Intermediate Data\all_data.parquet
df = pd.read_parquet(r'~\Box\STAT 222 Capstone\Intermediate Data\all_data.parquet')

ijyliu commented 6 months ago

@current12

Actually I'm finished with my part of

https://github.com/current12/Stat-222-Project/blob/main/Code/Exploratory%20Data%20Analysis/All%20Data%20EDA.ipynb

so if you want to do NLP and other stuff at the bottom of the same notebook, that's fine.

I left you the NLP stuff and also checking the tabular financial statement variables more carefully since you are most familiar with those variables. there's also code setup at the top of the notebook to switch the code over to using a sample dataset - since the actual data is 200MB+ and that's too much to send, we can submit with all_data_sample.csv, which is 100 firms.
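For reference, a firm-level sample like the 100-firm file can be drawn by sampling firms rather than rows, so every sampled firm keeps all of its observations. This is only a minimal sketch on toy data: the `ticker` column name and the `make_firm_sample` helper are assumptions for illustration, not the notebook's actual setup code.

```python
import pandas as pd


def make_firm_sample(df, firm_col, n_firms, seed=0):
    """Keep every row belonging to a random subset of n_firms firms."""
    # Sample from the unique firm identifiers, not from the rows
    firms = df[firm_col].drop_duplicates().sample(n=n_firms, random_state=seed)
    return df[df[firm_col].isin(firms)]


# Toy stand-in for the merged data (hypothetical column names)
full = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "CCC", "CCC", "DDD"],
    "year": [2010, 2011, 2010, 2012, 2013, 2011],
})

sample = make_firm_sample(full, "ticker", n_firms=2)
print(sample)
```

With the real data this would be called with `n_firms=100` and the result written out with `to_csv`.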

ijyliu commented 6 months ago

if it's not too hard/doesn't take long to re-run, i'd comment out the financial statements unit correction in this EDA file so that they can see the true values (even if they're weird-looking)

also, can you do transcript length in words, not characters? (or both, but words is more meaningful) this applies to both the mean calculation and the distribution plot

otherwise, looks good to me! make sure the final pdf we turn in is run on the full data.

ijyliu commented 6 months ago

for number of words, I suggest just the standard nltk tokenizer, then strip punctuation tokens and count
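A rough sketch of that word count is below. One assumption to flag: it uses `wordpunct_tokenize` from nltk instead of `word_tokenize`, because it is purely regex-based and needs no tokenizer data download. If nltk is not installed it falls back to the same regex. `count_words` is a made-up helper name.

```python
import re

try:
    # Regex-based nltk tokenizer; unlike word_tokenize it needs no data download
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    def wordpunct_tokenize(text):
        # Same pattern nltk's wordpunct tokenizer uses
        return re.findall(r"\w+|[^\w\s]+", text)


def count_words(text):
    """Tokenize, drop punctuation-only tokens, and count what is left."""
    tokens = wordpunct_tokenize(text)
    # Keep a token only if it has at least one letter or digit
    words = [t for t in tokens if any(ch.isalnum() for ch in t)]
    return len(words)


example = "Hello, world! Revenue grew 5% this quarter."
print(count_words(example))
```

On the dataframe this could be applied as `df['Transcript'].apply(count_words)` to get the per-transcript lengths for the mean and the distribution plot. Note that this tokenizer splits contractions like "don't" into several tokens, so counts will differ slightly from `word_tokenize`.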

current12 commented 6 months ago

> for number of words, I suggest just the standard nltk tokenizer, then strip punctuation tokens and count

I have uploaded the version with number of words. But I didn't comment out the financial statements unit correction.

ijyliu commented 6 months ago

if you're not going to comment it out, can you at least show the summary statistics before and after so there's a record of the original data?
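Something along these lines would keep a record of the original values next to the corrected ones. It is a sketch on toy data: `correct_units`, the factor of 1000, and the column names are placeholders, not the actual unit correction in the notebook.

```python
import pandas as pd


def correct_units(df, cols, factor):
    """Placeholder correction: rescale only the listed columns."""
    out = df.copy()
    out[cols] = out[cols] * factor
    return out


# Toy stand-in for the financial statement variables (hypothetical names)
raw = pd.DataFrame({
    "revenue": [1.0, 2.0, 3.0],
    "total_assets": [10.0, 20.0, 30.0],
    "year": [2010, 2011, 2012],
})
cols = ["revenue", "total_assets"]

corrected = correct_units(raw, cols, factor=1000.0)

# Side-by-side summary statistics, labelled before/after
comparison = pd.concat(
    {"before": raw[cols].describe(), "after": corrected[cols].describe()},
    axis=1,
)
print(comparison)
```

Because the correction only touches the columns passed in `cols`, other columns such as `year` come through unchanged.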

ijyliu commented 6 months ago

i'm also a little concerned because it looks like, for example, we obliterated all observations in 2010. so our year plot is inaccurate if we don't comment it out

ijyliu commented 6 months ago

you should fix it to not affect these

image

and only the other variables

current12 commented 6 months ago

I don't think it's due to the financial statements unit correction. You can see in the full dataset pdf that there are 2010 observations.

image

I think it's because the original sample data doesn't have any 2010 observations.
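A quick way to confirm this is to put the per-year counts of the full and sample data side by side. Minimal sketch with toy frames: `year_coverage` and the `year` column name are assumptions for illustration.

```python
import pandas as pd


def year_coverage(full, sample, year_col="year"):
    """Observation counts per year in the full data vs. the sample."""
    counts = pd.DataFrame({
        "full": full[year_col].value_counts(),
        "sample": sample[year_col].value_counts(),
    })
    # Years absent from one dataset show up as NaN; treat them as zero
    return counts.fillna(0).astype(int).sort_index()


# Toy stand-ins for the full and sample datasets
full_df = pd.DataFrame({"year": [2010, 2010, 2011, 2012]})
sample_df = pd.DataFrame({"year": [2011, 2012]})

coverage = year_coverage(full_df, sample_df)
missing_in_sample = coverage.index[coverage["sample"] == 0].tolist()
print(coverage)
print(missing_in_sample)
```

Any year listed in `missing_in_sample` is missing because of the sampling, not because of a later cleaning step.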

ijyliu commented 6 months ago

ok. still suggest adding summary stats before you correct it though

current12 commented 6 months ago

np

ijyliu commented 6 months ago

nice job, once that's done i'd say good to submit

anyone else with code comments/who wants to review, speak quickly

current12 commented 6 months ago

and I removed the correction part

current12 commented 6 months ago

I have uploaded the latest eda notebook and pdf.

ijyliu commented 6 months ago

@current12 i'm going to continue working on the remaining things here. feel free to work on the other issues you're on

ijyliu commented 5 months ago

split company dropout into a new issue #29