Keith-Cheung / FIN377-Final-Project


Feedback on init proposal #1

Open donbowen opened 4 months ago

donbowen commented 4 months ago

@Keith-Cheung @dannyhubert9 @jls224

This is a great and ambitious question. I like the start you have on data. A few thoughts below.

For the revision, you'll

This kind of project is probably best suited for a report/website, rather than a dashboard.


Some other things that might help:

import statsmodels.api as sm
import matplotlib.pyplot as plt
crime_data = sm.datasets.statecrime.load_pandas()
# if the dataset has too many dots, sample or use alpha
sm.graphics.plot_partregress(endog='murder', exog_i='hs_grad',
                             exog_others=['urban', 'poverty', 'single'],
                             data=crime_data.data, obs_labels=False)
plt.show()


import pandas as pd
import statsmodels.formula.api as smf
from plotnine import *
from binsreg import *

######################################
# Read the same data used for STATA
######################################
data = pd.read_csv("binsreg_sim.csv")
data.describe().T

########################################################
# EXTERNAL PLOT USING BINSREG OUTPUT
########################################################

# Run binsreg (w is all the other variables you want in the regression)
est = binsreg('y', 'x', 'w', data=data, line=(3,3), ci=(3,3), cb=(3,3), polyreg=4)

# Extract the plotting information
result = est.data_plot[0]

# Create the figure to plot
fig = ggplot() + labs(x='X', y='Y')

# Add the dots
fig += geom_point(data=result.dots, mapping=aes(x='x', y='fit'), color="blue", size=2, shape='o')

# Add the line
fig += geom_line(data=result.line, mapping=aes(x='x', y='fit'), color="blue", size=0.5)

# Add the CI
fig += geom_errorbar(data=result.ci, mapping=aes(x='x', ymin='ci_l', ymax='ci_r'), color="blue", size=0.5, width=0.02, linetype='solid')

# Add the CB
fig += geom_ribbon(data=result.cb, mapping=aes(x='x', ymin='cb_l', ymax='cb_r'), fill="blue", alpha=0.2)

# Add the polyreg
fig += geom_line(data=result.poly, mapping=aes(x='x', y='fit'), color="red", size=0.5)

# Display the plot
print(fig)



The result:
![image](https://github.com/Keith-Cheung/FIN377-Final-Project/assets/50885867/a395eb22-7368-451c-ac29-bd0a8dbf571c)

[binsreg_sim.csv](https://github.com/Keith-Cheung/FIN377-Final-Project/files/15000859/binsreg_sim.csv)

dannyhubert9 commented 4 months ago

Hi @donbowen, Sorry for the delay! Here are the answers to your questions:

(1) Regression Model: We plan to construct a regression model to examine the relationship between the number of startups created each year (dependent variable, y) and the number of people working from home (WFH) over the years (independent variable, X). The y variable will be our focal point, and the X variable, representing the size of WFH, will serve as our predictor. The model aims to show how changes in WFH affect the creation of startups.

(2) Main X Variables: Our primary independent variable (X) is the amount of WFH in each year of our sample period (2015-2023). We will use this variable to gauge the level of WFH across time periods and its potential impact on startup creation. Other optional variables, such as sentiment indicators regarding WFH preferences, may provide additional insight into the dynamics between WFH and startup development.

(3) Unit of Observation: The unit of observation for our regression model will be individual startups. Each startup within our dataset represents a unique case that contributes to our analysis. By examining the relationship between WFH and startups at the individual startup level, we aim to capture the nuances of this relationship and draw meaningful conclusions.

We are eager to receive your feedback and further guidance on our regression model specifications. Please feel free to tag us in your reply so that we can promptly address any additional questions or concerns.

Thank you for your continued support and assistance.

Best regards,

Danny Hubert, Josh Simon, Keith Cheung.

donbowen commented 4 months ago

The unit is a startup? How is that possible if the dependent variable is the number of startups? A startup always creates 1 startup...

Is the dependent variable the number of startups...

It depends on which datasource you're starting with. Your WFH variable (and most or all other X variables) should be measured at the same unit level.
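To make the "same unit level" point concrete, here is a minimal sketch, assuming the startup data arrives with one row per startup and the WFH data is already at the state-year level. Every column name (`startup_id`, `state`, `year`, `wfh_share`, `n_startups`) and every number is made up for illustration and is not from your datasets.

```python
import pandas as pd

# Hypothetical startup-level records: one row per startup (illustrative only).
startups = pd.DataFrame({
    "startup_id": [1, 2, 3, 4, 5],
    "state": ["PA", "PA", "NJ", "NJ", "PA"],
    "year": [2019, 2019, 2019, 2020, 2020],
})

# Hypothetical WFH data, already measured at the state-year level.
wfh = pd.DataFrame({
    "state": ["PA", "NJ", "PA", "NJ"],
    "year": [2019, 2019, 2020, 2020],
    "wfh_share": [0.05, 0.06, 0.30, 0.35],
})

# Collapse startups to the state-year unit so y becomes a count per state-year.
counts = (
    startups.groupby(["state", "year"])
    .size()
    .reset_index(name="n_startups")
)

# Merge on the shared unit; validate="1:1" raises if either side has duplicate keys.
panel = counts.merge(wfh, on=["state", "year"], how="inner", validate="1:1")
print(panel)
```

After the merge, each row is one state in one year, and both y (`n_startups`) and X (`wfh_share`) describe that same unit.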

Let's talk about this in class, before or after.

jls224 commented 4 months ago

We will be working with the number of startups by state, so the unit of observation is a state at a point in time. We are gathering datasets on startup frequency trends and on the percentage of WFH by state (some are at the city level, but we can include them by aggregating the data up to states), using both census data and data from Nick Bloom, to build our regression model. Instead of the specific number of people who work from home, we will use the percentage of each state's workforce that works from home in that year.
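As a rough sketch of what a state-by-year regression could look like, assuming a panel with one row per state and year: the data below is simulated, the column names (`wfh_share`, `n_startups`) are hypothetical, and the state and year fixed effects are an assumption on our part that has not been settled in this thread.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder panel: 5 states x 9 years (2015-2023). Not real data.
rng = np.random.default_rng(0)
states = ["CA", "NJ", "NY", "PA", "TX"]
years = range(2015, 2024)
panel = pd.DataFrame(
    [(s, y) for s in states for y in years], columns=["state", "year"]
)

# Hypothetical variables: share of the workforce that works from home, startup counts.
panel["wfh_share"] = rng.uniform(0.03, 0.40, len(panel))
panel["n_startups"] = 1000 + 500 * panel["wfh_share"] + rng.normal(0, 50, len(panel))

# OLS with state and year dummies (fixed effects are an assumption, see above).
model = smf.ols("n_startups ~ wfh_share + C(state) + C(year)", data=panel).fit()
print(model.params["wfh_share"])
```

With the real data, `panel` would be replaced by the merged state-year dataset, and any other X variables would be added to the formula at the same state-year level.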