donbowen commented 4 months ago

@Keith-Cheung @dannyhubert9 @jls224

This is a great and ambitious question. I like the start you have on data. A few thoughts below:

It would be preferable to have more years of data, ideally back to at least 2015. That might not be possible for all the variables you need, and you can make do if so. Pre-pandemic data is useful, and Nick Bloom has data to this effect.
Look up Nick Bloom's work on WFH to see about major issues and possible data sources. He has posted data.
This is NOT an R2 project. It's not prediction or machine learning. It's a relationship question. You'll use regression tables (like in Asgn 6) and figures.
So, since it's a regression study, please reply below with (1) your regression model, explaining the y variable, (2) the main X variables you want to include, and (3) what the unit of observation is. Tag me so I get a notification. After your reply, I'll provide more details.

For the revision, you'll:

update the data description to clarify things like the unit of observation and the data sources
add a section called Methodology where you will explain the regression (or regressions, a la Assignment 6).

This kind of project is probably best suited for a report/website, rather than a dashboard.

Some other things that might help:

For plotting your regression results, statsmodels has some built-in plotting functions. sm.graphics.plot_partregress plots how y is related to a single x variable, even if you have other X variables. The plot below is for the regression murder ~ hs_grad + urban + poverty + single:

import statsmodels.api as sm
import matplotlib.pyplot as plt
crime_data = sm.datasets.statecrime.load_pandas()
# if the dataset has too many dots, sample or use alpha
sm.graphics.plot_partregress(endog='murder', exog_i='hs_grad',
                             exog_others=['urban', 'poverty', 'single'],
                             data=crime_data.data, obs_labels=False)
plt.show()

If your dataset is huge, the other method for plotting is binsreg. It's a very smart way to visualize regressions, and the AER (a top econ journal) recently published a paper on it.

To run the code below, you'll need to install plotnine and binsreg and download the attached CSV:


################################################################################
# Binsreg: illustration file for plot
# Authors: M. D. Cattaneo, R. Crump, M. Farrell, Y. Feng and Ricardo Masini
# Last update: March  17, 2023
################################################################################

import pandas as pd
import statsmodels.formula.api as smf
from plotnine import *
from binsreg import *

######################################
###### Read the same data ############
###### used for STATA ################
######################################
data = pd.read_csv("binsreg_sim.csv")
data.describe().T

########################################################
###### EXTERNAL PLOT USING BINSREG OUTPUT ##############
########################################################

# Run binsreg (w is all the other variables you want in the regression)
est = binsreg('y', 'x', 'w', data=data, line=(3,3), ci=(3,3), cb=(3,3), polyreg=4)

# Extract the plotting information
result = est.data_plot[0]

# Create the figure to plot
fig = ggplot() + labs(x='X', y='Y')

# Add the dots
fig += geom_point(data=result.dots, mapping=aes(x='x', y='fit'), color="blue", size=2, shape='o')

# Add the line
fig += geom_line(data=result.line, mapping=aes(x='x', y='fit'), color="blue", size=0.5)

# Add the CI
fig += geom_errorbar(data=result.ci, mapping=aes(x='x', ymin='ci_l', ymax='ci_r'), color="blue", size=0.5, width=0.02, linetype='solid')

# Add the CB
fig += geom_ribbon(data=result.cb, mapping=aes(x='x', ymin='cb_l', ymax='cb_r'), fill="blue", alpha=0.2)

# Add the polyreg
fig += geom_line(data=result.poly, mapping=aes(x='x', y='fit'), color="red", size=0.5)

# Display the plot
print(fig)



The result:
![image](https://github.com/Keith-Cheung/FIN377-Final-Project/assets/50885867/a395eb22-7368-451c-ac29-bd0a8dbf571c)

[binsreg_sim.csv](https://github.com/Keith-Cheung/FIN377-Final-Project/files/15000859/binsreg_sim.csv)

dannyhubert9 commented 4 months ago

Hi @donbowen, Sorry for the delay! Here are the answers to your questions:

(1) Regression Model: We plan to construct a regression model to examine the relationship between the quantity of startups developed (dependent variable, y) and the number of people involved in Work From Home (WFH) over the years (independent variable, X). The y variable, representing the number of startups developed annually, will be our focal point, while the X variable, representing the size of WFH, will serve as our predictor. This model aims to elucidate how changes in WFH affect the creation of startups.

(2) Main X Variables: Our primary independent variable (X) is the amount of WFH from each year within our sample period (2015-2023). We will use this variable to gauge the level of WFH across different time periods and its potential impact on startup creation. Other optional variables, such as sentiment indicators regarding WFH preferences, may provide additional insights into the dynamics between WFH and startup development.

(3) Unit of Observation: The unit of observation for our regression model will be individual startups. Each startup within our dataset represents a unique case that contributes to our analysis. By examining the relationship between WFH and startups at the individual startup level, we aim to capture the nuances of this relationship and draw meaningful conclusions.

We are eager to receive your feedback and further guidance on our regression model specifications. Please feel free to tag us in your reply so that we can promptly address any additional questions or concerns.

Thank you for your continued support and assistance.

Best regards,

Danny Hubert, Josh Simon, Keith Cheung.

donbowen commented 4 months ago

The unit is a startup? How is that possible if the dependent variable is the number of startups? A startup always creates 1 startup...

Is the dependent variable the number of startups...

in the US? (obs unit = a year, or a month, etc) (impossible to answer your question with this, your dataset will be ~8 rows)
in a state? (obs unit = a state at a point in time, e.g. state-year)
in a metro?
in a country?

It depends on which data source you're starting with. Your WFH variable (and most or all other X variables) should be measured at the same unit level as the dependent variable.
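
A minimal sketch of what "same unit level" means in practice, assuming a state-year unit: both tables below are hypothetical, with made-up numbers and made-up column names (`n_startups`, `wfh_share`), and each has exactly one row per state-year so they can be merged one-to-one.

```python
import pandas as pd

# Hypothetical state-year startup counts (made-up numbers)
startups = pd.DataFrame({
    "state": ["PA", "PA", "NJ", "NJ"],
    "year": [2019, 2021, 2019, 2021],
    "n_startups": [100, 120, 80, 95],
})

# Hypothetical state-year WFH shares (made-up numbers)
wfh = pd.DataFrame({
    "state": ["PA", "PA", "NJ", "NJ"],
    "year": [2019, 2021, 2019, 2021],
    "wfh_share": [0.05, 0.18, 0.06, 0.20],
})

# validate="1:1" raises an error if either table repeats a state-year,
# which would mean the two sources are not at the same unit level
panel = startups.merge(wfh, on=["state", "year"], how="inner", validate="1:1")
print(panel)
```

If one source is at a finer level (say city-year), it has to be aggregated up to state-year before a merge like this will pass the check.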

Let's talk about this in class, before or after.

jls224 commented 4 months ago

We will be working on the number of startups by state, with the observation unit being a state at a point in time. To build our regression model, we are gathering datasets on startup frequency trends and on the percentage of WFH by state, using both census data and Nick Bloom's data. Some of this data is at the city level, which we can include by aggregating it up to states. We will not be using the specific number of people who work from home. Instead, we will use the percentage of each state's workforce that works from home in a given year.
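
A minimal sketch of the city-to-state cleanup described above, with hypothetical column names and made-up numbers (not the census or Nick Bloom data). It assumes the city-level source reports worker counts, so the state percentage can be computed from summed counts rather than by averaging city percentages, which would give small and large cities equal weight.

```python
import pandas as pd

# Hypothetical city-level rows (made-up numbers and column names)
city = pd.DataFrame({
    "state": ["PA", "PA", "NJ", "NJ"],
    "city": ["Philadelphia", "Pittsburgh", "Newark", "Jersey City"],
    "year": [2021, 2021, 2021, 2021],
    "workers": [700_000, 300_000, 120_000, 130_000],
    "wfh_workers": [140_000, 45_000, 18_000, 39_000],
})

# Sum the counts within each state-year, then take the ratio
state_year = city.groupby(["state", "year"], as_index=False)[
    ["workers", "wfh_workers"]
].sum()
state_year["wfh_share"] = state_year["wfh_workers"] / state_year["workers"]
print(state_year)
```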

Feedback on init proposal #1