This is a fun idea and topic, glad you chose it. The write up is nice. I have some follow up questions and things you should think about:
You're right: pre-2000 data is unnecessary.
Can you be more concrete with the visualizations you'll try to create and the regressions you'll end up running?
What is your final dataset (or datasets) going to look like? Is it a time series (one row of data for each point in time) or a panel (one point in time might have several rows of data, e.g. one per firm or industry)? In your write up, it seems like you'll have time series data. Note that a time series can still have many variables about each point in time.
I ask because what you can do with time series data can be limited.
Your questions seem to require panel data. Think about "Is an employee more likely to leave a company if they WFH?" If you just have a time series, all you can do is see whether employee turnover is higher as WFH increases in the overall economy. But a lot of things have changed over time, so it's hard to attribute the change in turnover to WFH trends. So one starting point is to compare whether entities (firms, industries, areas) with higher WFH than others have different retention patterns.
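As a rough sketch of that starting point, the Python below splits entities at the median WFH share within each year and compares average turnover across the two groups. Comparing within a year keeps economy-wide changes from driving the result. Everything here is an assumption for illustration: the column names (`industry`, `year`, `wfh_pct`, `turnover_pct`), the helper `compare_by_wfh`, and all of the numbers are made up and do not come from the proposal or from BLS.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical panel: one row per (industry, year). All values are made up.
panel = [
    {"industry": "Finance", "year": 2019, "wfh_pct": 20.0, "turnover_pct": 12.0},
    {"industry": "Retail", "year": 2019, "wfh_pct": 2.0, "turnover_pct": 30.0},
    {"industry": "Tech", "year": 2019, "wfh_pct": 30.0, "turnover_pct": 15.0},
    {"industry": "Construction", "year": 2019, "wfh_pct": 4.0, "turnover_pct": 25.0},
    {"industry": "Finance", "year": 2021, "wfh_pct": 60.0, "turnover_pct": 10.0},
    {"industry": "Retail", "year": 2021, "wfh_pct": 5.0, "turnover_pct": 33.0},
    {"industry": "Tech", "year": 2021, "wfh_pct": 75.0, "turnover_pct": 14.0},
    {"industry": "Construction", "year": 2021, "wfh_pct": 6.0, "turnover_pct": 27.0},
]


def compare_by_wfh(rows):
    """Within each year, split entities at the median WFH share and
    return the mean turnover of the high-WFH and low-WFH groups."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row)

    result = {}
    for year, group in sorted(by_year.items()):
        cutoff = median(r["wfh_pct"] for r in group)
        high = [r["turnover_pct"] for r in group if r["wfh_pct"] > cutoff]
        low = [r["turnover_pct"] for r in group if r["wfh_pct"] <= cutoff]
        result[year] = {"high_wfh": mean(high), "low_wfh": mean(low)}
    return result


comparison = compare_by_wfh(panel)
for year, groups in comparison.items():
    print(year, groups)
```

A gap between the two groups is only an association. Industries with high WFH differ from low-WFH industries in many other ways, so this is a first look, not a causal estimate.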
Acquired a starting dataset to which we will add additional variables (see inputs folder). It comes from the BLS National Compensation Survey and contains WFH % by industry.
Committed to panel data, in part because that's what the BLS dataset came as, but mostly in order to achieve concrete associations between WFH and other variables. Also committed to 2010-2022 time frame with an explanation as to why.
Outlined the general shape of our dataset
Visualizations: These will mostly be WFH vs. each of the variables that describe employee opportunity/performance. We will create regression models to assess the association of each variable with WFH. The proposal also describes some specific graphs we intend to create, including a heat map.
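One way the "one regression per variable" plan could look is sketched below: a simple least-squares line of each outcome on WFH. This is only an illustration under stated assumptions. The outcome names (`turnover_pct`, `wage_growth_pct`), the helper `ols_slope_intercept`, and every number are hypothetical, and the actual models would likely need controls or industry and year effects.

```python
from statistics import mean


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical industry-level rows; all values are made up.
rows = [
    {"wfh_pct": 10.0, "turnover_pct": 30.0, "wage_growth_pct": 2.0},
    {"wfh_pct": 20.0, "turnover_pct": 25.0, "wage_growth_pct": 2.5},
    {"wfh_pct": 30.0, "turnover_pct": 20.0, "wage_growth_pct": 3.0},
    {"wfh_pct": 40.0, "turnover_pct": 15.0, "wage_growth_pct": 3.5},
]

# Hypothetical outcome variables, each regressed on WFH separately.
outcomes = ["turnover_pct", "wage_growth_pct"]
wfh = [r["wfh_pct"] for r in rows]

fits = {
    name: ols_slope_intercept(wfh, [r[name] for r in rows])
    for name in outcomes
}

for name, (slope, intercept) in fits.items():
    print(f"{name}: slope={slope:.3f}, intercept={intercept:.3f}")
```

Each slope reads as the change in the outcome associated with a one-percentage-point higher WFH share, in this toy data only.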
This is a time series:
This is a panel dataset:
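To make the two labels above concrete, here is a minimal Python sketch of each shape. All column names, the helpers `rows_per_period` and `is_time_series`, and the values are made up for illustration. The only point is the row structure: a time series has one row per year, while a panel has one row per entity per year.

```python
from collections import Counter

# Time series: exactly one row per point in time, possibly many variables.
time_series = [
    {"year": 2019, "wfh_pct": 6.0, "turnover_pct": 20.0},
    {"year": 2020, "wfh_pct": 35.0, "turnover_pct": 24.0},
    {"year": 2021, "wfh_pct": 25.0, "turnover_pct": 22.0},
]

# Panel: several rows per point in time, one for each entity (here, industry).
panel = [
    {"year": 2019, "industry": "Finance", "wfh_pct": 20.0, "turnover_pct": 12.0},
    {"year": 2019, "industry": "Retail", "wfh_pct": 2.0, "turnover_pct": 30.0},
    {"year": 2020, "industry": "Finance", "wfh_pct": 70.0, "turnover_pct": 11.0},
    {"year": 2020, "industry": "Retail", "wfh_pct": 8.0, "turnover_pct": 35.0},
]


def rows_per_period(rows, time_key="year"):
    """Count how many rows each time period has."""
    return dict(Counter(r[time_key] for r in rows))


def is_time_series(rows, time_key="year"):
    """True when every time period appears exactly once."""
    return all(n == 1 for n in rows_per_period(rows, time_key).values())


print(rows_per_period(time_series), is_time_series(time_series))
print(rows_per_period(panel), is_time_series(panel))
```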