Open Neato-Nick opened 9 months ago
How should these data be included in the output table? A few options:
job_type
to a list and summing the salaries. Then choose which dates to keep based on some criteria, like keep the oldest date.I lean towards summing the salaries from multiple jobs because that is the primary interest of this dataset, but am not sure best to handle the multiple titles, dates, etc.
I'm personally a fan of solution 1 because it's friendly to the "long" data format. I think it's best to avoid summing the salaries since we don't know who receives multiple salaries vs. who receives only one of them. Regardless of whether they're receiving multiple salaries, they still had to bargain each one of them individually at some point. In my mind, that makes them separate data points and therefore unique rows.
Solution 3 would also work, I could easily use the additional columns to split the data into more rows like I prefer to analyze.
Good point. It doesn't make sense to merge or aggregate data when producing the dataset.
Perhaps best to implement both 1 and 3? With 1, we add new rows for each job. With 3, we add columns for each job. And give the option between the two.
When adding rows, introduce a job_count
column.
When adding columns, do so dynamically -- if no employee has multiple jobs the output table will have no additional columns. If just one employee has four, there will be three additional columns.
Ohh I didn't think about needing to dynamically add more columns for any number of 1+n jobs. That makes me personally like solution 1 even better tbh. Still, solution 3 sounds pretty fun so you can go for it, since you're the one doing the heavy lifting with the parser. I can still very easily convert any number of n+ infinity columns to additional rows using R to fill my solution 1 desires.
Method 1 for sure then!
question tho: if we transpose the table 90 degrees, does 1 becomes 3 and 3 becomes 1?
maybe can implement both with one function? transpose 90°, add new rows under certain conditions, transpose -90°
i am honestly pretty new to dealing with tables like this so let me know if i am way off base
I think the root cause of this is the same as [1]. Basically, that because the positions aren't line-separated, only the first one is read. This one will be a harder solve though because the second entry doesn't have the personal info section with name and such.
This is more important for determining full professors effective salary. For example, dept heads and deans are likely earning the higher salary even salaries for both that job and their professor job are listed.
See the Note at the top of the personnel definitions page:
Example: