Only the first position is parsed for any employees with multiple positions

daliposc / osu-salaries

parse salary .pdf into pandas df to analyze salary by department type

1 stars 1 forks source link

Only the first position is parsed for any employees with multiple positions #2

Open Neato-Nick opened 9 months ago

Neato-Nick commented 9 months ago

I think the root cause of this is the same as [1]. Basically, that because the positions aren't line-separated, only the first one is read. This one will be a harder solve though because the second entry doesn't have the personal info section with name and such.

This is more important for determining full professors effective salary. For example, dept heads and deans are likely earning the higher salary even salaries for both that job and their professor job are listed.

See the Note at the top of the personnel definitions page:

Note: An employee may have more than one job, but may not be receiving pay from all jobs.

Example:

daliposc commented 9 months ago

How should these data be included in the output table? A few options:

Treat each job as a separate entry, meaning multiple rows for a single employee
Merge the jobs together, perhaps changing job_type to a list and summing the salaries. Then choose which dates to keep based on some criteria, like keep the oldest date.
Include additional columns to handle extra jobs

I lean towards summing the salaries from multiple jobs because that is the primary interest of this dataset, but am not sure best to handle the multiple titles, dates, etc.

Neato-Nick commented 9 months ago

I'm personally a fan of solution 1 because it's friendly to the "long" data format. I think it's best to avoid summing the salaries since we don't know who receives multiple salaries vs. who receives only one of them. Regardless of whether they're receiving multiple salaries, they still had to bargain each one of them individually at some point. In my mind, that makes them separate data points and therefore unique rows.

Solution 3 would also work, I could easily use the additional columns to split the data into more rows like I prefer to analyze.

daliposc commented 9 months ago

Good point. It doesn't make sense to merge or aggregate data when producing the dataset.

Perhaps best to implement both 1 and 3? With 1, we add new rows for each job. With 3, we add columns for each job. And give the option between the two.

When adding rows, introduce a job_count column.

When adding columns, do so dynamically -- if no employee has multiple jobs the output table will have no additional columns. If just one employee has four, there will be three additional columns.

Neato-Nick commented 9 months ago

Ohh I didn't think about needing to dynamically add more columns for any number of 1+n jobs. That makes me personally like solution 1 even better tbh. Still, solution 3 sounds pretty fun so you can go for it, since you're the one doing the heavy lifting with the parser. I can still very easily convert any number of n+ infinity columns to additional rows using R to fill my solution 1 desires.

daliposc commented 9 months ago

Method 1 for sure then!

question tho: if we transpose the table 90 degrees, does 1 becomes 3 and 3 becomes 1?

maybe can implement both with one function? transpose 90°, add new rows under certain conditions, transpose -90°

i am honestly pretty new to dealing with tables like this so let me know if i am way off base