TSFelg / fairly

Fairly is a tool to help tech workers residing in Portugal know if they're being paid fairly.

Scope applications for Teamlyzer data #2

Open TSFelg opened 3 years ago

TSFelg commented 3 years ago

Teamlyzer has an open database with several open datasets of salaries in Portugal (not only tech). It would be interesting to scope how this data can be used in Fairly. It could be used simply to train the model with more data, or to expand the coverage to job types beyond tech.

ghost commented 3 years ago

We can add a link to, or embed, the current URL of Fairly to increase the visibility of this tool. Or maybe some type of integration, since we are working on the same problem.

There are also thousands of Portuguese salaries shared in Stack Overflow surveys or knowyourworth; the problem, as always, is the normalization.

TSFelg commented 3 years ago

That's quite interesting, I have to take a look into the data, but eventually some type of integration would make sense.

Also, the extra visibility would be appreciated :)

TSFelg commented 3 years ago

Also, would you mind elaborating a bit more when you say the problem is the normalization? I imagine it's the fact that different datasets collect different variables, but would appreciate your feedback on the typical issues you face.

ghost commented 3 years ago

> Also, would you mind elaborating a bit more when you say the problem is the normalization? I imagine it's the fact that different datasets collect different variables, but would appreciate your feedback on the typical issues you face.

Yeah, each survey has a different structure. Take seniority: some surveys use years of experience, others use senior, junior, middle, and so on.

The same goes for role: a back-end golang developer earns much more than a back-end php developer, so converting both to "back-end developer" ignores this kind of detail.

And from my experience, all datasets always need some manual validation, especially surveys with open fields, to check for potentially fake data like junior | 150k | lisbon
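To make the seniority mismatch concrete, here is a minimal sketch of mapping both styles of answer onto one scale. Everything in it is an assumption for illustration: the function names, the alias table, and especially the year thresholds, which are not taken from any of the surveys mentioned.

```python
from typing import Optional

# Hypothetical alias table: free-text labels mapped to one canonical scale.
SENIORITY_ALIASES = {
    "jr": "junior",
    "junior": "junior",
    "mid": "middle",
    "middle": "middle",
    "sr": "senior",
    "senior": "senior",
}


def seniority_from_years(years: float) -> str:
    """Bucket years of experience. The cut-offs (2 and 5) are assumptions."""
    if years < 2:
        return "junior"
    if years < 5:
        return "middle"
    return "senior"


def normalize_seniority(
    label: Optional[str] = None, years: Optional[float] = None
) -> Optional[str]:
    """Prefer an explicit label, fall back to years, else report unknown."""
    if label:
        canonical = SENIORITY_ALIASES.get(label.strip().lower())
        if canonical is not None:
            return canonical
    if years is not None:
        return seniority_from_years(years)
    return None  # nothing usable: leave it for manual review
```

The same idea does not carry over cleanly to roles, since collapsing labels there throws away exactly the language/framework detail that drives the salary difference.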

TSFelg commented 3 years ago

Thanks, that's great info! I think a lot of those are very interesting machine learning challenges, so I'm quite excited to try and tackle them :)

For example, it's possible to formulate a modelling strategy that can leverage both data which only specifies back end and data that specifies the languages/frameworks. Some outlier detection can also help to detect those types of fake cases, not necessarily automatically, but at least by making them stand out so that a human can confirm whether it's bad data or not.
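As a rough sketch of the "make them stand out" idea, the snippet below flags salaries that sit far from the median of their seniority group, using a median absolute deviation (MAD) score. This is just one simple option, not what Fairly does; the row layout, the function name, the 3.5 threshold, and the sample figures are all made up for illustration.

```python
from collections import defaultdict
from statistics import median


def flag_for_review(rows, threshold=3.5):
    """Return rows whose salary is far from the median of their seniority group.

    Uses the modified z-score 0.6745 * |x - median| / MAD. Rows are only
    flagged, never dropped, so a human can confirm whether they are bad data.
    """
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["seniority"]].append(row["salary"])

    # Per-group median and MAD, computed once.
    stats = {}
    for group, salaries in by_group.items():
        med = median(salaries)
        mad = median([abs(s - med) for s in salaries])
        stats[group] = (med, mad)

    flagged = []
    for row in rows:
        med, mad = stats[row["seniority"]]
        if mad == 0:
            continue  # no spread in this group, the score is undefined
        score = 0.6745 * abs(row["salary"] - med) / mad
        if score > threshold:
            flagged.append(row)
    return flagged


# Made-up sample rows, including the junior | 150k | lisbon case.
sample = [
    {"seniority": "junior", "salary": 18000, "city": "porto"},
    {"seniority": "junior", "salary": 20000, "city": "lisbon"},
    {"seniority": "junior", "salary": 21000, "city": "braga"},
    {"seniority": "junior", "salary": 22000, "city": "lisbon"},
    {"seniority": "junior", "salary": 150000, "city": "lisbon"},
    {"seniority": "senior", "salary": 45000, "city": "lisbon"},
    {"seniority": "senior", "salary": 50000, "city": "porto"},
    {"seniority": "senior", "salary": 55000, "city": "lisbon"},
]
suspicious = flag_for_review(sample)
```

With this sample only the 150k junior row ends up in `suspicious`. Grouping by seniority matters here: 150k may be plausible in some groups, so the check is relative to comparable rows rather than a global cut-off.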

ghost commented 3 years ago

> modelling strategy that can both leverage data which only specifies back end as well as data that specifies the languages/frameworks

In that case I think you need some Named Entity Recognition (NER) framework. Maybe this paper can be helpful.

These guys are doing awesome work with NER: https://www.glasssquid.io/try-analyze