Normalizing data - Githubissues

SamAI-Software commented 8 years ago

Are we planning to normalize data for questions with open answers? For example, 5. About how much money do you expect to earn per year at your first developer job (in US Dollars)?

Avg.$ = $53K per year, but some answers are $800K and more. If we set an upper bound of $200K per year, then avg.$ = $48K per year.
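
To illustrate how much a single extreme answer can move the average, here is a minimal sketch in R. The numbers are toy values, not survey data, and the name `cap` is just for illustration. It shows both readings of "upper bound": dropping answers above the bound, or clamping them to it.

```r
# Toy values for illustration only, not taken from the survey.
salary <- c(40000, 50000, 60000, 800000)
cap <- 200000  # upper bound discussed above

raw_mean     <- mean(salary)                 # pulled up by the outlier
dropped_mean <- mean(salary[salary <= cap])  # answers above the bound removed
clamped_mean <- mean(pmin(salary, cap))      # answers above the bound set to the bound
```

Which of the two variants to use is one of the conditions we would need to agree on.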

Also, some answers are written with "K" ($70K) and some as full numbers ($70000). And some answers are too small to be annual, but plausible as monthly figures. In many countries people never quote annual salaries, only monthly ones. So lots of people were probably confused and wrote their monthly expectations.
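
A rough sketch of how such answers could be parsed in R. The helper name `parse_salary` and the `monthly_cutoff` of 10000 are assumptions made up for this example; treating everything under the cutoff as a monthly figure is only a guess and would need to be agreed on.

```r
# Hypothetical helper: turn free-text salary answers into annual US dollars.
parse_salary <- function(x, monthly_cutoff = 10000) {
  cleaned <- toupper(gsub("[$, ]", "", x))   # drop "$", commas and spaces
  has_k <- grepl("K$", cleaned)              # answers written like "70K"
  value <- suppressWarnings(as.numeric(sub("K$", "", cleaned)))
  value[has_k] <- value[has_k] * 1000
  # Assumption: anything below the cutoff was meant as a monthly figure.
  monthly <- !is.na(value) & value < monthly_cutoff
  value[monthly] <- value[monthly] * 12
  value                                      # unparseable answers stay NA
}

parsed <- parse_salary(c("$70K", "70000", "3,000", "n/a"))
# parsed is 70000, 70000, 36000, NA
```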

I did some normalizing, but in the end the average expectation didn't change a lot. Original avg.$ = $53K/year VS $52K/year for normalized data.

Normalizing is a good practice, but it didn't change much in this example, so are we planning to do it? If yes, then we need to agree on conditions.

2016FCC_moneyAnnual_v.0.2.xlsx

evaristoc commented 8 years ago

Hi @SamAI-Software. We have already done some parsing, but could you please QA the recent changes by downloading the resulting file and pointing out any discrepancies that still exist?

Please contact @erictleung to confirm a copy of the file is available.

cc. @erictleung

erictleung commented 8 years ago

@SamAI-Software thanks for looking at the data. I did notice this and started cleaning up the salaries, removing dollar signs and averaging the ranges I found. I see you've already found my PR on this over in #29. I think we can close this issue and just carry the conversation over there.
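
For readers following along, averaging a range could look roughly like this in R. This is not the code from the PR, just a minimal sketch; the helper name `range_midpoint` is made up, and it assumes ranges are written as "low-high".

```r
# Hypothetical helper: replace a "low-high" answer with its midpoint.
range_midpoint <- function(x) {
  cleaned <- gsub("[$, ]", "", x)               # drop "$", commas and spaces
  parts <- strsplit(cleaned, "-", fixed = TRUE) # split each answer on the dash
  # A single number has one part, so its mean is the number itself.
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

midpoints <- range_midpoint(c("$50,000-$60,000", "45000"))
# midpoints is 55000, 45000
```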

SamAI-Software commented 8 years ago

@erictleung cool, closing this issue.

SamAI-Software commented 8 years ago

@erictleung, I'm reopening this issue because the PR has already been merged.

Everything seems to be good except for 3 variables: CommuteTime, HomeMortgageOwe, StudentDebtOwe.

I investigated CommuteTime a bit.

> NROW(data.Learn[!is.na(data.Learn$CommuteTime)&data.Learn$CommuteTime>300,])
[1] 84
> NROW(data.Learn[data.Learn$LanguageAtHome=="English"&!is.na(data.Learn$CommuteTime)&data.Learn$CommuteTime>300,])
[1] 13
> NROW(data.Learn[data.Learn$LanguageAtHome!="English"&!is.na(data.Learn$CommuteTime)&data.Learn$CommuteTime>300,])
[1] 71
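
One caveat with counting this way: if the condition evaluates to NA (for example when LanguageAtHome is missing), indexing returns an all-NA row and NROW still counts it. A sketch of an NA-safe count using `sum()`, on a made-up data frame `toy` standing in for data.Learn:

```r
# Toy data frame standing in for data.Learn; all values are made up.
toy <- data.frame(
  CommuteTime    = c(480, 30, NA, 600, 480),
  LanguageAtHome = c("Spanish", "English", "English", NA, "English")
)

over <- !is.na(toy$CommuteTime) & toy$CommuteTime > 300  # more than 5 hours

n_over        <- sum(over)
n_english     <- sum(over & toy$LanguageAtHome == "English", na.rm = TRUE)
n_not_english <- sum(over & toy$LanguageAtHome != "English", na.rm = TRUE)
n_unknown     <- sum(over & is.na(toy$LanguageAtHome))   # language not given
```

Here the three groups add up to `n_over`, so rows with a missing language are counted exactly once.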

There are 84 answers of more than 5 hours (300 minutes), and 71 of them come from people who don't speak English at home, so most likely non-native English speakers. The reason I filtered by language is:

"commute" is not a common word; I remember having to google its meaning myself, as I had never seen it before;
there are more people who answered 8 hours than 5, 6 or 7 hours, which is very weird;
95% of those who typed 8 hours don't speak English with their families;
8 hours is an average working day;
"commute" looks similar to the word "commit".

So my bet is that many non-native speakers misunderstood the question:

About how many minutes total do you spend commuting to and from work each day?

And they thought that we were asking how long their working day is.

So for CommuteTime I suggest turning all answers greater than 300 (5 hours) into NA rather than capping them at 300, because we have no idea what their real commute time is: they either misread the question or entered a totally unrealistic number like 600 or 1000.

For HomeMortgageOwe and StudentDebtOwe we just need min and max values, because mortgages like "35" or "10 000 000" don't look trustworthy. I guess a min. value of $1000 for both HomeMortgageOwe and StudentDebtOwe should be good to go. It makes sense to turn answers below the min. value into NA. As for the max. value, I have no idea, you probably know better. But I doubt that it's more than $1M for a mortgage or $500K for education. It makes sense to set answers greater than the max. value to the max. value.

Summary:

CommuteTime: >300 cut off into NA
StudentDebtOwe: <$1000 cut off into NA; >$500 000 set to $500 000
HomeMortgageOwe: <$1000 cut off into NA; >$1 000 000 set to $1 000 000
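
The rules in this summary could be applied roughly as follows. This is only a sketch: the function name `apply_cutoffs` and the data frame `toy` are made up for the example, and the thresholds are the proposed ones, not agreed values.

```r
# Hypothetical helper applying the proposed cut-offs to a data frame.
apply_cutoffs <- function(df) {
  # CommuteTime: answers above 300 minutes become NA (not 300).
  df$CommuteTime[which(df$CommuteTime > 300)] <- NA

  # Proposed maximum per debt variable.
  caps <- c(StudentDebtOwe = 500000, HomeMortgageOwe = 1000000)
  for (col in names(caps)) {
    df[[col]][which(df[[col]] < 1000)] <- NA      # too small to be trusted
    df[[col]] <- pmin(df[[col]], caps[[col]])     # clamp to the maximum, NA stays NA
  }
  df
}

# Toy rows for illustration only.
toy <- data.frame(
  CommuteTime     = c(45, 480, NA),
  StudentDebtOwe  = c(35, 20000, 750000),
  HomeMortgageOwe = c(10000000, 500, 150000)
)
cleaned <- apply_cutoffs(toy)
```

`which()` is used so that rows that are already NA are simply skipped.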

erictleung commented 7 years ago

Sorry, this is long overdue. I've got most of the code ready. I was going to try to get an updated data dictionary in at the same time, but maybe I should just settle the data first and do the data dictionary later.

I'm away from my primary development environment until next week, so I'll try to get a PR in by the end of next week dealing with these normalization issues. I also found and fixed some spelling mistakes.

SamAI-Software commented 7 years ago

@erictleung cool, but also keep in mind that the data dictionary with missing variables was already PR-ed by @M0nica: https://github.com/FreeCodeCamp/2016-new-coder-survey/pull/49/commits/0789b4f5c2db8919a9220ec7847ed29986e2bde6

freeCodeCamp / 2016-new-coder-survey

Normalizing data #33