[WIP] Add script to clean and combine data, and add data

erictleung commented 8 years ago

cc/ @QuincyLarson @evaristoc Feel free to comment on aspects of the changes I'll be making. I figured it would be easier and faster to get feedback by using GitHub's feature to comment on PR changes.

Closes #26

Checklist

[x] Rename column names
[x] Group code that work on a single question into separate function
[x] Clean first dataset
- [x] Which role are you most interested in?
- [x] Expect earnings as first developer job
- [x] Coding Events marked as "Other"
- [x] Podcasts marked as "Other"
- [x] Number of hours spent learning
- [x] "Number of months programming"
- [x] Months ago finished bootcamp
- [x] Salary after bootcamp
- [x] Money used for learning
[x] Clean second dataset - Below are those which will require effort to clean i.e. not just categorical or boolean
- [x] "How old are you?"
- [x] University major, if applicable
- [x] Mortgage amount
- [x] Employment status for other
- [x] Which field do you work in? (Possibly)
- [x] Money made last year
- [x] Minutes of commute
[x] Polish data for finishing e.g. remove inconsistent data
[x] Combine data into one
[x] Update root README with information on the data
[x] Rename original data/ directory to raw-data/
[ ] Update README on the cleaned data

Commit Message

Update survey data dictionary with left out questions
Update survey data dictionary with variable/column names for questions
Add script clean-data.R to clean and combine the two survey datasets into one for ease of analysis
Create the combined survey dataset after running clean-data.R
Create README.md file to explain cleaned data and the script to produce it

QuincyLarson commented 8 years ago

@erictleung Rather than giving people a script, I say we just give them the cleaned data

So if you can run your script and verify it worked, then we should remove the old csv files and replace them with your unified (and cleaned) csv file

You can commit the R script if you want for archival purposes, but I think 99.9% of the people going to the repo will just want a polished final CSV - they won't care as much about the details of our implementation

SamAI-Software commented 8 years ago

Both variants would be good to have in case of any bugs

evaristoc commented 8 years ago

@SamAI-Software about your question above:

Firstly, we all should agree on one thing. Should we just cutoff (into blank) all weird numbers or should we guess what was the real intention and then try to normalize it? And it's not only about expected salary, but about all questions with open answers with numbers.

Yes, even if that will be arbitrary. The most rigorous option is missing or "outliering".

Similarly I have been in communication with @erictleung about:

Both variants would be good to have in case of any bugs

My proposal has been to supply different levels of files:

Raw datasets
Totally Clean dataset
Annex datasets

The Totally Clean is ours with the whole parsing + our arbitrary interpretations of the meaning of the values. Annex datasets could be intermediate ones containing unchanged values for some of the variables that asked for the most of the arbitrary changes, for example all open questions like "Other". See an example at: https://github.com/evaristoc/2016-new-coder-survey/blob/clean-and-combine-data/clean-data/factors_CodeEventOther

This kind of files will preserve part of the "information" we will have to get rid of when cleaning the data. A person more interested in that additional information could revisit those Annex datasets and build a new dataset if desired.

The key is to provide metadata dictionaries describing the changes.

I have been commenting to @erictleung about the need to maintain consistency:

in naming
in variable types per set of questions
the nesting
the identification of missing values
etc.

The lesser the number of inconsistencies found in the Totally Clean dataset, the better. Also, an important aspect is to provide a robust metadata file as much as we can.

evaristoc commented 8 years ago

@QuincyLarson I am not sure if you agree with keeping several files? As owner of the project, you might have the final decision. I understand that for you the best is to keep the last file ONLY, but be aware that our decisions when cleaning, even if well guided or well intended, will be always arbitrary ones, and they will risk information that someone could find interesting.

Whatever the case I will always insist in proper metadata dict and change file.

SamAI-Software commented 8 years ago

My proposal has been to supply different levels of files:

Raw datasets

Totally Clean dataset

Annex datasets

@evaristoc sounds great!

evaristoc commented 8 years ago

Quick before having to go: @SamAI-Software about maintaining a max. of 100 hours (weekly): I agree. Also, about all questions with the "Other" open answer option:

the idea is to vectorise/factorise the categories; the "Other" options SHOULD disappear or contain Unknowns in the worst case.

I have commented this to @erictleung. Multi-answered question and or an Other option should be vectorised instead of giving a Other in categorical format and the related questions in Boolean.

@erictleung this is the current challenge we are facing with those questions. Also, as @QuincyLarson suggested it would be better if we give a completely digested file to users. Totally agree:

Final user shouldn't be bothered in trying to normalize / parse any values: only in directly making data representations/visualizations; otherwise asking someone to work on some particular values would be painstaking.

Our goal should be to parse and vectorize the values, even if we have to make arbitrary decisions. Those arbitrary decisions would be always documented though.

I personally agree that the SIMPLEST and probably the BEST change file we can suggest is in fact YOUR code, @erictleung, and likely this thread.

evaristoc commented 8 years ago

@QuincyLarson @SamAI-Software

I wrote to @erictleung :

I gave it a second thought and realised that even trying a better categorisation of the variables

PodcastOther
CodeEventOther
ResourceOther

is not necessarily informative enough. For example: the name I am giving is mostly arbitrary: would that name help the user to identify and find the resource online if desired?

Some answers are not easy to solve. For example: There are cases where some people reported attending "meetups" without specifying what for meetup, while other ones specified a name of specific meetups, but it is a meetup at the end. So

how shall we define those categories in order to provide enough info without trying to go too far with detailed naming?

Considering the quality of the data and the difficulties to take clearcut decision about how to operationalize some of the responses, I think that we are better off by NOT trying to vectorize all that info of the aforementioned variables. Otherwise we could end up unnecessarily obfuscating the Totally Clean Dataset.

In order to support the users what we can do is offering Annexes in a similar form as the following: https://github.com/evaristoc/2016-new-coder-survey/blob/clean-and-combine-data/clean-data/factors_EventsOtherDrafted.csv

with a tentative, partial operationalization, without cross-comparison (there are categories that users tended to repeat between questions). The user could use those tentative, informal definitions while still invited to propose a personal one that could better work for her analysis.

SamAI-Software commented 8 years ago

@erictleung latest dataset (10h ago) looks good :+1:

The only question is about consistency - why booleans are sometimes factors and sometimes integers?

Integer booleans: IsSoftwareDev, JobRelocate, BootcampYesNo, BootcampFinish, BootcampFullJobAfter, BootcampLoan, BootcampRecommend, CodeEventCoffee, CodeEventHackathons, CodeEventConferences, CodeEventNodeSchool, CodeEventRailsBridge, CodeEventStartUpWknd, CodeEventWomenCode, CodeEventGirlDev, CodeEventNone, PodcastCodeNewbie, PodcastChangeLog, PodcastSEDaily, PodcastJSJabber, PodcastNone

Factor booleans: ResourceEdX, ResourceCoursera, ResourceFCC, ResourceKhanAcademy, ResourcePluralSight, ResourceCodeacademy, ResourceUdacity, ResourceUdemy, ResourceCodeWars, ResourceOdinProj, ResourceDevTips,

And why ExpectedEarning, HoursLearning, MonthsProgramming are integer, while BootcampPostSalary, MoneyForLearning, BootcampMonthsAgo are numeric?

Other bugs I'll comment as usually on the code later on today, but the data is already looking pretty clean and shiny :)

erictleung commented 8 years ago

@SamAI-Software they are different because of how they are inherently read into R (I'm assuming you're using R to read them in).

I still need to do a pass over all of the variables and force a certain data type. I'll have to double check the integer and numeric values. I think it has to do with some values being 0.0 or something with a decimal point.

evaristoc commented 8 years ago

@SamAI-Software which were your conventions for HoursLearning at #40?

Not sure if added to the datasets, @erictleung?

summary(as.factor(part1$HoursLearning))

Give some good but also weird values (a few though): one 2 0.2 1 0.5 1 .1 1 100000000000000 1 100 hours per week 1 10-15 1 12321231231232123123123123123123123 1 14 hours 1 15-20 1 2-20 1 .25 1 2.5 1 300000000000000000000 1 3-4 1 40-50 1 4-6 1 5-7 1 5-8 1 6-8 1 (Other) 11 NA 788 #

Just looking at the datasets I didn't find any changes...

evaristoc commented 8 years ago

@erictleung @SamAI-Software

we need some convention for CommuteTime. Several people decided to give time in minutes instead of hours (!!!).

evaristoc commented 8 years ago

@erictleung :

Age: two records over 100 in part2

SamAI-Software commented 8 years ago

@evaristoc are you sure that you take data here?

str(as.factor(data.Learn$HoursLearning)) summary(as.factor(data.Learn$HoursLearning))

I have no problems, 73 levels from 0 to 100

which were your conventions for HoursLearning at #40?

## Remove the word "hour(s)" ## Remove hyphen and "to" for ranges of hours ## Remove hours greater than 100 hours And of course round decimal numbers

we need some convention for CommuteTime. Several people decided to give time in minutes instead of hours (!!!).

Yes, you are absolutely right, we need many conventions for the second dataset, feel free to find all weird answers and suggest your solutions, because today I'll be focused on #41 and on first dataset, there are still bugs to be fixed.

SamAI-Software commented 8 years ago

@erictleung can you, please, upload here the second "clean" dataset with new variable (column) names? Even though it's still "dirty", but it's better, so people can use new names to start creating their visualizations.

As I can see you already made a merge function :+1: So may be just merge 2 datasets together and upload it? Anyway the first past is not totally clean yet.

evaristoc commented 8 years ago

@SamAI-Software @erictleung

About my differences for Age and HoursLearning variables:

I am running what I think is the last clean-data.R on original files (before the last update by Quincy) locally.

I checked data of the part2 file separately.

Should I join both files and then check those values? Maybe use the last raw data update instead? (in theory should be the same for both...)

SamAI-Software commented 8 years ago

@evaristoc oh, I see! It didn't work for me either, but @erictleung said that it's okey, because script is not ready yet and he runs it line-by-line (function-by-function), so I just always grab latest clean data here.

evaristoc commented 8 years ago

@SamAI-Software Is it not that file the Part1 of the survey only? I was checking Part2...

evaristoc commented 8 years ago

@erictleung @SamAI-Software :

Working on my proposal for naming conventions of the many resources provided by respondents through the "Other" question:

Podcast: https://github.com/evaristoc/2016-new-coder-survey/blob/clean-and-combine-data/clean-data/factors_PodcastOtherDrafted.csv

-->> Ordered by PodcastOther column.

This is going to be kept fuzzy so please check and make observations but try not to be too rigorous.

One important thing to check should be that the user should be able to "easily" couple this info with the main file when applicable, so please let's agree on data format conventions?

SamAI-Software commented 8 years ago

@erictleung about part 1. I don't see any normalization on some variables in the code, so comment here.

BootcampMonthsAgo - luckily we don't have any weird values, but this R code might be used as a foundation for the next survey 2017 or by other parties. So how about adding some normalization?

same as for MonthsProgramming (round decimal, no char, no "-", etc)
max. value = 240

BootcampPostSalary - same story, but here we also have many weird values like 4, 10, 15, probably ppl mean 4 k, 10 k, 15 k, etc.

add all normalizations from ExpectedEarning

MoneyForLearning - also many weird values

add some normalizations from ExpectedEarning ("k", "to", decimal, "-", no char)
min. value = $6000?
max. value = $200000?

Max. and min. values are totally up to you, because I have no idea how much ppl in USA usually borrow for that. And it's mostly USA respondents who do that.

erictleung commented 8 years ago

@SamAI-Software noted. I'll try to make those changes after work today. Been busy working on joining the datasets together.

SamAI-Software commented 8 years ago

@erictleung cool! And please don't forget to upload the merged version

erictleung commented 8 years ago

@SamAI-Software @evaristoc @QuincyLarson sorry I haven't responded here lately. Here's an update.

Table Joining Update:

I have been working hard on joining the datasets. I managed to retain 15,635 responses out of a maximum total 15,653 responses (i.e. we lost 18 responses).

Create New Columns for "Other" Values:

Another breakthrough/suggestion I have is dealing with the "other columns" (e.g. CodeEventOther). Initially, I was trying to normalize the text within the value cells themselves. Instead of modifying the values themselves, why not create a new column indicating that option? @evaristoc I think we may have discussed this briefly but I couldn't come up with a solution until now.

For example, @SamAI-Software has noted that for PodcastOther, there have been a noticeable number of people saying they use "Ruby Rogues." Instead of finding the values and changing them, why not make another column "PodcastRubyRogues" with "1" indicating someone chose it.

This method prevents any bias from my current method of replacing an entire list of "Other" terms and retains all the possibly important information. Secondly, this will allow us to quickly group similar terms together that we can subjectively rate as being notable. Thirdly, this will create a much "cleaner" dataset, albeit missing small low frequency details.

We can then label all other rare/low frequency choices as "Other." I propose we deem responses that appear more than 5 times as "notable" and worth giving a column. I'm open to suggestions on a threshold.

I've created a function search_and_create() to do just this.

Misc Comment Responses:

Here are my responses to misc comments:

@evaristoc Some answers are not easy to solve. For example: There are cases where some people reported attending "meetups" without specifying what for meetup, while other ones specified a name of specific meetups, but it is a meetup at the end.

For the "meetup" challenge, I propose if they specify "Free Code Camp Meetup XXXX" (or something similar), then we will categorize that as specifically a "CodeEventFCCMeetup." We can discuss groupings as a case-by-case basis. The more specific we can be first while retaining a "significant" number of responses, I think is good.

@SamAI-Software The only question is about consistency - why booleans are sometimes factors and sometimes integers?

Worked on joining the datasets, so I haven't made time to work on this yet. This will probably be done at the end as a polishing step to the data.

@evaristoc we need some convention for CommuteTime. Several people decided to give time in minutes instead of hours (!!!).

Yes, we should address this. Some things to consider:

The ones with hours | h we can easily convert.
I feel we should remove outliers, but there are outlier-like responses like 600 minutes (= 10 hours = 5 hours one way), but there are 8 responses for it.
Reading the question more carefully, "About how many minutes total do you spend commuting to and from work each day?", I feel this could be misinterpreted as one way commute time if read too quickly.

@evaristoc Age: two records over 100 in part2

Thanks, I'll still need to remove those. They were part of the second part, which I haven't really focused on but will be pretty easy to fix.

@SamAI-Software BootcampMonthsAgo

I think we can write the code to normalize this when the time comes. Right now might not be the best use of our time.

@SamAI-Software BootcampPostSalary

Yes, I still need to work on this as well. But when I looked, there weren't any weird values.

@SamAI-Software MoneyForLearning min. value = $6000? max. value = $200000?

I'm okay with the maximum value but the minimum value I feel should be $0, as some people might have solely used online resources and doesn't seem too unbelievable to happen these days.

SamAI-Software commented 8 years ago

@SamAI-Software MoneyForLearning min. value = $6000? max. value = $200000?

@erictleung I'm okay with the maximum value but the minimum value I feel should be $0, as some people might have solely used online resources and doesn't seem too unbelievable to happen these days.

Yeah, sure, min = $0! My fault, I thought it was a sub-question from student loans

SamAI-Software commented 8 years ago

@erictleung, thanks for combined data (draft)! But I have one important question. What is ID.x and ID.y? And why they are different? So, each row is surely 1 unique person, isn't it? Or is it 2 different persons for now (draft)?

erictleung commented 8 years ago

@SamAI-Software sorry I didn't mention that. x is generally for the first dataset, y is for the second dataset. There are some lone y suffixed columns because they are sole from the second dataset. I left those in because of possible discrepancies I want to investigate later or validation (e.g. validating resources other).

My joining is imperfect btw. There are less rows that there should be. This is probably because of the lack of uniqueness in the columns I joined on.

Also, I added another column OneTwoDiff, which is the difference in times in seconds between the end of the first part of the survey (Part1EndTime) and the start of the second part of the survey (Part2StartTime).

I reordered the columns for ease of exploring and debugging. We can change the order later.

SamAI-Software commented 8 years ago

Also, I added another column OneTwoDiff, which is the difference in times in seconds between the end of the first part of the survey (Part1EndTime) and the start of the second part of the survey (Part2StartTime).

:+1:

@erictleung, but you didn't answer, so are you sure that now each row is one person? Just wonder, how did you combine data? By NetworkID? So I can check it later.

erictleung commented 8 years ago

@SamAI-Software oh sorry, it was late for me and I read it too fast. Within my main() function, I joined the two datasets based on the variable key I created.

And yes, each row is supposed to represent a single person. The two IDs for each row are different because each half of the survey has a unique identifier.

key <- c("IsSoftwareDev", "JobPref", "JobApplyWhen", "ExpectedEarning",
         "JobWherePref", "JobRelocate", "BootcampYesNo",
         "MonthsProgramming", "BootcampFinish",
         "BootcampFullJobAfter", "BootcampPostSalary", "BootcampLoan",
         "BootcampRecommend", "MoneyForLearning", "NetworkID",
         "HoursLearning")
allData <- left_join(consistentData$part1, consistentData$part2, by = key)

Also, the draft complete dataset is not transformed at all i.e. all the cleaning I did for the first part hasn't been applied. You've probably noticed this if you've looked at the data.

SamAI-Software commented 8 years ago

@erictleung

Within my main() function, I joined the two datasets based on the variable key I created.

Great, looks solid! :+1:

all the cleaning I did for the first part hasn't been applied.

Yeah, that's sad. Then I will try to check the combining dataset later on, once you apply cleaning from the first part. There are correlated questions like IsSoftwareDev and EmploymentStatus, so we should find bugs if there are some.

freeCodeCamp / 2016-new-coder-survey