General data cleaning

supermdat commented 7 years ago

A series of scripts to organize the data and clean the categorical variables.

This commit contains 9 files that each begin with a number (e.g., 1_...). This number indicates the order in which the files should be run.

1_func_dist.R -- a function to calculate string distances
2_func_dist_ByFrstChr.R -- a function to calculate string distances when the text variable has too many unique entries (roughly more than 10,000)
3_func_TopMisSpel.R -- a function to show misspellings in a text variable
4_NB_GeneralOrgMinorClean.Rmd -- a notebook that downloads and organizes the data, and does some light cleaning
5_NB_CleanChrVar_Category -- a notebook cleaning the "category" variable
6_NB_CleanChrVar_Office -- a notebook cleaning the "office" variable
7_NB_CleanChrVar_Program -- a notebook cleaning the "program" variable
8_NB_CleanChrVar_Purpose -- a notebook cleaning the "purpose" variable
9_NB_CleanChrVar_Payee -- a notebook cleaning the "payee" variable (NOTE: some portions of this notebook took 20-30 minutes to run on my laptop)
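
For readers without the scripts at hand, here is a minimal sketch of the string-distance idea behind the first two files. It is not the code from this commit: the function names, the `max_dist` threshold, and the sample values are all hypothetical, and it uses base R's `utils::adist()` rather than whatever the actual scripts call. Reading "ByFrstChr" as "compare only values that share a first character" is also an assumption based on the file name.

```r
# Hypothetical helper: list pairs of distinct values that are within
# `max_dist` edits of each other (generalized Levenshtein distance).
near_duplicates <- function(x, max_dist = 2) {
  vals <- sort(unique(x[!is.na(x)]))
  d <- utils::adist(vals)  # full pairwise distance matrix
  # Keep each pair once by looking only at the upper triangle
  idx <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)
  data.frame(
    value_a  = vals[idx[, 1]],
    value_b  = vals[idx[, 2]],
    distance = d[idx],
    stringsAsFactors = FALSE
  )
}

# Hypothetical variant for variables with many unique values: compare
# only values that share a first character, so each distance matrix
# stays small. Misspellings in the first character are missed.
near_duplicates_by_first_chr <- function(x, max_dist = 2) {
  vals <- unique(x[!is.na(x) & nzchar(x)])
  groups <- split(vals, substr(vals, 1, 1))
  out <- do.call(rbind, lapply(groups, near_duplicates, max_dist = max_dist))
  rownames(out) <- NULL
  out
}

# Made-up values, not taken from the expenditure data
office <- c("FRANKED MAIL", "FRANKED MALL", "TRAVEL", "TRAVLE", "SUPPLIES")
pairs_all <- near_duplicates(office)
pairs_by_chr <- near_duplicates_by_first_chr(office)
print(pairs_all)
```

The grouped variant trades completeness for speed: the full distance matrix grows with the square of the number of unique values, which is presumably why a separate function exists for the larger variables.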

restrellado commented 7 years ago

Thanks for working on this! A few things so far:

1. Love the file naming convention. Great idea!
2. Running the function files works great, but I run into a little trouble when I get to 4_NB_GeneralOrgMinorClean.Rmd. Somewhere in the process my working directory changes from house_expenditures, where my local .Rproj file is kept, to NewAnalysis. Not sure if that's a misunderstanding on my part about the way .Rproj files deal with working directories, but it resulted in an error downstream and then again in 5_NB_CleanChrVar_Category.Rmd when sourcing 1_func_dist.R.
3. Do we need to use dir.create to make the directory ProcessedData in 4_NB_GeneralOrgMinorClean.Rmd? R threw an error there for me because I didn't have that directory.
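
A minimal sketch of the dir.create idea, assuming the notebook only needs the output folder to exist before it writes to it. The helper name `ensure_dir` is hypothetical, and the demo uses a temporary folder so it can be run anywhere; in the notebook the path would be wherever ProcessedData is expected. One possible cause of the working-directory change (not confirmed for this repo) is that knitr and RStudio notebooks evaluate chunks in the directory of the .Rmd file by default, not in the project root.

```r
# Hypothetical helper: create a folder only if it does not exist yet.
# recursive = TRUE also creates any missing parent folders.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}

# Demo in a temporary location (stand-in for the real ProcessedData folder)
demo_dir <- file.path(tempdir(), "ProcessedData")
ensure_dir(demo_dir)
ensure_dir(demo_dir)  # safe to call again; nothing happens the second time

print(getwd())              # shows where relative paths are resolved from
print(dir.exists(demo_dir))
```

Checking `getwd()` at the top of a chunk is a quick way to see which folder relative paths such as "ProcessedData" will resolve against.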

supermdat commented 7 years ago

No problem at all! Like most people, I can't say I enjoy data cleaning, but it's absolutely essential for good analyses later on, and I'm excited that we're close enough to do some cooler things now! I actually just started exploring the amount variable a bit ;-)

For your points 2 and 3, I think both come down to my folder structure and my lack of understanding of how it carries over to GitHub. The folder structure only matters in the code because when I finish a particular type of analysis/cleaning, I save the output in case it's needed again.

So if we comment out any code that has saveRDS I think we should be fine.
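
As an alternative to commenting lines out by hand, saving could be switched off in one place. This is only a sketch: `save_outputs` and `save_if_enabled` are hypothetical names that do not appear in the notebooks, and the demo writes to a temporary file.

```r
# Hypothetical switch: set to TRUE only when intermediate files are wanted
save_outputs <- FALSE

# Hypothetical wrapper around saveRDS() that does nothing when saving is off.
# Returns TRUE (invisibly) if the file was written, FALSE otherwise.
save_if_enabled <- function(object, path, enabled = save_outputs) {
  if (!enabled) {
    return(invisible(FALSE))
  }
  saveRDS(object, path)
  invisible(TRUE)
}

# Demo with a temporary file and a built-in data set
tmp <- tempfile(fileext = ".rds")
skipped <- save_if_enabled(mtcars, tmp)                  # no file written
written <- save_if_enabled(mtcars, tmp, enabled = TRUE)  # file written
restored <- readRDS(tmp)
```

With this pattern, each saveRDS call in a notebook would become a `save_if_enabled` call, and collaborators without the same folder layout could leave the switch at FALSE.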

If we'd like to mimic my folder structure, I'm not sure of the best way to do that, but dir.create seems like it should work.

Basically, it breaks out like this:

Where my .Rproj lives: supermdat_general_exploration

Where all of my .R, .Rmd, and .nb.html files live: supermdat_general_exploration/Code/VariableCleaning/NewAnalyses

Where all of my data live: supermdat_general_exploration/ProcessedData

For this aspect, I'm totally open to whatever works best for everyone.

restrellado commented 7 years ago

Let's try just commenting out the saveRDS parts and see if that fixes it. We'll go the simpler route first and if that doesn't work then we can try some code that creates the necessary folders. Look forward to seeing it!

supermdat commented 7 years ago

Hi @restrellado.

Just want to check in and ask how this went for you. Did commenting out the saveRDS portions allow this to run?

I'm continuing with cleaning/analyses related to "date" and "amounts", so there's no urgency. I just wanted to check in.

Feel free to let me know if anything is needed on my end ;-)

restrellado commented 7 years ago

I updated and moved everything over to #31

Data4Democracy / house_expenditures

General data cleaning #29