supermdat closed this 7 years ago
Thanks for working on this! A few things so far:
1. Love the file naming convention. Great idea!
2. Running the function files works great, but I run into a little trouble when I get to 4_NB_GeneralOrgMinorClean.Rmd. Somewhere in the process my working directory changes from house_expenditures, where my local .Rproj file is kept, to NewAnalysis. Not sure if that's a misunderstanding on my part about the way .Rproj files deal with working directories, but it resulted in an error downstream and then again in 5_NB_CleanChrVar_Category.Rmd when sourcing 1_func_dist.R.
3. Do we need to use dir.create to make the directory ProcessedData in 4_NB_GeneralOrgMinorClean.Rmd? R threw an error there for me because I didn't have that directory.
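One way the dir.create idea could look, as a minimal sketch only: the helper name save_processed and its arguments are hypothetical and do not come from the notebooks, and the demo writes to a temporary folder rather than the real ProcessedData directory.

```r
# Hypothetical helper (not part of the PR): save an object as .rds,
# creating the output folder first if it is missing.
save_processed <- function(object, file_name, out_dir = "ProcessedData") {
  # dir.create() warns when the folder already exists, so check first
  if (!dir.exists(out_dir)) {
    dir.create(out_dir, recursive = TRUE)
  }
  out_path <- file.path(out_dir, file_name)
  saveRDS(object, out_path)
  invisible(out_path)
}

# Demo in a throwaway folder so nothing is written into the project
demo_dir <- file.path(tempdir(), "ProcessedData")
saved_to <- save_processed(data.frame(amount = c(10.5, 20)),
                           "example.rds", out_dir = demo_dir)
restored <- readRDS(saved_to)
```

With something like this in place, the saveRDS calls would not depend on the folder already existing on each contributor's machine.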
No problem at all! I think that like most people, I can't say I enjoy doing data cleaning, but it's absolutely essential to have good analyses later on...and I'm excited that we're close enough to do some cooler things now! I actually just started looking at the amount variable to explore a bit ;- )
For your No 2 and No 3, I think both have to do with my filing structure, and my lack of understanding of how that flows to GitHub. The filing structure is really only important in the code because when I finish a particular type of analysis/cleaning, I save the output just in case it might be needed again.
So if we comment out any code that has saveRDS, I think we should be fine.
If we'd like to mimic my filing structure, I'm not sure of the best method to do that, but dir.create seems like it should be ok.
Basically, it breaks out like this:
Where my .Rproj lives: supermdat_general_exploration
Where all of my .R, .Rmd, and .nb.html files live: supermdat_general_exploration/Code/VariableCleaning/NewAnalyses
Where all of my data live: supermdat_general_exploration/ProcessedData
For this aspect, I'm totally open to whatever works best for everyone.
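A possible way to make the scripts independent of the working directory, sketched under assumptions: R Markdown documents normally run their chunks from the folder that contains the .Rmd file, which may be why the working directory appeared to change to the notebook folder. The helper find_project_root below is hypothetical (it is not in the PR); it walks up from a starting folder until it finds a .Rproj file, so paths can be built from the project root. The demo uses a made-up temporary project that mirrors the layout described above.

```r
# Hypothetical helper (not part of the PR): walk up from a starting folder
# until a folder containing a .Rproj file is found.
find_project_root <- function(start = getwd()) {
  current <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (length(list.files(current, pattern = "\\.Rproj$")) > 0) {
      return(current)
    }
    parent <- dirname(current)
    if (identical(parent, current)) {
      stop("No .Rproj file found at or above ", start)
    }
    current <- parent
  }
}

# Demo with a throwaway project that mimics the described layout
demo_root <- file.path(tempdir(), "demo_project")
nested <- file.path(demo_root, "Code", "VariableCleaning", "NewAnalyses")
dir.create(nested, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "demo_project.Rproj"))
found_root <- find_project_root(nested)

# In a notebook, paths could then be built from the root, for example:
# root <- find_project_root()
# source(file.path(root, "Code", "VariableCleaning", "NewAnalyses", "1_func_dist.R"))
# processed_dir <- file.path(root, "ProcessedData")
```

The design choice here is to locate the root once and pass full paths to source and saveRDS, rather than calling setwd inside the notebooks.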
Let's try just commenting out the saveRDS parts and see if that fixes it. We'll go the simpler route first and if that doesn't work then we can try some code that creates the necessary folders. Look forward to seeing it!
Hi @restrellado.
Just want to check in to ask how this went for you. Did commenting out the saveRDS portion allow this to run?
I'm continuing with cleaning/analyses related to "date" and "amounts", so there's no urgency. But I just want to check in.
Feel free to let me know of anything needed on my end ;- )
I updated and moved everything over to #31
A series of scripts to organize the data and clean the categorical variables.
This commit contains 9 files that each begin with a number (e.g., 1_...). This number indicates the order in which the files should be run.
1_func_dist.R -- a function to calculate string distances
2_func_dist_ByFrstChr.R -- a function to calculate string distances when the text variable has too many (roughly more than 10,000) unique entries
3_func_TopMisSpel.R -- a function to show misspellings in a text variable
4_NB_GeneralOrgMinorClean.Rmd -- a notebook that downloads and organizes the data, and does some light cleaning
5_NB_CleanChrVar_Category -- a notebook cleaning the "category" variable
6_NB_CleanChrVar_Office -- a notebook cleaning the "office" variable
7_NB_CleanChrVar_Program -- a notebook cleaning the "program" variable
8_NB_CleanChrVar_Purpose -- a notebook cleaning the "purpose" variable
9_NB_CleanChrVar_Payee -- a notebook cleaning the "payee" variable (NOTE: some portions of this notebook took 20-30 minutes to run on my laptop)
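To illustrate the general idea behind grouping by first character before computing string distances, here is a rough sketch. It is not the code in 2_func_dist_ByFrstChr.R: the function name dist_by_first_chr and the example values are made up, and it uses utils::adist (generalized Levenshtein distance) as a stand-in for whatever distance the actual functions use.

```r
# Hypothetical sketch: compute pairwise edit distances only within groups of
# values that share a first character, so no single distance matrix has to
# cover every unique value at once.
dist_by_first_chr <- function(x) {
  values <- unique(x[!is.na(x) & nzchar(x)])
  groups <- split(values, toupper(substr(values, 1, 1)))
  lapply(groups, function(g) {
    d <- utils::adist(g, ignore.case = TRUE)
    dimnames(d) <- list(g, g)
    d
  })
}

# Made-up example values, not taken from the expenditure data
categories <- c("travel", "travle", "supplies", "suplies", "rent")
distances <- dist_by_first_chr(categories)
distances$S
```

The trade-off of this kind of blocking is that a misspelling in the first character ends up in a different group and is never compared with its intended spelling.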