Data4Democracy / house_expenditures

Cleaning and standardizing the variable "Program." #17

Closed supermdat closed 7 years ago

supermdat commented 7 years ago

This PR uses the Jaro-Winkler distance to measure how far apart the unique entries for "program" are, builds a lookup table to clean the "program" variations, and then joins that table back to the main table.
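A minimal sketch of that workflow in R, for readers who want to see the shape of it. It assumes the stringdist and dplyr packages; the sample program names, the 0.15 threshold, and the rule of mapping each spelling to the most frequent close spelling are illustrative assumptions, not necessarily what the code in this PR does.

```r
library(dplyr)
library(stringdist)

# Hypothetical sample data; the real table has many more rows and columns.
expenditures <- tibble(
  program = c("OFFICIAL EXPENSES", "OFFICIAL EXPENSES", "OFFICIAL EXPENSES",
              "OFFICAL EXPENSES", "OFFICIAL EXPENSE", "LEADERSHIP"),
  amount = c(100, 250, 75, 40, 60, 500)
)

# Unique spellings, most frequent first, so common spellings act as canonical.
program_counts <- expenditures %>% count(program, sort = TRUE)
programs <- program_counts$program

# Pairwise Jaro-Winkler distances (0 = identical, 1 = no similarity).
jw <- stringdistmatrix(programs, programs, method = "jw", p = 0.1)

# Lookup table: map each spelling to the most frequent spelling within the
# threshold. The cutoff is an assumption; tune it by inspecting the matches.
threshold <- 0.15
closest_canonical <- apply(jw <= threshold, 1, function(close) which(close)[1])
program_lookup <- tibble(
  program = programs,
  program_clean = programs[closest_canonical]
)

# Join the cleaned names back onto the main table.
expenditures_clean <- expenditures %>%
  left_join(program_lookup, by = "program")
```

Reviewing `program_lookup` by eye before joining is a cheap way to catch distinct programs that happen to fall under the threshold.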

restrellado commented 7 years ago

Great work! Thank you again for all the time you've put into this. It's coming along great! I'm going to leave it open for a bit in case others have feedback.

I had a few thoughts, just for consideration:

  • I think loading tidyverse automatically loads magrittr and ggplot2, so you can save a couple lines there :)
  • Should we document somewhere the order that the scripts need to be run? I figured it out eventually, but it might be quicker for folks who are new to the repo. For example, 2017-05-21-supermdat_ CleanChrVars_Program.Rmd needed the object SpellAdjustOffice in memory before running.

supermdat commented 7 years ago

Thanks! And thanks for the tidyverse tip ;-)

Yeah, I was actually thinking about how it would be best to run things. I'm relatively new to all things Git, so I don't know what's best there, but I'm pretty much open to anything.

Some ideas I had were:

For Option 1, I think the benefit is that everything is compartmentalized and a bit easier to digest. The downside is the sequencing that you pointed out, and the need to build some internal functions within each file (e.g., the distance function I used is created in each file).

For Option 2, it's basically the exact opposite. The end result would be a much longer file that could get unwieldy. But everything could be run together once, and there would be no duplication in creating functions.

In the end, I'm up for whatever's easier for the group or on your end. Either option (or others) works for me ;-)

restrellado commented 7 years ago

All really good points! I'm pretty new to this too, so I was googling around and found this Stack Overflow post: https://stackoverflow.com/questions/1266279/how-to-organize-large-r-programs. Doing this as an R package could be a fascinating endeavor, but I'm not sure if that's overkill and I've never done one myself. Of the two options, I tend to lean towards the separate scripts because they're easier to digest. Maybe after it's all done we do a really short script that uses source() to run the separate, longer scripts in the correct sequence. I've seen people use a script like this that they call do.R or something similar. I'll request a review from @ehbick01 and @dwillis too. Thanks again!

supermdat commented 7 years ago

Sounds good to me :-)

Thanks!

dwillis commented 7 years ago

In general, I'm in favor of using separate scripts as well. This is looking really good!

restrellado commented 7 years ago

Ok then! Let's keep it going. @supermdat let's go with the separate file strategy. Maybe when it's all done we can have a simple script that sources the cleaning scripts in order and outputs the CSV. Great work!
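For reference, a rough sketch of what such a driver script could look like. The script names and their contents here are hypothetical stand-ins written to a temp directory so the example runs on its own; they are not the repo's actual files.

```r
# Stand-in scripts (hypothetical names and contents, not the repo's files).
script_dir <- tempfile("cleaning_scripts_")
dir.create(script_dir)
writeLines(
  'expenditures <- data.frame(office = c("OFICE A", "OFFICE B"), program = c("LEADERSHP", "LEADERSHIP"))',
  file.path(script_dir, "01_load.R")
)
writeLines(
  'expenditures$office <- sub("^OFICE", "OFFICE", expenditures$office)',
  file.path(script_dir, "02_clean_office.R")
)
writeLines(
  'expenditures$program <- sub("^LEADERSHP$", "LEADERSHIP", expenditures$program)',
  file.path(script_dir, "03_clean_program.R")
)

# do.R: run the cleaning scripts in a fixed, documented order.
scripts <- file.path(
  script_dir,
  c("01_load.R", "02_clean_office.R", "03_clean_program.R")
)
pipeline_env <- new.env()
for (script in scripts) {
  message("Running ", basename(script))
  # Every script is evaluated in the same environment, so later scripts
  # can use objects created by earlier ones.
  source(script, local = pipeline_env)
}

# Write the cleaned table out as a CSV.
output_path <- file.path(script_dir, "expenditures_clean.csv")
write.csv(pipeline_env$expenditures, output_path, row.names = FALSE)
```

One caveat: source() runs plain .R files, so .Rmd notebooks would first need their code extracted (for example with knitr::purl()) or be run with rmarkdown::render() instead.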

supermdat commented 7 years ago

Cool! Let's do it!
