CityOfPhiladelphia / open-budget-data-transformer

Master repo of data cleanup scripts for the various budget cycles

Why the switch from node to python? #7

Closed mikecasas closed 8 years ago

mikecasas commented 8 years ago

Just wondering why the FY16 transform uses Node, but FY17 switches to Python?

Is there some performance reason?

timwis commented 8 years ago

Hey @mikecasas! Thanks for checking out our code, and sorry that some of it is in a "work in progress" state!

The reason we switched to Python for data cleanup is primarily about maintenance. There are more people in city government who have some Python experience than have Node.js experience because (a) Python has been around longer, and (b) our GIS analysts need to use it occasionally to interact with Esri software programmatically. Python is also easier to get started with and to read, in my opinion, largely because of JavaScript's asynchronous nature.

Our goal is to write code that someone who's never met us could diagnose and maintain without speaking to us. I can't promise this repo is the best example of that because (a) it's a combination of 4 repos and I haven't had time to clean it up, and (b) I'm still new to Python. But even just comparing the fy16-adopted folder to the fy-17adopted cleanup script hopefully conveys a little of what I mean (again, pardon the laughable variable name, it was late!).

Having said all that, we still use JavaScript all the time for our web stuff. It's my favorite language personally, and I use it daily. I would love to hear your thoughts and feedback. Maybe there are things we haven't considered. Thanks again for reaching out!

mikecasas commented 8 years ago

Thanks @timwis for the reply. I love your goal and totally agree with it.

Interesting comment on Python experience for city government. I work for a local city and we are a MS shop, so .NET is prevalent. I am familiar with JS and Node being a web dev, but I see Python out in the open. I think either language is good - whatever gets the job done.

My initial thought was that you had come up against a performance barrier because your data set was becoming too large, and that's what caused the switch to Python. I think for this simple case of formatting the data, neither language is too difficult for someone to pick up. I would lean toward Node because that is what I am familiar with.

You all are doing a great job, so don't apologize about the "work in progress" nature - truly all software is a wip.

Thanks again.

timwis commented 8 years ago

Thanks @mikecasas! Yeah, the biggest data we've dealt with has been about 6-10 million records. Node or Python should work fine at that scale. The best thing you can do IMO is use streams and operate on one record at a time, rather than loading the whole file into memory. Of course there are some occasions when you can't do that (sorting, aggregating, etc.), but we can usually get away with it, which makes the scripts scalable to a file of theoretically any size.
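
For anyone reading along, here is a minimal sketch of the one-record-at-a-time idea in Python, using only the standard library `csv` module. It is not taken from the scripts in this repo: the function names `clean_record` and `transform_stream`, the whitespace-trimming cleanup, and the `department`/`amount` columns are all hypothetical and only there for illustration.

```python
import csv
import io


def clean_record(record):
    """Hypothetical per-record cleanup: trim whitespace from string fields."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }


def transform_stream(source, destination):
    """Copy CSV rows from source to destination, cleaning one row at a time.

    Both arguments are text file-like objects. Only the current row is held
    in memory, so the input size is not limited by available RAM.
    Returns the number of data rows written.
    """
    reader = csv.DictReader(source)
    writer = csv.DictWriter(destination, fieldnames=reader.fieldnames)
    writer.writeheader()
    count = 0
    for record in reader:  # the reader yields rows lazily, one per iteration
        writer.writerow(clean_record(record))
        count += 1
    return count


# Small in-memory demonstration; real use would pass open file handles.
example_input = io.StringIO("department,amount\n Police , 100 \nFire,200\n")
example_output = io.StringIO()
rows_written = transform_stream(example_input, example_output)
```

With real files you would pass handles opened with `newline=""`, as the `csv` documentation recommends. Sorting or aggregating would still need the full data set (or an external tool), as noted above.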