WorldBank-Transport / open-transit-indicators

An open-source tool to support transport agencies in planning and managing public transit systems

GNU General Public License v3.0

WIP - Station Stats CSV #640

Closed: moradology closed this 9 years ago

ddohler commented 9 years ago

+1; I'm not in love with storing giant CSV blobs in the database, but if there aren't any other workable options then that's probably the best way to go.

This could use another set of eyes though since it's fairly long and I haven't exercised my Scala muscles in a while; maybe @notthatbreezy could glance at it?

ddohler commented 9 years ago

It would also be good to try this out on a largish system (e.g. Zhengzhou) to get a sense for what kind of database resources this uses.

notthatbreezy commented 9 years ago

Looks good to me too, but I have the same question as @ddohler about why you chose the database to store it instead of the filesystem. Seems like this CSV is akin to the other job indicators, which we just drop into the filesystem, keyed on the scenario/job ID. I do worry about the larger systems like Chicago or Zhengzhou, which have many more stops than Philadelphia.

moradology commented 9 years ago

Why the concern about bytes in Postgres? I used it so that I could store some information about the settings with which the CSV was processed. Databases store images that are many times larger than any CSV we'll be creating could conceivably be, so I think this is probably alright.

notthatbreezy commented 9 years ago

I don't agree -- that's why we were wondering if you ran this on Chicago or Zhengzhou and how big the files are. Storing large blobs in the DB can adversely affect DB performance. Storing images and files in a DB is usually eschewed in favor of storing the file metadata in the database and the file itself in the filesystem, a pretty common pattern that we use regularly. There is a decent rundown of the pluses, minuses, and considerations involved in going down this path here.
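
To make the pattern described above concrete, here is a minimal sketch in Python. It is not code from this PR: `sqlite3` stands in for Postgres so the example runs on its own, and the table name `station_csv_meta`, the file naming scheme, and the `buffer_m` setting used in the usage note are all hypothetical. The CSV is written to disk keyed on the job ID, and only its path, size, checksum, and processing settings go into the database.

```python
import hashlib
import json
import sqlite3
from pathlib import Path


def init_db(conn):
    """Create the (hypothetical) metadata table; the CSV bytes never go in it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS station_csv_meta ("
        "job_id INTEGER PRIMARY KEY, "
        "path TEXT NOT NULL, "
        "size_bytes INTEGER NOT NULL, "
        "sha256 TEXT NOT NULL, "
        "settings TEXT NOT NULL)"
    )


def save_csv(conn, storage_dir, job_id, csv_text, settings):
    """Write the CSV to the filesystem and record only its metadata in the DB."""
    storage_dir = Path(storage_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)
    # Key the file on the job ID, as suggested in the thread.
    path = storage_dir / f"station_stats_{job_id}.csv"
    data = csv_text.encode("utf-8")
    path.write_bytes(data)
    conn.execute(
        "INSERT OR REPLACE INTO station_csv_meta "
        "(job_id, path, size_bytes, sha256, settings) VALUES (?, ?, ?, ?, ?)",
        (
            job_id,
            str(path),
            len(data),
            hashlib.sha256(data).hexdigest(),
            json.dumps(settings, sort_keys=True),
        ),
    )
    conn.commit()
    return path


def load_csv(conn, job_id):
    """Return (csv_text, settings) for a job, or None if nothing was stored."""
    row = conn.execute(
        "SELECT path, settings FROM station_csv_meta WHERE job_id = ?",
        (job_id,),
    ).fetchone()
    if row is None:
        return None
    path, settings = row
    return Path(path).read_text(encoding="utf-8"), json.loads(settings)
```

Calling `save_csv(conn, "/some/dir", 7, csv_text, {"buffer_m": 500})` and later `load_csv(conn, 7)` round-trips both the file and the settings. The trade-off is that the file and its metadata row can drift apart (for example, if the file is deleted by hand), which the in-database approach avoids.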

moradology commented 9 years ago

Assuming 2 bytes per char and 50 chars per line (that's a very generous estimate) and Chicago's roughly 12,000 stops, the file will be about 10 MB. I really doubt that storing a single 20 MB (for argument's sake) file will have any negative consequences. I don't think the storage of binary data in the DB is so verboten that it shouldn't be used for a simple case like this. And again, it will only hold one file as is.
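
The back-of-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the per-character and per-line figures come from the comment, while the one-row-per-stop layout is an assumption, since the thread does not say how many CSV rows each stop produces.

```python
def estimate_csv_bytes(rows, chars_per_line=50, bytes_per_char=2):
    """Rough upper bound on CSV size: rows * chars per line * bytes per char."""
    return rows * chars_per_line * bytes_per_char


def rows_within_budget(budget_bytes, chars_per_line=50, bytes_per_char=2):
    """How many rows fit in a byte budget under the same per-line assumptions."""
    return budget_bytes // (chars_per_line * bytes_per_char)


chicago_stops = 12_000  # figure quoted in the thread
size = estimate_csv_bytes(chicago_stops)  # assumes one CSV row per stop
print(f"{size} bytes = {size / 1e6:.1f} MB")  # prints "1200000 bytes = 1.2 MB"
```

With exactly one row per stop, these figures give roughly 1.2 MB, so the 10 MB quoted above leaves room for several rows per stop; `rows_within_budget(10_000_000)` returns 100000 rows under the same per-line assumptions. An estimate like this does not replace measuring the file produced for a real large system.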

ddohler commented 9 years ago

Yeah -- I'm not necessarily against doing this, but I think we should make sure that it can successfully handle a big transit system since it's unfamiliar territory.

notthatbreezy commented 9 years ago

Cool, we probably aren't going to run into issues with it; it doesn't seem like this use case calls for using the database either, though.

jbranigan commented 9 years ago

Was this tested with Chicago, or just estimated? Testing is better before saying it's done.