dssg / cta-sim

Big data simulation of Chicago's public transportation to improve transit planning and reduce bus crowding
http://dssg.io/projects#cta

Which database/what technology #4

Closed · brandonwillard closed this issue 11 years ago

brandonwillard commented 11 years ago

We need to put all our data (bsht ~20 GB + apc ~200 GB) on some SQL server. Open questions: a managed service on Amazon, a database server we run ourselves on an instance, or both? Talk amongst yourselves...

rayidghani commented 11 years ago

If you don't care about which SQL engine and it isn't much data, use RDS (it's essentially MySQL on Amazon).

VL-Carlson commented 11 years ago

We need to standardize across projects.

brandonwillard commented 11 years ago

Does Hive (and something else) answer this question? If so, let's close this issue.

jtbates commented 11 years ago

@rayidghani Can you give your opinion on whether Hive with S3 makes the most sense for our project? All the technologies are a little dizzying - Hive, Pig, HBase, RDS, SimpleDB, DynamoDB, Redshift... We have ~200 GB of data and more coming. It's been dumped from a database, and we want to put it back into something that gives us an SQL interface and the ability to perform joins. We also want to be able to run Hadoop jobs on it directly.
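
For concreteness, here is a minimal sketch of the workflow being asked about: bulk-load the dump from S3 into a warehouse (Redshift syntax shown) and join with plain SQL. All table, column, and bucket names below are hypothetical, not the project's.

```sql
-- Hypothetical sketch: bulk-load a dumped extract from S3, then join it.
-- Table, column, and bucket names are illustrative only.
CREATE TABLE apc_raw (
    bus_id     INTEGER,
    route_id   VARCHAR(16),
    stop_id    INTEGER,
    obs_time   TIMESTAMP,
    boardings  INTEGER
);

-- COPY pulls delimited files straight from an S3 prefix into the table.
COPY apc_raw
FROM 's3://example-bucket/apc/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '|' GZIP;

-- The "SQL interface and joins" requirement, as ordinary SQL:
SELECT s.route_id, s.trip_id, a.obs_time, a.boardings
FROM apc_raw a
JOIN bsht_schedule s
  ON a.route_id = s.route_id AND a.stop_id = s.stop_id;
```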

hunterowens commented 11 years ago

An uninformed comment on the basis of having talked to Rayid about this a few times: I think Redshift or RDS is going to be your answer based on data size. DynamoDB is a NoSQL key-value store, so I don't think it will be of much use.

RDS is MySQL on an instance, and (I think) handles under 1 TB of data well.

Redshift is automatically clustered PostgreSQL instances (see the sketch below). Handles > 1 TB of data well.

SimpleDB is a simple NoSQL key-value store.

Can't comment on Hive, HBase, Pig, etc.
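
To make the Redshift point concrete, a hedged sketch of what the DDL might look like: the clustering shows up as distribution and sort keys on otherwise ordinary PostgreSQL-style tables. Table and column names are hypothetical.

```sql
-- Hypothetical Redshift DDL sketch. DISTKEY spreads rows across the
-- cluster's nodes; SORTKEY orders them on disk for range scans.
CREATE TABLE apc_observations (
    route_id   VARCHAR(16),
    stop_id    INTEGER,
    obs_time   TIMESTAMP,
    boardings  INTEGER
)
DISTKEY (route_id)
SORTKEY (obs_time);
```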

Edit: I think the answer is going to be Redshift.

mdagost commented 11 years ago

You should be able to handle many TBs in a Redshift cluster pretty easily and query it with regular SQL. I don't know that you can easily run Hadoop jobs on data in Redshift, though.

Hive isn't straight SQL; it's HiveQL, which is different. You should be able to run Hadoop jobs over the data you have stored there.
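
As a hedged illustration of the difference: HiveQL can declare an external table directly over files in S3, and queries against it compile to Hadoop (MapReduce) jobs rather than running on a database engine. Bucket and column names are hypothetical.

```sql
-- Hypothetical HiveQL sketch: a table over raw files in S3.
-- EXTERNAL TABLE, ROW FORMAT, and LOCATION are Hive-specific, not standard SQL.
CREATE EXTERNAL TABLE apc_raw (
    bus_id     INT,
    route_id   STRING,
    stop_id    INT,
    boardings  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://example-bucket/apc/';

-- This aggregation runs as a MapReduce job over the S3 files.
SELECT route_id, SUM(boardings) AS total_boardings
FROM apc_raw
GROUP BY route_id;
```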

Are you certain that you need to run Hadoop jobs rather than regular SQL in Redshift? What are those use-cases? My hunch is that Redshift with regular SQL is what you'd want here, unless you have really unstructured data from the CTA. But even in that case, I think you might want to run Hadoop jobs to parse the unstructured data and then load it into Redshift for regular querying there...

DynamoDB is a different use-case entirely, and you probably have too much data for RDS.

jtbates commented 11 years ago

Post-mortem: We started out with Hive, which worked for the first large join we needed, but we were never able to get it to work with a more complicated join. So we did it in Redshift, and it was much faster (about 10 minutes for the more complicated join, whereas the simpler one took ~5 hours in Hive). We couldn't do partitioning with Redshift when unloading to S3, so we tried to use Hive for that. It seems like it should have been simple, since we were doing partitioning as part of the original join that worked, but I was never able to get it to run. We ended up abandoning pulling the data from S3 for the model fitting altogether, and we now get it directly from the Redshift database with RODBC.
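
For reference, the unload step looks roughly like the sketch below: at the time, Redshift's UNLOAD wrote flat file parts to an S3 prefix with no Hive-style partitioning clause, which is the limitation described above. The query, bucket, and credential values are hypothetical.

```sql
-- Hypothetical sketch of the UNLOAD step: Redshift writes query results
-- to S3 as flat file parts; there was no PARTITION BY option here.
UNLOAD ('SELECT * FROM joined_apc_bsht')
TO 's3://example-bucket/unload/joined_'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '|' GZIP;
```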

TLDR: Hunter and Michelangelo were right. It would have saved a lot of pain if we had gone with Redshift in the first place.