amplab / training

Training materials for Strata, AMP Camp, etc
Data samples URL? #145

Open futurechimp opened 10 years ago

futurechimp commented 10 years ago

Hi,

I am interested in completing the tutorials, although I'm not at an Ampcamp (so I don't have access to the AMIs you're using there).

Is there anywhere I can download the Wikipedia data set you're using as the basis of the tutorials? I have looked on the Wikipedia public datasets pages but I don't see anything that looks right. A link to the dataset at the very start of the tutorials would be really helpful.

kayousterhout commented 10 years ago

Did you see these instructions? http://ampcamp.berkeley.edu/big-data-mini-course/launching-a-bdas-cluster-on-ec2.html

These point you to a script that will launch EC2 instances for you and automatically load the data; those will work even if you're not at an AMPCamp. Are those scripts not working for you?

futurechimp commented 10 years ago

Hey, thanks for the pointer - I didn't realize I'd need to actually use the EC2 setup (I have Spark and Shark running locally and I was in a "run it on my machine" mindset when I asked the question).

I'm sure the scripts run fine (and I will try them out to be sure); I was just wondering whether that dataset is publicly available anywhere. If not, I'll grab it off the server and pull it down to my local setup.

shivaram commented 10 years ago

You can get the wiki stats data from S3, in the bucket s3://ampcamp-data/wikistats_20090505-01

petro-rudenko commented 10 years ago

I'm also using a local cluster; it would be nice to provide a public URL for the dataset.

dossett commented 10 years ago

I too think it would be great to have a public URL for the datasets.

etrain commented 10 years ago

The files are publicly accessible - you can copy them down via a tool like s3cmd (https://github.com/s3tools/s3cmd).

Alternatively - the files in that bucket are numbered part-00096 through part-00167. It is possible to access them at a URL like this:

http://ampcamp-data.s3.amazonaws.com/wikistats_20090505-01/part-00167
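
For anyone scripting this, here is a minimal Python sketch that builds the URL for each part file from the pattern above and downloads them one by one. The function names and the `wikistats` destination directory are hypothetical, and it assumes every part from 00096 through 00167 is still served at that URL.

```python
import os
from urllib.request import urlretrieve

# Base URL taken from the example above.
BASE_URL = "http://ampcamp-data.s3.amazonaws.com/wikistats_20090505-01"


def part_urls(first=96, last=167):
    """Return the URL of every part file, part-00096 through part-00167."""
    return [f"{BASE_URL}/part-{n:05d}" for n in range(first, last + 1)]


def download_all(dest_dir="wikistats"):
    """Download each part file into dest_dir (hypothetical directory name)."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in part_urls():
        filename = url.rsplit("/", 1)[-1]
        target = os.path.join(dest_dir, filename)
        if not os.path.exists(target):  # skip files already fetched
            urlretrieve(url, target)
```

Calling `download_all()` would fetch all 72 files; the existence check lets an interrupted run be restarted without re-downloading.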

dossett commented 10 years ago

Thank you! What about the MovieLens data used in the MLlib section?

etrain commented 10 years ago

Those files are small and so we just included them in the AMI - they are available here: http://files.grouplens.org/datasets/movielens/ml-1m.zip
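
If you want to load the ratings locally, a rough Python sketch follows. It assumes the zip contains `ml-1m/ratings.dat` with lines of the form `UserID::MovieID::Rating::Timestamp` (the layout described in the MovieLens 1M README, not stated in this thread); the function names are hypothetical.

```python
import io
import zipfile
from urllib.request import urlopen

# URL taken from the comment above.
MOVIELENS_URL = "http://files.grouplens.org/datasets/movielens/ml-1m.zip"


def parse_rating(line):
    """Split one 'UserID::MovieID::Rating::Timestamp' line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)


def load_ratings(url=MOVIELENS_URL):
    """Download the zip into memory and parse ratings.dat.

    The path inside the archive is an assumption, not confirmed in this thread.
    """
    with urlopen(url) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    with archive.open("ml-1m/ratings.dat") as ratings_file:
        return [parse_rating(raw.decode("latin-1")) for raw in ratings_file]
```

`load_ratings()` returns a list of `(user_id, movie_id, rating, timestamp)` tuples that can be fed into whatever local setup you are using for the MLlib section.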
