amplab / SparkNet

Distributed Neural Networks for Spark
MIT License
603 stars 172 forks

TODO steps in the Readme for Imagenet example. #59

Closed rahulbhalerao001 closed 8 years ago

rahulbhalerao001 commented 8 years ago

The steps for running the Imagenet example have the line "Tar the validation files by running" followed by a TODO note. However, at the ImageNet site, the ILSVRC2012_img_val.tar file is directly available.

Does any more preprocessing need to be done, or can it be uploaded to S3 directly?

Similarly, for the training data there are two tars available at the ImageNet site: Training images (Task 1 & 2), 138GB, and Training images (Task 3), 728MB.

It is not clear from the Readme exactly which tar to use for this example.

Could you please shed some light on the above two points? If this seems a fair point, then once I have the clarification I would be interested in making this change to the Readme as a small contribution.

robertnishihara commented 8 years ago

Hi Rahul,

Thanks for the comment! We need to clarify the instructions. In its current form, the following instructions are a bit verbose to put directly in the README, but for now, perhaps you could submit a pull request that replaces the relevant portions of the README with a reference to this issue? Also, if you have any ideas for simplifying this procedure, we'd appreciate that! Let us know if you have any questions about the instructions. We do the following:

Training Data

  1. Download the training data: `wget {URL}/ILSVRC2012_img_train.tar`.
  2. Untar the training data: `tar xvf /imgnet/ILSVRC2012_img_train.tar`. This should give you about 1000 tar files, each of which is around 200MB.
  3. (Optional) Shuffle the training data, that is, untar the tar files, shuffle the images, and retar them in a way that preserves the original structure. We do this because the training data is unshuffled (images of the same class appear consecutively).
  4. Upload these tar files to S3 under `s3://sparknet/ILSVRC2012_train/`.
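The optional shuffle in step 3 could be sketched roughly as below. This is a hedged sketch rather than the script we actually used: the directory layout, the two tiny per-class tars, and the 3-files-per-chunk size are illustrative stand-ins (in practice you would chunk to roughly 200MB per tar).

```shell
#!/bin/sh
# Sketch: untar the per-class tars into one pool, shuffle the pooled
# image list, and retar into fixed-size chunks. All names illustrative.
set -e
work=$(mktemp -d)
mkdir "$work/in" "$work/pool" "$work/out"

# Stand-in for the ~1000 per-class tars: two tiny tars of dummy files.
for cls in n001 n002; do
  mkdir "$work/$cls"
  for i in 1 2 3; do echo img > "$work/$cls/${cls}_$i.JPEG"; done
  tar -cf "$work/in/$cls.tar" -C "$work" "$cls"
done

# 1. Untar everything into a single pool directory.
for t in "$work"/in/*.tar; do tar -xf "$t" -C "$work/pool"; done

# 2. Shuffle the flat list of image paths.
find "$work/pool" -name '*.JPEG' | sort -R > "$work/shuffled.txt"

# 3. Retar in fixed-size chunks (3 files per tar here; ~200MB in practice).
split -l 3 "$work/shuffled.txt" "$work/chunk."
n=0
for chunk in "$work"/chunk.*; do
  tar -cf "$work/out/train.$n.tar" -T "$chunk"
  n=$((n + 1))
done

ls "$work/out"
```

The resulting `train.*.tar` files are what step 4 uploads to S3.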

Validation Data

  1. Download the validation data: `wget {URL}/ILSVRC2012_img_val.tar`.
  2. Untar the validation data: `tar xvf ILSVRC2012_img_val.tar`. This produces a bunch of images with names like `val/ILSVRC2012_val_00000001.JPEG`.
  3. Retar the validation data to create a bunch of tar files, each around 200MB in size. We do this because it is faster to pull a small number of large files from S3 than to pull a large number of small files.

    tar -c -M --tape-length=200M --file /tmp/pseudo-tape.tar --new-volume-script=/tmp/new-volume.sh --volno-file=/tmp/volno /imagenet/*.JPEG
    mv /tmp/pseudo-tape.tar val.33.tar

    where new-volume.sh is:

    #!/bin/bash
    # Invoked by tar each time a 200MB volume fills up; renames the
    # completed volume using the current volume number from the volno file.
    dir="/tmp"
    base_name="pseudo-tape.tar"
    next_volume_name=$(echo -n "validation."; cat "$dir/volno")
    echo "moving $dir/$base_name to $dir/$next_volume_name.tar"
    mv "$dir/$base_name" "$dir/$next_volume_name.tar"
  4. Upload these tar files to S3 under `s3://sparknet/ILSVRC2012_test/`.

Labels

  1. Run this Caffe script https://github.com/BVLC/caffe/blob/master/data/ilsvrc12/get_ilsvrc_aux.sh to obtain auxiliary information for the ImageNet dataset. The training labels are in `train.txt` and the validation labels are in `val.txt`.
  2. Upload `train.txt` to S3 as `s3://sparknet/train.txt`.
  3. Upload `val.txt` to S3 as `s3://sparknet/test.txt`. Note that we switched the filename, which was probably a bad idea.
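If I remember correctly, those label files are in Caffe's plain two-column format, one `<image name> <integer label>` pair per line. Assuming that format, a quick sanity check like the one sketched below can catch a mangled file before it goes to S3 (the two label lines here are illustrative stand-ins, not real labels):

```shell
#!/bin/sh
# Sketch: verify a Caffe-style label file is well-formed before uploading.
# Each line should be "<image name> <integer label>".
set -e
f=$(mktemp)
cat > "$f" <<'EOF'
ILSVRC2012_val_00000001.JPEG 65
ILSVRC2012_val_00000002.JPEG 970
EOF

# Count total lines and any lines not matching the two-field format.
lines=$(wc -l < "$f")
bad=$(grep -c -v -E '^[^ ]+\.JPEG [0-9]+$' "$f" || true)
echo "lines=$lines bad=$bad"
```

A nonzero `bad` count would indicate the file was corrupted or is not in the expected format.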

Note that we use the S3 bucket name `sparknet` throughout, so you will probably need to change this in your code.

rahulbhalerao001 commented 8 years ago

Hello Robert,

Thank you for your detailed response. As you said, for now I will put a reference to this issue in the Readme and submit a pull request.

The command and script for packing the files into multiple tars did not work for me; they failed with an "Argument list too long" error. So instead I used a script to create subdirectories with around 2000 images each and then tarred each of these subdirectories. This created 25 tars, each around 250MB.
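In sketch form, the workaround looks like the following (this is a hedged illustration, not my exact script: the dummy files, the tiny chunk size of 2, and all paths are stand-ins; in practice the chunk size was around 2000 images):

```shell
#!/bin/sh
# Sketch: move images into subdirectories of N files each, then tar
# each subdirectory, avoiding "Argument list too long" on a huge glob.
set -e
work=$(mktemp -d)
mkdir "$work/val" "$work/out"

# Stand-in for the validation images.
for i in 1 2 3 4 5; do echo img > "$work/val/val_$i.JPEG"; done
chunk_size=2

n=0; count=0
for f in "$work"/val/*.JPEG; do
  if [ "$count" -eq 0 ]; then
    dir="$work/chunk_$n"
    mkdir "$dir"
  fi
  mv "$f" "$dir/"
  count=$((count + 1))
  if [ "$count" -eq "$chunk_size" ]; then
    tar -cf "$work/out/val.$n.tar" -C "$work" "chunk_$n"
    n=$((n + 1)); count=0
  fi
done
# Tar the final partial chunk, if any.
if [ "$count" -gt 0 ]; then
  tar -cf "$work/out/val.$n.tar" -C "$work" "chunk_$n"
fi

ls "$work/out"
```

With 5 dummy files and a chunk size of 2 this produces three tars; with the real data and ~2000 images per chunk it produced the 25 tars mentioned above.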

Thanks, Rahul

robertnishihara commented 8 years ago

Excellent, glad to hear that worked!