Closed rahulbhalerao001 closed 8 years ago
Hi Rahul,
Thanks for the comment! We need to clarify the instructions. In its current form, the following instructions are a bit verbose to put directly in the README, but for now, perhaps you could submit a pull request that replaces the relevant portions of the README with a reference to this issue? Also, if you have any ideas for simplifying this procedure, we'd appreciate that! Let us know if you have any questions about the instructions. We do the following:
Training Data
1. Download the training data:
   wget {URL}/ILSVRC2012_img_train.tar
2. Untar it:
   tar xvf /imgnet/ILSVRC2012_img_train.tar
   This should give you about 1000 tar files, each of which is around 200MB.
3. Upload these tar files to S3 under s3://sparknet/ILSVRC2012_train/.
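Before uploading, it can help to confirm that the extraction produced roughly the expected number of tar files. The following is a minimal sketch, not part of the original procedure; the function name count_tars and the directory passed to it are assumptions for illustration.

```shell
# Hypothetical helper: count the .tar files sitting directly in a directory.
count_tars() {
  # find lists the files itself, so no shell glob is expanded;
  # tr strips the padding that some wc implementations add.
  find "$1" -maxdepth 1 -type f -name '*.tar' | wc -l | tr -d ' '
}

# Example (the path is an assumption; use wherever you extracted the archive):
#   count_tars /imgnet
```

If the outer ILSVRC2012_img_train.tar sits in the same directory, it is counted as well, so the result may be one higher than the number of extracted files.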
Validation Data
1. Download the validation data:
   wget {URL}/ILSVRC2012_img_val.tar
2. Untar it:
   tar xvf ILSVRC2012_img_val.tar
   This produces a bunch of images with names like val/ILSVRC2012_val_00000001.JPEG.
3. Retar the validation data to create a bunch of tar files, each about 200MB in size. We do this because it is faster to pull a small number of large files from S3 than to pull a large number of small files.
   tar -c -M --tape-length=200M --file /tmp/pseudo-tape.tar --new-volume-script=/tmp/new-volume.sh --volno-file=/tmp/volno /imagenet/*.JPEG
   Once tar finishes, the last volume is still named pseudo-tape.tar, so rename it by hand:
   mv pseudo-tape.tar val.33.tar
   where new-volume.sh is:
   #!/bin/bash
   # Called by tar each time a volume fills up.
   dir="/tmp"
   base_name="pseudo-tape.tar"
   # Build the new name from the volume number that tar records in $dir/volno.
   next_volume_name="validation.$(cat "$dir/volno")"
   echo "moving $dir/$base_name to $dir/$next_volume_name.tar"
   mv "$dir/$base_name" "$dir/$next_volume_name.tar"
4. Upload the resulting tar files to S3 under s3://sparknet/ILSVRC2012_test/.
Labels
1. The training labels are in train.txt and the validation labels are in val.txt.
2. Upload train.txt to S3 under s3://sparknet/train.txt.
3. Upload val.txt to S3 under s3://sparknet/test.txt. Note that we switched the filename, which was probably a bad idea.
Note that we use the S3 bucket name sparknet throughout, so you will probably need to change this in your code.
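To make the bucket name easy to swap, the S3 layout described above can be written down once with the bucket as a parameter. This is only a sketch: it assumes the AWS CLI is what you use for uploads (the thread does not say which tool was used), and the function name print_upload_plan and its directory arguments are hypothetical. It prints the commands rather than running them.

```shell
# Hypothetical helper: print (do not run) AWS CLI commands for the layout above.
# Arguments: bucket name, directory of training tars, directory of validation
# tars, directory holding train.txt and val.txt.
print_upload_plan() {
  local bucket="$1" train_dir="$2" val_dir="$3" labels_dir="$4"
  echo "aws s3 cp $train_dir s3://$bucket/ILSVRC2012_train/ --recursive"
  echo "aws s3 cp $val_dir s3://$bucket/ILSVRC2012_test/ --recursive"
  echo "aws s3 cp $labels_dir/train.txt s3://$bucket/train.txt"
  # val.txt is stored under the name test.txt, matching the switch noted above.
  echo "aws s3 cp $labels_dir/val.txt s3://$bucket/test.txt"
}

# Example (all paths and the bucket name are assumptions):
#   print_upload_plan my-bucket /data/train_tars /data/val_tars /data/labels
```

Review the printed commands and run them by hand once the paths look right.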
Hello Robert,
Thank you for your detailed response. As you said, for now I will put a reference to this issue in the README and submit a pull request.
The command and script for packing the files into multiple tars did not work for me; it failed with an error that the argument list was too long. So instead I used a script to create subdirectories with around 2000 images each and then tarred each of these subdirectories. This created 25 tars, each around 250MB in size.
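For anyone hitting the same error, here is a minimal sketch of a similar workaround. It is not the script used above: instead of creating subdirectories, it splits a file listing into groups and passes each group to tar with -T, so the shell never expands a long glob. The function name pack_in_chunks and the example paths are assumptions.

```shell
# Hypothetical helper: pack the *.JPEG files in src_dir into tar files holding
# at most chunk_size images each, written to out_dir as prefix.N.tar.
# Prints the number of tar files created.
pack_in_chunks() {
  local src_dir="$1" out_dir="$2" chunk_size="$3" prefix="$4"
  local work list n=0
  work="$(mktemp -d)" || return 1
  mkdir -p "$out_dir"
  # find prints the names itself, so the shell never expands a long glob.
  (cd "$src_dir" && find . -maxdepth 1 -type f -name '*.JPEG' | sort) > "$work/all.txt"
  # Split the listing into pieces of chunk_size lines each.
  split -l "$chunk_size" "$work/all.txt" "$work/chunk."
  for list in "$work"/chunk.*; do
    [ -e "$list" ] || continue   # no images were found
    n=$((n + 1))
    # -T reads member names from the list; -C makes them relative to src_dir.
    tar -cf "$out_dir/$prefix.$n.tar" -C "$src_dir" -T "$list"
  done
  rm -rf "$work"
  echo "$n"
}

# Example (paths are assumptions; use absolute paths for both directories):
#   pack_in_chunks /imagenet /tmp/val_tars 2000 val
```

With the 50,000 ILSVRC2012 validation images and a chunk size of 2000, this would produce 25 tar files, the same count as reported above.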
Thanks, Rahul
Excellent, glad to hear that worked!
The steps for running the ImageNet example have the line "Tar the validation files by running" followed by a TODO note. However, at the ImageNet site, the ILSVRC2012_img_val.tar file is directly available.
Does any more preprocessing need to be done, or can it be uploaded directly to S3?
Similarly, for the training data there are two tars available at the ImageNet site: Training images (Task 1 & 2), 138GB, and Training images (Task 3), 728MB.
It is not clear from the README exactly which tar to use for this example.
Could you please shed some light on the above two points? If this seems a fair point, then once I have the clarification I would be interested in making this change to the README as a small contribution.