davidMcneil / mnist

MNIST data set parser https://crates.io/crates/mnist

Redundant iteration when downloading archives #16

Closed: Sufflope closed this issue 10 months ago

Sufflope commented 11 months ago

download_and_extract iterates over the archives, but download ignores its archive parameter and iterates over all of them again. As a result, the first iteration of the outer loop downloads every needed archive, and each subsequent iteration only re-triggers a useless check of whether each file needs to be downloaded:

Download directory /tmp/mnist/ does not exists. Creating....
Attempting to download and extract train-images-idx3-ubyte.gz...
- Downloading from file from http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz and saving to file as: /tmp/mnist/train-images-idx3-ubyte.gz
9912422 / 9912422 ╢==================================================================================================================================================================================================╟ 100.00 % 51537113.46/s
 - Downloading from file from http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz and saving to file as: /tmp/mnist/train-labels-idx1-ubyte.gz
28881 / 28881 ╢========================================================================================================================================================================================================╟ 100.00 % 473278.10/s
 - Downloading from file from http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz and saving to file as: /tmp/mnist/t10k-images-idx3-ubyte.gz
1648877 / 1648877 ╢==================================================================================================================================================================================================╟ 100.00 % 20324122.82/s
 - Downloading from file from http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz and saving to file as: /tmp/mnist/t10k-labels-idx1-ubyte.gz
4542 / 4542 ╢===========================================================================================================================================================================================================╟ 100.00 % 74586.06/s
 Extracting archive "/tmp/mnist/train-images-idx3-ubyte.gz" to "/tmp/mnist/train-images-idx3-ubyte"...
Attempting to download and extract train-labels-idx1-ubyte.gz...
  File "/tmp/mnist/train-images-idx3-ubyte.gz" already exists, skipping downloading.
  File "/tmp/mnist/train-labels-idx1-ubyte.gz" already exists, skipping downloading.
  File "/tmp/mnist/t10k-images-idx3-ubyte.gz" already exists, skipping downloading.
  File "/tmp/mnist/t10k-labels-idx1-ubyte.gz" already exists, skipping downloading.
Extracting archive "/tmp/mnist/train-labels-idx1-ubyte.gz" to "/tmp/mnist/train-labels-idx1-ubyte"...
Attempting to download and extract t10k-images-idx3-ubyte.gz...
  File "/tmp/mnist/train-images-idx3-ubyte.gz" already exists, skipping downloading.
  File "/tmp/mnist/train-labels-idx1-ubyte.gz" already exists, skipping downloading.
  File "/tmp/mnist/t10k-images-idx3-ubyte.gz" already exists, skipping downloading.
  File "/tmp/mnist/t10k-labels-idx1-ubyte.gz" already exists, skipping downloading.
Extracting archive "/tmp/mnist/t10k-images-idx3-ubyte.gz" to "/tmp/mnist/t10k-images-idx3-ubyte"...
Attempting to download and extract t10k-labels-idx1-ubyte.gz...
  File "/tmp/mnist/train-images-idx3-ubyte.gz" already exists, skipping downloading.
  File "/tmp/mnist/train-labels-idx1-ubyte.gz" already exists, skipping downloading.
  File "/tmp/mnist/t10k-images-idx3-ubyte.gz" already exists, skipping downloading.
  File "/tmp/mnist/t10k-labels-idx1-ubyte.gz" already exists, skipping downloading.
Extracting archive "/tmp/mnist/t10k-labels-idx1-ubyte.gz" to "/tmp/mnist/t10k-labels-idx1-ubyte"...

Note how handling train-images-idx3-ubyte.gz triggers all 4 downloads, and how each of the other 3 iterations then re-checks all 4 files.
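To make the shape of the problem concrete, here is a minimal, self-contained model of the behaviour in the log above. It is not the crate's actual code: ARCHIVES, download_all, download_and_extract_redundant, and the in-memory on_disk list are hypothetical stand-ins, and no network or file access happens.

```rust
/// Hypothetical stand-in for the four archive names seen in the log.
const ARCHIVES: [&str; 4] = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
];

/// Models the reported behaviour: the `_archive` argument is ignored and
/// every archive is visited on each call. `on_disk` simulates the download
/// directory; nothing is actually fetched.
fn download_all(_archive: &str, on_disk: &mut Vec<&'static str>, log: &mut Vec<String>) {
    for name in ARCHIVES {
        if on_disk.contains(&name) {
            log.push(format!("skip {}", name));
        } else {
            on_disk.push(name);
            log.push(format!("download {}", name));
        }
    }
}

/// Outer loop: one call per archive, each of which loops over all archives.
fn download_and_extract_redundant() -> Vec<String> {
    let mut on_disk = Vec::new();
    let mut log = Vec::new();
    for archive in ARCHIVES {
        download_all(archive, &mut on_disk, &mut log);
        log.push(format!("extract {}", archive));
    }
    log
}
```

Calling download_and_extract_redundant() in this model yields 4 downloads during the first outer iteration, followed by 12 redundant skip checks across the remaining three, which mirrors the pattern in the log.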

With my proposed patch:

Download directory /tmp/mnist/ does not exists. Creating....
Attempting to download and extract train-images-idx3-ubyte.gz...
- Downloading from file from http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz and saving to file as: /tmp/mnist/train-images-idx3-ubyte.gz
9912422 / 9912422 ╢==================================================================================================================================================================================================╟ 100.00 % 46574828.67/s
 Extracting archive "/tmp/mnist/train-images-idx3-ubyte.gz" to "/tmp/mnist/train-images-idx3-ubyte"...
Attempting to download and extract train-labels-idx1-ubyte.gz...
- Downloading from file from http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz and saving to file as: /tmp/mnist/train-labels-idx1-ubyte.gz
28881 / 28881 ╢========================================================================================================================================================================================================╟ 100.00 % 714115.69/s
 Extracting archive "/tmp/mnist/train-labels-idx1-ubyte.gz" to "/tmp/mnist/train-labels-idx1-ubyte"...
Attempting to download and extract t10k-images-idx3-ubyte.gz...
- Downloading from file from http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz and saving to file as: /tmp/mnist/t10k-images-idx3-ubyte.gz
1648877 / 1648877 ╢==================================================================================================================================================================================================╟ 100.00 % 16212662.04/s
 Extracting archive "/tmp/mnist/t10k-images-idx3-ubyte.gz" to "/tmp/mnist/t10k-images-idx3-ubyte"...
Attempting to download and extract t10k-labels-idx1-ubyte.gz...
- Downloading from file from http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz and saving to file as: /tmp/mnist/t10k-labels-idx1-ubyte.gz
4542 / 4542 ╢==========================================================================================================================================================================================================╟ 100.00 % 111722.52/s
 Extracting archive "/tmp/mnist/t10k-labels-idx1-ubyte.gz" to "/tmp/mnist/t10k-labels-idx1-ubyte"...

Each archive is checked once, in its own loop iteration.
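For illustration, the one-check-per-iteration behaviour can be modelled as below. This is only a sketch of the idea and not the patch itself: ARCHIVES, download_one, download_and_extract_once, and the in-memory on_disk list are hypothetical names, and no network or file access happens.

```rust
/// Hypothetical stand-in for the four archive names seen in the log.
const ARCHIVES: [&str; 4] = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
];

/// Handles only the archive it is given. Returns true if the archive was
/// (simulated as) downloaded, false if it was already present in `on_disk`.
fn download_one(archive: &'static str, on_disk: &mut Vec<&'static str>) -> bool {
    if on_disk.contains(&archive) {
        return false;
    }
    on_disk.push(archive);
    true
}

/// Outer loop: each iteration checks, downloads, and extracts one archive.
fn download_and_extract_once() -> Vec<String> {
    let mut on_disk = Vec::new();
    let mut log = Vec::new();
    for archive in ARCHIVES {
        let verb = if download_one(archive, &mut on_disk) {
            "download"
        } else {
            "skip"
        };
        log.push(format!("{} {}", verb, archive));
        log.push(format!("extract {}", archive));
    }
    log
}
```

In this model, download_and_extract_once() produces one download entry and one extract entry per archive, with no repeated existence checks.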