fastai / fastai

The fastai deep learning library
http://docs.fast.ai
Apache License 2.0
26.09k stars 7.54k forks source link

untar_data does not untar .tgz, .gz, or .tar.tgz files #1130

Closed mlsmall closed 5 years ago

mlsmall commented 5 years ago

Untar_data function gives an error "not a gzip file" when trying to download and untar a file. Describe the bug

To Reproduce

untar_data('https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz') untar_data('http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz') untar_data('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz') Expected behavior

I expect the file to be downloaded and untarred Screenshots

Additional context

Getting an error that says "not a gzip file"

odysseus0 commented 5 years ago

What is going on here

Here I will explain the current behavior of untar_data. As for if such behavior is desirable, people who have an opinion can express it here.

Here is an example of the use of untar_data in the lesson 1 notebook:

path = untar_data(URLs.PETS)

From the naming of the variable, it is common to assume that URLs.PETS is an actual url. Looking at the documentation of untar_data, it shows that

untar_data(url:str, fname:PathOrStr=None, dest:PathOrStr=None, data=True)

"Download url if it doesn't exist to fname and un-tgz to folder dest"

So it also suggests that the first argument of untar_data is a url.

If you look into the exact content of URLs.PETS, which is simply a constant field of the class URLs, you will find that it is defined as

'https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet'

However, if you try to visit that url or download the file using command like wget, you will find that this url is invalid. Indeed, the actual address is

'https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz'

with .tgz added at the end.

What I do find confusing here is that, though both the naming URLs and the docs of untar_data suggest that its first argument is a url, URLs.PETS is actually not a valid url. Digging a bit deeper, we found in the source code of download_data function called by untar_url function the following lines

if not fname.exists():
    print(f'Downloading {url}')
    download_url(f'{url}.tgz', fname)

So it is clear that the url variable here is actually not a true url, but one with .tgz missing at the end. We can also infer that the url parameter untar_data takes and every field of URLs class is actually the true url with .tgz removed from the end.

Now going back to your post, to download any dataset ending with .tgz, simply remove the .tgz and throw it into untar_data and it should work. For datasets with other formats, currently it is not supported with untar_data, because regardless of your dataset url and its file format, it always appends .tgz when downloading the data.

sgugger commented 5 years ago

Closing this. @odysseus0 you should put all that paragraph in a PR for the docs of untar_data!

odysseus0 commented 5 years ago

@sgugger Do you think that there is any need to make changes to the fact that URLs class actually does not contain full urls, and the fact that untar_data only supports tgz file url in a non-intuitive way?

Should I create a PR for such feature recommendation?

Sorry. I am still not very familiar with the general guidelines of contributing to open source.

sgugger commented 5 years ago

untar_data is supposed to be an internal function with the fastai datasets. Maybe you could create a new function that will be more general? download_url will work for everything already, I guess it's mostly having a uncompress function that would deal with any format that's missing, but that's not an easy feature.

odysseus0 commented 5 years ago

That makes sense to me. I will start experimenting with it.

However, another issue here is that the setup of untar_data the URLs class gives people the misconception that it can take any url that represents compressed file as an argument. For the very minimum, we can add some more notes to the untar_data, but more ideally, we should also name URLs class instead as FastaiDatasets, so that it is clear it is only an internal function for now and should only be used with Fastai Datasets.

I know this is really minuscule and you probably have way more important things to look after. In such case, should I simply start a PR for it?