mlsmall closed this issue 5 years ago
**What is going on here**

Here I will explain the current behavior of `untar_data`. As for whether such behavior is desirable, people who have an opinion can express it here.

Here is an example of the use of `untar_data` in the lesson 1 notebook:
```python
path = untar_data(URLs.PETS)
```
From the naming of the variable, it is common to assume that `URLs.PETS` is an actual URL. Looking at the documentation of `untar_data`, it shows

```python
untar_data(url:str, fname:PathOrStr=None, dest:PathOrStr=None, data=True)
```

> Download `url` if it doesn't exist to `fname` and un-tgz to folder `dest`

So it also suggests that the first argument of `untar_data` is a URL.
If you look at the exact content of `URLs.PETS`, which is simply a constant field of the `URLs` class, you will find that it is defined as

```python
'https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet'
```

However, if you try to visit that URL or download it with a command like `wget`, you will find that it is invalid. The actual address is

```python
'https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz'
```

with `.tgz` added at the end.
What I find confusing here is that, though both the name `URLs` and the docs of `untar_data` suggest that its first argument is a URL, `URLs.PETS` is actually not a valid URL. Digging a bit deeper, we find the following lines in the source code of the `download_data` function called by `untar_data`:

```python
if not fname.exists():
    print(f'Downloading {url}')
    download_url(f'{url}.tgz', fname)
```
So it is clear that the `url` variable here is not a true URL, but one with `.tgz` missing at the end. We can also infer that the `url` parameter `untar_data` takes, and every field of the `URLs` class, is actually the true URL with `.tgz` removed from the end.
Now going back to your post: to download any dataset whose URL ends with `.tgz`, simply remove the `.tgz`, pass the rest to `untar_data`, and it should work. Datasets in other formats are currently not supported by `untar_data`, because regardless of your dataset URL and its file format, it always appends `.tgz` when downloading the data.
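A minimal sketch of that workaround (the helper `strip_tgz` is hypothetical, not part of fastai):

```python
def strip_tgz(url: str) -> str:
    """Drop a trailing '.tgz' so the result can be passed to untar_data,
    which re-appends '.tgz' itself when downloading."""
    suffix = '.tgz'
    return url[:-len(suffix)] if url.endswith(suffix) else url

# The PETS archive really lives at .../oxford-iiit-pet.tgz:
print(strip_tgz('https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz'))
# -> https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet
```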
Closing this. @odysseus0 you should put all of that paragraph in a PR for the docs of `untar_data`!
@sgugger Do you think there is any need to address the fact that the `URLs` class does not actually contain full URLs, and that `untar_data` only supports `.tgz` file URLs in a non-intuitive way? Should I create a PR for such a feature recommendation? Sorry, I am still not very familiar with the general guidelines for contributing to open source.
`untar_data` is supposed to be an internal function for the fastai datasets. Maybe you could create a new function that would be more general? `download_url` will already work for everything; I guess what's mostly missing is an `uncompress` function that would deal with any format, but that's not an easy feature.
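Such a general `uncompress` could be sketched with the standard library alone (a sketch of the idea, not fastai code; `shutil.unpack_archive` handles zip and the tar family, though not bare `.gz` files):

```python
import shutil
from pathlib import Path

def uncompress(fname, dest):
    """Unpack an archive of any format shutil recognizes
    (.zip, .tar, .tar.gz/.tgz, .tar.bz2, .tar.xz) into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(fname), str(dest))
    return dest
```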
That makes sense to me. I will start experimenting with it. However, another issue here is that the setup of `untar_data` and the `URLs` class gives people the misconception that it can take any URL pointing to a compressed file. At the very minimum, we could add some more notes to the docs of `untar_data`; more ideally, we should also rename the `URLs` class to something like `FastaiDatasets`, so that it is clear it is for internal use only for now and should only be used with the fastai datasets. I know this is really minuscule and you probably have way more important things to look after. In that case, should I simply start a PR for it?
**Describe the bug**
The `untar_data` function gives an error "not a gzip file" when trying to download and untar a file.

**To Reproduce**
```python
untar_data('https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz')
untar_data('http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz')
untar_data('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz')
```

**Expected behavior**
I expect the file to be downloaded and untarred.

**Additional context**
Getting an error that says "not a gzip file".
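The error follows from the behavior worked out earlier in this thread: `untar_data` appends `.tgz` to whatever it is given, so a URL that already ends in an archive extension resolves to a nonexistent file, and the response that comes back is not a gzip file. A sketch of the resulting request (the f-string mirrors the `download_data` snippet quoted above):

```python
url = 'https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz'
requested = f'{url}.tgz'  # what download_data actually fetches
print(requested)
# -> https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz.tgz  (does not exist)
```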