PetrochukM / PyTorch-NLP

Basic Utilities for PyTorch Natural Language Processing (NLP)
https://pytorchnlp.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Http 403 when calling FastText() #70

Closed by crleblanc 5 years ago

crleblanc commented 5 years ago

Expected Behavior

Calling FastText() should successfully download data from an S3 bucket.

Actual Behavior

Making a call to FastText() is currently raising the error urllib.error.HTTPError: HTTP Error 403: Forbidden

Steps to Reproduce the Problem

  1. Start a clean Python 3.6 container using Docker with the command docker run -it --rm python:3.6 bash, then launch the python REPL inside it. The behavior should be the same in Python 3.6 and 3.7.
  2. Install the latest pytorch-nlp package: pip install torchvision pytorch-nlp
  3. Run this code to get the HTTPError:
    from torchnlp.word_to_vector import FastText
    vectors = FastText()
  4. This will produce this error:
    wiki.en.vec: 0.00B [00:00, ?B/s]
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/fast_text.py", line 83, in __init__
    super(FastText, self).__init__(name, url=url, **kwargs)
    File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 71, in __init__
    self.cache(name, cache, url=url)
    File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 110, in cache
    download_file_maybe_extract(url=url, directory=cache, check_files=[name])
    File "/usr/local/lib/python3.6/site-packages/torchnlp/download.py", line 160, in download_file_maybe_extract
    urllib.request.urlretrieve(url, filename=filepath, reporthook=_reporthook(t))
    File "/usr/local/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
    File "/usr/local/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
    File "/usr/local/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
    File "/usr/local/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
    File "/usr/local/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
    File "/usr/local/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
    File "/usr/local/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden
  5. Using the `aligned=True` option gives a 404 instead:
    >>> FastText(aligned=True)
    wiki.multi.en.vec: 0.00B [00:00, ?B/s]
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/fast_text.py", line 83, in __init__
    super(FastText, self).__init__(name, url=url, **kwargs)
    File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 71, in __init__
    self.cache(name, cache, url=url)
    File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 110, in cache
    download_file_maybe_extract(url=url, directory=cache, check_files=[name])
    File "/usr/local/lib/python3.6/site-packages/torchnlp/download.py", line 160, in download_file_maybe_extract
    urllib.request.urlretrieve(url, filename=filepath, reporthook=_reporthook(t))
    File "/usr/local/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
    File "/usr/local/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
    File "/usr/local/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
    File "/usr/local/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
    File "/usr/local/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
    File "/usr/local/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
    File "/usr/local/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 404: Not Found
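The two failures in the steps above differ only in the status code at the bottom of the traceback. A minimal sketch of a helper that reduces such a call to just that code, assuming the loader raises `urllib.error.HTTPError` as it does in the tracebacks; `summarize_http_failure` is a hypothetical name and is not part of torchnlp:

```python
import urllib.error


def summarize_http_failure(loader):
    """Call `loader()` and return (status_code, reason) if it raises HTTPError.

    Returns None when the call succeeds. `loader` is any zero-argument
    callable, so the helper makes no assumption about the library under test.
    """
    try:
        loader()
    except urllib.error.HTTPError as error:
        # HTTPError exposes the numeric status as `.code` and the text as `.reason`.
        return error.code, error.reason
    return None
```

With torchnlp installed, it could be called as `summarize_http_failure(lambda: FastText())` and `summarize_http_failure(lambda: FastText(aligned=True))`; going by the tracebacks above, those would be expected to report 403 and 404 respectively.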

The S3 URL in question is defined at https://github.com/PetrochukM/PyTorch-NLP/blob/c432c3e6991352443927da06a59c404e8ea44826/torchnlp/word_to_vector/fast_text.py#L74. Attempting to download the file for language='en' directly with the AWS CLI gives this error:

$ aws s3 cp s3://fasttext-vectors/wiki.en.vec .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

Looking at this bucket in the AWS console gives the error "All access to this object has been disabled". Hopefully it's just a matter of adjusting the bucket permissions/policy.
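For anyone without AWS credentials, the same check can be done over plain HTTP. This is a rough sketch using only the standard library: it sends a HEAD request so nothing is downloaded, and returns the status code instead of raising. `probe_url` and its `opener` parameter are hypothetical names introduced here so the function can be exercised without network access, and some servers may answer HEAD differently from GET.

```python
import urllib.error
import urllib.request


def probe_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code of a HEAD request to `url`.

    `opener` defaults to urllib.request.urlopen and can be replaced with a
    stub for offline testing.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # Error statuses such as 403 and 404 arrive as exceptions;
        # surface them as plain status codes.
        return error.code
```

Passing it the URL built at the line linked above, and any candidate replacement URL, gives a quick way to tell a permissions problem (403) from a missing object (404).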

PetrochukM commented 5 years ago

Hi There!

Looks like the FastText GitHub has been updated... https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md

Their latest commit message: image

Feel free to submit a PR with the new url!

crleblanc commented 5 years ago

Sounds good, I'll make a new PR for this today after I give it a quick manual test for both URLs.