Open chloejiwon opened 3 years ago
In torchnlp/download.py
def _download_file_from_drive(filename, url):  # pragma: no cover
    """ Download filename from google drive unless it's already in directory.

    Args:
        filename (str): Name of the file to download to (do nothing if it already exists).
        url (str): URL to download from.
    """
    confirm_token = None

    # Since the file is big, drive will scan it for virus and take it to a
    # warning page. We find the confirm token on this page and append it to the
    # URL to start the download process.
    confirm_token = None
    session = requests.Session()
    response = session.get(url, stream=True)
    for k, v in response.cookies.items():
        if k.startswith("download_warning"):
            confirm_token = v

    if confirm_token:
        url = url + "&confirm=" + confirm_token

    logger.info("Downloading %s to %s" % (url, filename))

    response = session.get(url, stream=True)
    # Now begin the download.
    chunk_size = 16 * 1024
    with open(filename, "wb") as f:
        for chunk in response.iter_content(chunk_size):
            if chunk:
                f.write(chunk)

    # Print newline to clear the carriage return from the download progress
    statinfo = os.stat(filename)
    logger.info("Successfully downloaded %s, %s bytes." % (filename, statinfo.st_size))
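Note that the function above writes whatever the second request returns, so an error page would be saved under the archive's filename. A minimal sketch of a guard, assuming a hypothetical helper named response_is_html_error (not part of torchnlp) that is fed the status code and Content-Type of the final response:

```python
def response_is_html_error(status_code, content_type):
    """Return True when a response looks like an error or HTML page, not a file.

    Hypothetical helper for illustration; it is not part of torchnlp.
    """
    # Any non-2xx status means the download failed outright.
    if not 200 <= status_code < 300:
        return True
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    # An HTML body is a warning or error page rather than an archive.
    return media_type == "text/html"
```

With requests, the two arguments could come from response.status_code and response.headers.get("Content-Type", "") on the second session.get call, before anything is written to disk. The first response may legitimately be an HTML warning page, so the check only makes sense on the final one.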
I checked which entry of check_files was not found.
Result
data/wmt16_en_de/train.tok.clean.bpe.32000.en
Extracting data/wmt16_en_de/wmt16_en_de.tar.gz
tar: Error opening archive: Unrecognized archive format
data/wmt16_en_de/train.tok.clean.bpe.32000.en
The downloaded file 'data/wmt16_en_de/wmt16_en_de.tar.gz' is identified as "HTML document text, ASCII text", not a tar archive.
Opening the file URL 'https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8' from the documentation for the WMT dataset returns a 404 Not Found page.
So this bug is caused by the WMT data URL in the documentation.
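For anyone who wants to confirm the same thing without the file command, here is a small sketch that checks the gzip magic bytes of the downloaded file. The helper name looks_like_gzip is hypothetical and only meant for illustration:

```python
# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

Calling looks_like_gzip("data/wmt16_en_de/wmt16_en_de.tar.gz") on a file that actually contains an HTML page should return False, which matches the tar error above.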
Any update on this?
Expected Behavior
Actual Behavior
Steps to Reproduce the Problem
from torchnlp.datasets import wmt_dataset
train = wmt_dataset(train=True)