wmt_dataset download failed

chloejiwon commented 3 years ago

Expected Behavior

I tried to follow the example in the PyTorch-NLP documentation with the wmt14 dataset (https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html).
I expected the wmt dataset to download successfully.

Actual Behavior

Calling wmt_dataset fails with a [DOWNLOAD FAILED] error.

Steps to Reproduce the Problem

Install pytorch-nlp 0.5.0, then run:
from torchnlp.datasets import wmt_dataset

train=wmt_dataset(train=True)

>>> train = wmt_dataset(train=True)
tar: Error opening archive: Unrecognized archive format
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/torchnlp/datasets/wmt.py", line 63, in wmt_dataset
    download_file_maybe_extract(
  File "/usr/local/lib/python3.9/site-packages/torchnlp/download.py", line 170, in download_file_maybe_extract
    raise ValueError('[DOWNLOAD FAILED] `*check_files` not found')
ValueError: [DOWNLOAD FAILED] `*check_files` not found

ro-ko commented 2 years ago

In torchnlp/download.py

def _download_file_from_drive(filename, url):  # pragma: no cover
    """ Download filename from google drive unless it's already in directory.

    Args:
        filename (str): Name of the file to download to (do nothing if it already exists).
        url (str): URL to download from.
    """
    confirm_token = None

    # Since the file is big, drive will scan it for virus and take it to a
    # warning page. We find the confirm token on this page and append it to the
    # URL to start the download process.
    confirm_token = None
    session = requests.Session()
    response = session.get(url, stream=True)
    for k, v in response.cookies.items():
        if k.startswith("download_warning"):
            confirm_token = v

    if confirm_token:
        url = url + "&confirm=" + confirm_token

    logger.info("Downloading %s to %s" % (url, filename))

    response = session.get(url, stream=True)
    # Now begin the download.
    chunk_size = 16 * 1024
    with open(filename, "wb") as f:
        for chunk in response.iter_content(chunk_size):
            if chunk:
                f.write(chunk)

    # Print newline to clear the carriage return from the download progress
    statinfo = os.stat(filename)
    logger.info("Successfully downloaded %s, %s bytes." % (filename, statinfo.st_size))
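
Note that the function above writes whatever the server returns straight to disk, so an HTML error page ends up saved under the archive's filename. A minimal sketch of a guard that could catch this before writing, assuming a hypothetical helper named looks_like_html (it is not part of torchnlp):

```python
def looks_like_html(content_type, first_chunk):
    """Return True if a download response appears to be an HTML page.

    Hypothetical helper, not part of torchnlp.
    content_type: value of the Content-Type response header, or None.
    first_chunk: the first bytes of the response body.
    """
    # A server that answers with an error page normally labels it text/html.
    if content_type and "text/html" in content_type.lower():
        return True
    # Fall back to sniffing the start of the body for an HTML prologue.
    head = first_chunk.lstrip()[:15].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

Inside _download_file_from_drive, this could be called with response.headers.get("Content-Type") and the first chunk from response.iter_content, raising an error instead of saving the file when it returns True. That is only one possible design, not how the library currently behaves.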

I checked the *check_files that were reported as not found.

Result

data/wmt16_en_de/train.tok.clean.bpe.32000.en
Extracting data/wmt16_en_de/wmt16_en_de.tar.gz
tar: Error opening archive: Unrecognized archive format
data/wmt16_en_de/train.tok.clean.bpe.32000.en

The downloaded file 'data/wmt16_en_de/wmt16_en_de.tar.gz' is not an archive. Its type is: HTML document text, ASCII text
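
The same check can be done from Python. A minimal sketch, assuming a hypothetical helper named inspect_download (not part of torchnlp) that classifies a file already on disk using only the standard library:

```python
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def inspect_download(path):
    """Classify a downloaded file as 'tar', 'gzip', 'html', or 'unknown'.

    Hypothetical helper for diagnosing a failed download.
    """
    # is_tarfile also recognises compressed tar archives such as .tar.gz.
    if tarfile.is_tarfile(path):
        return "tar"
    with open(path, "rb") as f:
        head = f.read(512)
    # Gzip data that does not contain a readable tar archive.
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    # An error page saved under the archive's filename.
    stripped = head.lstrip().lower()
    if stripped.startswith(b"<!doctype html") or stripped.startswith(b"<html"):
        return "html"
    return "unknown"
```

For the file in this report, inspect_download("data/wmt16_en_de/wmt16_en_de.tar.gz") would be expected to return "html" if the saved page starts with an HTML prologue, which matches the tar error above.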

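I opened the file URL 'https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8' used for the wmt dataset in the documentation. It returned a 404 Not Found page.

So this bug is caused by the wmt data URL: it no longer points to the archive, and the HTML error page gets saved in its place.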

maximus12793 commented 2 years ago

Any update on this?

PetrochukM / PyTorch-NLP