facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

ERROR in download-data.sh #11

Closed nxphi47 closed 5 years ago

nxphi47 commented 5 years ago

Thank you for this project and the paper.

I have an issue when running bash download-data.sh

I think the error happens at line 155 when it tries to download the file https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz

When opened in a web browser, the link appears to be dead.

The line: download_data $DATA/en-hi.tgz "https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz"

Downloading https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz
--2019-09-27 17:18:36--  https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz
Resolving www.cse.iitb.ac.in (www.cse.iitb.ac.in)... 103.21.127.134
Connecting to www.cse.iitb.ac.in (www.cse.iitb.ac.in)|103.21.127.134|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz [following]
--2019-09-27 17:18:38--  https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz
Resolving anoopk.in (anoopk.in)... 184.168.131.241
Connecting to anoopk.in (anoopk.in)|184.168.131.241|:443... connected.
ERROR: no certificate subject alternative name matches
    requested host name ‘anoopk.in’.
To connect to anoopk.in insecurely, use `--no-check-certificate'.
https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz not successfully downloaded.

Thank you,

ghost commented 5 years ago

I got the same error. I tried adding --no-check-certificate to the wget call in download_data:

# Download data
download_data() {
  CORPORA=$1
  URL=$2

  if [ -f "$CORPORA" ]; then
    echo "$CORPORA already exists, skipping download"
  else
    echo "Downloading $URL"
    # --no-check-certificate skips TLS certificate verification (workaround)
    wget --no-check-certificate "$URL" -O "$CORPORA" || rm -f "$CORPORA"
    if [ -f "$CORPORA" ]; then
      echo "$URL successfully downloaded."
    else
      echo "$URL not successfully downloaded."
      rm -f "$CORPORA"
      exit 1
    fi
  fi
}

However, I still couldn't download the data correctly. I think the server, anoopk.in, has some problems.

vishrav commented 5 years ago

The server seems to be back up again. Please reopen if you are still observing this issue.

ghost commented 5 years ago

The server seems to have changed the directory where the data is hosted.

I found this: http://www.cfilt.iitb.ac.in/~moses/iitb_en_hi_parallel/dataset.html

It requires filling in a form before downloading the data. After submitting the form, the data can be downloaded, but the URL is not the one that download-data.sh assumes.

ghost commented 5 years ago

If we access the URL under ~anoopk rather than ~moses (https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz), it redirects to https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz. However, the anoopk.in server is not stable, and I still can't download parallel.tgz from there. Even with the --no-check-certificate option added, the downloaded file might not be the correct one.
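
One way to catch the "downloaded file might not be the correct one" case is to check that the file really is a gzip-compressed tar archive before using it. The following is a minimal sketch, not part of download-data.sh: the function name verify_tgz is hypothetical, and it only confirms the file is a readable .tgz, not that it contains the expected corpus.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not from download-data.sh): succeeds
# only if the given file is a readable gzip-compressed tar archive.
verify_tgz() {
  local archive=$1
  # A missing or empty file is treated as a failed download.
  [ -s "$archive" ] || return 1
  # gzip -t tests the integrity of the compressed stream without extracting.
  gzip -t "$archive" 2>/dev/null || return 1
  # tar -tzf lists the members; it fails if the payload is not a tar archive.
  tar -tzf "$archive" >/dev/null 2>&1 || return 1
}

# Possible usage after a download (path taken from the script's en-hi call):
#   verify_tgz "$DATA/en-hi.tgz" || { echo "en-hi.tgz looks corrupt"; exit 1; }
```

A check like this would reject an HTML error page or a truncated transfer saved under the .tgz name, which a plain file-exists test accepts.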

ghost commented 5 years ago

Okay, I checked the code. https://github.com/facebookresearch/flores/blob/f9f84a239bb6fa9e0168e6faaead93921d56a85a/download-data.sh#L155

This new URL seems to work. Thanks!