chengchingwen / Transformers.jl

Julia Implementation of Transformer models
MIT License

WMT Download broken #132

Open kailukowiak opened 1 year ago

kailukowiak commented 1 year ago

I'm trying to run the tutorial in Transformers.jl/example/AttentionIsAllYouNeed/wmt14/train.jl, and when I run the following, the dataset download fails with a 404:

julia> using Transformers.Datasets

julia> using Transformers.Datasets: WMT

julia> wmt14 = WMT.GoogleWMT()
Transformers.Datasets.WMT.GoogleWMT()

julia> word_counts = get_vocab(wmt14)
This program has requested access to the data dependency Google-WMT en-de.
which is not currently installed. It can be installed automatically, and you will not see this message again.

"""shows in wmt14 of torchtext
The WMT 2014 English-German dataset, as preprocessed by Google Brain.

Though this download contains test sets from 2015 and 2016, the train set
differs slightly from WMT 2015 and 2016 and significantly from WMT 2017.
"""

contain bpe training set and news testset from 2009~2016 (include origin text,
tokenized, and bpe versions), and also a bpe.32000 and vocab.32000 (merged vocab)

Do you want to download the dataset from https://docs.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8 to "/home/<name>/.julia/datadeps/Google-WMT en-de"?
[y/n]
y
ERROR: HTTP.Exceptions.StatusError(404, "HEAD", "/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8&confirm=pbef", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=utf-8
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Thu, 02 Mar 2023 06:53:49 GMT
Content-Length: 1642
Strict-Transport-Security: max-age=31536000
Cross-Origin-Opener-Policy: same-origin; report-to="DriveUntrustedContentHttp"
Content-Security-Policy: script-src 'report-sample' 'nonce-WeEppd-iCOhoEnLwLNRXaA' 'unsafe-inline';object-src 'none';base-uri 'self';report-uri /_/DriveUntrustedContentHttp/cspreport;worker-src 'self', require-trusted-types-for 'script';report-uri /_/DriveUntrustedContentHttp/cspreport
Accept-CH: Sec-CH-UA-Arch, Sec-CH-UA-Bitness, Sec-CH-UA-Full-Version, Sec-CH-UA-Full-Version-List, Sec-CH-UA-Model, Sec-CH-UA-WoW64, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version
Permissions-Policy: ch-ua-arch=*, ch-ua-bitness=*, ch-ua-full-version=*, ch-ua-full-version-list=*, ch-ua-model=*, ch-ua-wow64=*, ch-ua-platform=*, ch-ua-platform-version=*
Report-To: {"group":"DriveUntrustedContentHttp","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/DriveUntrustedContentHttp/external"}]}
Server: ESF
X-XSS-Protection: 0
X-Content-Type-Options: nosniff
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000

""")
Stacktrace:
  [1] download_gdrive(url::String, localdir::String)
    @ Fetch ~/.julia/packages/Fetch/6DlaY/src/gdrive.jl:63
  [2] gdownload(url::String, localdir::String)
    @ Fetch ~/.julia/packages/Fetch/6DlaY/src/gdrive.jl:113
  [3] run_fetch
    @ ~/.julia/packages/DataDeps/ae6dT/src/resolution_automatic.jl:99 [inlined]
  [4] download(datadep::DataDeps.DataDep{String, String, typeof(Fetch.gdownload), typeof(DataDeps.unpack)}, localdir::String; remotepath::String, i_accept_the_terms_of_use::Nothing, skip_checksum::Bool)
    @ DataDeps ~/.julia/packages/DataDeps/ae6dT/src/resolution_automatic.jl:78
  [5] download
    @ ~/.julia/packages/DataDeps/ae6dT/src/resolution_automatic.jl:63 [inlined]
  [6] handle_missing
    @ ~/.julia/packages/DataDeps/ae6dT/src/resolution_automatic.jl:10 [inlined]
  [7] _resolve
    @ ~/.julia/packages/DataDeps/ae6dT/src/resolution.jl:83 [inlined]
  [8] resolve(datadep::DataDeps.DataDep{String, String, typeof(Fetch.gdownload), typeof(DataDeps.unpack)}, inner_filepath::String, calling_filepath::String)
    @ DataDeps ~/.julia/packages/DataDeps/ae6dT/src/resolution.jl:29
  [9] resolve(datadep_name::String, inner_filepath::String, calling_filepath::String)
    @ DataDeps ~/.julia/packages/DataDeps/ae6dT/src/resolution.jl:54
 [10] resolve
    @ ~/.julia/packages/DataDeps/ae6dT/src/resolution.jl:73 [inlined]
 [11] #get_vocab#2
    @ ~/.julia/packages/Transformers/nIgPX/src/datasets/translate/google_wmt.jl:49 [inlined]
 [12] get_vocab(::Transformers.Datasets.WMT.GoogleWMT)
    @ Transformers.Datasets.WMT ~/.julia/packages/Transformers/nIgPX/src/datasets/translate/google_wmt.jl:47
 [13] top-level scope
    @ REPL[9]:1

I presume this is because the download location has moved, but I'm not sure.

chengchingwen commented 1 year ago

Yeah, the link seems to be down. We'll need to find a different source for the dataset.
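For anyone who wants to confirm this without going through the DataDeps prompt, here is a minimal sketch that issues a HEAD request with Julia's Downloads standard library. The helper names `head_status` and `is_reachable` are made up for illustration, and the sketch assumes a Julia version recent enough that `Downloads.request` accepts the `throw` and `timeout` keywords.

```julia
using Downloads

# Hypothetical helper: send a HEAD request and return the HTTP status code,
# or `nothing` if no HTTP response was received at all (DNS failure, timeout, ...).
function head_status(url::AbstractString; timeout::Real = 30)
    result = Downloads.request(url; method = "HEAD", throw = false, timeout = timeout)
    # With `throw = false` a failure comes back as a RequestError that wraps
    # whatever partial response was received.
    resp = result isa Downloads.RequestError ? result.response : result
    return resp.status == 0 ? nothing : Int(resp.status)
end

# Treat 2xx and 3xx as "the file is probably still there".
is_reachable(status) = status !== nothing && 200 <= status < 400

# Example call (needs network access), using the URL from the prompt above:
# status = head_status("https://docs.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8")
# println(is_reachable(status) ? "link is up" : "link is down (status = $status)")
```

A status of 404, as in the stack trace above, would mean the file itself is gone or no longer shared, rather than a problem in Fetch or DataDeps.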

kailukowiak commented 1 year ago

Would this link work? https://www.statmt.org/europarl/v7/de-en.tgz I got it from https://www.statmt.org/europarl/

If that would work I'd be happy to throw in a trivial PR if you'd like.

chengchingwen commented 1 year ago

That seems to be a different corpus. It would be better to find a new official source for the WMT dataset. Personally, I don't have a strong intention to add a new dataset, but if you want, you could add a new example that handles the download itself and uses the Europarl corpus.
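A rough sketch of what such a standalone example could start from, using only the standard library. This is not how Transformers.jl loads WMT, and Europarl is a different corpus from the Google-preprocessed WMT14 data, so results would not be comparable with the tutorial. The helper names `fetch_europarl` and `read_parallel` are hypothetical, the sketch assumes a `tar` executable is on the PATH, and the extracted file names in the usage comment are an assumption that should be checked after unpacking.

```julia
using Downloads

# URL taken from the comment above.
const EUROPARL_URL = "https://www.statmt.org/europarl/v7/de-en.tgz"

# Hypothetical helper: download the archive into `dir` (if it is not already
# there) and unpack it with the system `tar`. Assumes `tar` is on the PATH.
function fetch_europarl(dir::AbstractString)
    mkpath(dir)
    archive = joinpath(dir, "de-en.tgz")
    isfile(archive) || Downloads.download(EUROPARL_URL, archive)
    run(`tar -xzf $archive -C $dir`)
    return dir
end

# Pair up the lines of two sentence-aligned files, skipping any pair in which
# either side is blank. Stops at the end of the shorter file.
function read_parallel(src_path::AbstractString, tgt_path::AbstractString)
    out = Tuple{String,String}[]
    open(src_path) do src
        open(tgt_path) do tgt
            for (s, t) in zip(eachline(src), eachline(tgt))
                (isempty(strip(s)) || isempty(strip(t))) && continue
                push!(out, (s, t))
            end
        end
    end
    return out
end

# Example usage (large download; the file names below are assumed):
# dir = fetch_europarl(joinpath(homedir(), "europarl-v7"))
# data = read_parallel(joinpath(dir, "europarl-v7.de-en.en"),
#                      joinpath(dir, "europarl-v7.de-en.de"))
```

Registering the archive as a DataDep instead would match how the existing datasets in the package are fetched. The plain-download version is used here only to keep the sketch free of extra dependencies.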

tobefreeman commented 1 year ago

I have been unable to precompile Transformers.jl as of this week. Does anyone know why?

chengchingwen commented 1 year ago

@tobefreeman Please open a new issue and provide the error message.

tobefreeman commented 1 year ago

> @tobefreeman Please open a new issue and provide the error message.

I can no longer reproduce my issue. Sorry, @chengchingwen.