chengchingwen / Transformers.jl

Julia Implementation of Transformer models
MIT License
how to use the IWSLT2016 dataset #72

Open drdozer opened 3 years ago

drdozer commented 3 years ago

Hi - I want to play around with some language translation tasks and saw that you've got Transformers.Datasets.IWSLT.IWSLT2016. How do I interact with this to get data that I can train a model on? I couldn't find anything in the documentation to help me out.

chengchingwen commented 3 years ago

You can find some simple usage examples in the toy example.

Basically,

using Transformers
using Transformers.Datasets # utilities for dataset 
using Transformers.Datasets: IWSLT # IWSLT datasets

# available languages for iwslt2016: :en, :cs, :ar, :fr, :de
src_lang = :en
dst_lang = :de

iwslt2016 = IWSLT.IWSLT2016(src_lang, dst_lang) # Create dataset

# get vocabulary from training data
vocab = get_vocab(iwslt2016)

# create dataset object
# each one is a 2-tuple of channels containing src sentence and dst sentence
training_set = dataset(Train, iwslt2016)
dev_set = dataset(Dev, iwslt2016)
test_set = dataset(Test, iwslt2016) # usually test set won't contain ground truth, but iwslt2016 somehow does

# get data
batch_size = 1
src_sent, dst_sent = get_batch(training_set, batch_size) # each one is a vector of sentences

Once you have run through all the data, get_batch will return an empty vector. At that point you can recreate the dataset object to iterate over the data again.
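As a rough sketch of that drain-and-recreate pattern, here is a self-contained toy version that runs without downloading anything. `make_toy_dataset` and `toy_get_batch` are hypothetical stand-ins written for this illustration, not part of Transformers.jl. They only mimic the behaviour described above: a 2-tuple of channels that is consumed as you read from it. In real code, `dataset(Train, iwslt2016)` and `get_batch` would take their place, and it is worth checking what your installed version actually returns once the data is exhausted.

```julia
# Hypothetical stand-in for `dataset(Train, iwslt2016)`:
# a 2-tuple of channels holding source and target sentences.
function make_toy_dataset()
    src = Channel{String}(4)
    dst = Channel{String}(4)
    for (s, d) in [("hello", "hallo"), ("thanks", "danke"), ("yes", "ja")]
        put!(src, s)
        put!(dst, d)
    end
    close(src)
    close(dst)
    return (src, dst)
end

# Hypothetical stand-in for `get_batch`: takes up to `n` sentence pairs
# and returns empty vectors once both channels are drained.
function toy_get_batch(ds, n)
    src, dst = ds
    src_batch, dst_batch = String[], String[]
    while length(src_batch) < n && isready(src) && isready(dst)
        push!(src_batch, take!(src))
        push!(dst_batch, take!(dst))
    end
    return src_batch, dst_batch
end

# Run several epochs, recreating the dataset at the start of each one
# because the channels from the previous epoch are already consumed.
function run_epochs(n_epochs, batch_size)
    seen = Int[]
    for epoch in 1:n_epochs
        ds = make_toy_dataset()
        n_seen = 0
        while true
            src_sent, dst_sent = toy_get_batch(ds, batch_size)
            isempty(src_sent) && break      # drained: this epoch is over
            n_seen += length(src_sent)      # a real loop would train on the batch here
        end
        push!(seen, n_seen)
    end
    return seen
end

seen = run_epochs(2, 2)
```

The point of the sketch is that the dataset object is created inside the epoch loop. Creating it once outside the loop would give a full first epoch and empty ones after that.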

maj0e commented 2 years ago

The above example fails with the following error message:

┌ Info: Downloading
│   source = "https://wit3.fbk.eu/archive/2016-01//texts/en/de/en-de.tgz"
│   dest = "/home/markus/.julia/datadeps/IWSLT2016 en-de/en-de.tgz"
│   progress = NaN
│   time_taken = "0.05 s"
│   time_remaining = "NaN s"
│   average_speed = "2.141 MiB/s"
│   downloaded = "105.240 KiB"
│   remaining = "∞ B"
└   total = "∞ B"
ERROR: LoadError: HTTP.ExceptionRequest.StatusError(404, "GET", "/archive/2016-01//texts/en/de/en-de.tgz", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=utf-8
X-Frame-Options: DENY
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Sun, 13 Feb 2022 09:22:44 GMT
Cross-Origin-Opener-Policy: unsafe-none
Content-Security-Policy: base-uri 'self';object-src 'none';report-uri /_/view/cspreport;script-src 'nonce-iuQ6rOqUw/NwssS2azzWNQ' 'unsafe-inline' 'unsafe-eval';worker-src 'self';frame-ancestors https://google-admin.corp.google.com/
Referrer-Policy: origin
Server: ESF
X-XSS-Protection: 0
X-Content-Type-Options: nosniff
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

Looking at the IWSLT website, it seems that the datasets have moved to Google Drive instead.

chengchingwen commented 2 years ago

Looks like they no longer provide file links for specific translation pairs. We would need to rewrite the datadeps based on that.

maj0e commented 2 years ago

I thought I could fix this quickly by changing the download link and adapting the post_fetch_method to search for the translation pairs in the right subfolder, but it seems like DataDeps.jl doesn't support downloading from Google Drive (or maybe I did something wrong). From a quick glance at DataDeps.jl, I found an issue discussing this topic.

chengchingwen commented 2 years ago

@maj0e the issue has been moved to #85