Open drdozer opened 3 years ago
You can find some simple usages in the toy example.
Basically,
using Transformers
using Transformers.Datasets # utilities for dataset
using Transformers.Datasets: IWSLT # IWSLT datasets
# available language for iwslt2016: :en, :cs, :ar, :fr, :de
src_lang = :en
dst_lang = :de
iwslt2016 = IWSLT.IWSLT2016(src_lang, dst_lang) # Create dataset
# get vocabulary from training data
vocab = get_vocab(iwslt2016)
# create dataset object
# each one is a 2-tuple of channels containing the src and dst sentences
training_set = dataset(Train, iwslt2016)
dev_set = dataset(Dev, iwslt2016)
test_set = dataset(Test, iwslt2016) # usually a test set won't contain ground truth, but iwslt2016 does
# get a batch of data
batch_size = 1
src_sent, dst_sent = get_batch(training_set, batch_size) # each one is a vector of sentences
Once you have run through all the data, get_batch will return an empty vector; you can then recreate the dataset object to iterate over the data again.
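Putting that together, here is a minimal sketch of consuming one epoch of the training set, assuming get_batch behaves as described above (an empty result once the data is exhausted, otherwise destructurable into src/dst sentence vectors); model and train_on_batch! are hypothetical placeholders for whatever per-batch update you run:

# hypothetical loop over one epoch of the IWSLT2016 training set
training_set = dataset(Train, iwslt2016)
while true
    batch = get_batch(training_set, batch_size)
    isempty(batch) && break                 # all data consumed
    src_sent, dst_sent = batch              # vectors of src / dst sentences
    train_on_batch!(model, vocab, src_sent, dst_sent)  # hypothetical per-batch update
end
# recreate the dataset object to start another epoch
training_set = dataset(Train, iwslt2016)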
The above example fails with the following error message:
┌ Info: Downloading
│ source = "https://wit3.fbk.eu/archive/2016-01//texts/en/de/en-de.tgz"
│ dest = "/home/markus/.julia/datadeps/IWSLT2016 en-de/en-de.tgz"
│ progress = NaN
│ time_taken = "0.05 s"
│ time_remaining = "NaN s"
│ average_speed = "2.141 MiB/s"
│ downloaded = "105.240 KiB"
│ remaining = "∞ B"
└ total = "∞ B"
ERROR: LoadError: HTTP.ExceptionRequest.StatusError(404, "GET", "/archive/2016-01//texts/en/de/en-de.tgz", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=utf-8
X-Frame-Options: DENY
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Sun, 13 Feb 2022 09:22:44 GMT
Cross-Origin-Opener-Policy: unsafe-none
Content-Security-Policy: base-uri 'self';object-src 'none';report-uri /_/view/cspreport;script-src 'nonce-iuQ6rOqUw/NwssS2azzWNQ' 'unsafe-inline' 'unsafe-eval';worker-src 'self';frame-ancestors https://google-admin.corp.google.com/
Referrer-Policy: origin
Server: ESF
X-XSS-Protection: 0
X-Content-Type-Options: nosniff
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked
Looking at the IWSLT website, it seems that the datasets have moved to Google Drive instead.
It looks like they no longer provide file links for specific translation pairs, so we would need to rewrite the datadeps based on that.
I thought I could fix this quickly by changing the download link and adapting the post_fetch_method to search for the translation pairs in the right subfolder, but it seems like DataDeps.jl doesn't support downloading from Google Drive (or maybe I did something wrong). From a quick glance at DataDeps.jl, I found an issue discussing this topic.
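For reference, the DataDeps.jl registration pattern I was trying to adapt looks roughly like the following; the URL, checksum, and the clean-up step are placeholders (the real data now lives behind a Google Drive link, which is exactly what the default fetch can't handle):

using DataDeps

# hypothetical re-registration of the IWSLT2016 datadep; URL and checksum are placeholders
register(DataDep(
    "IWSLT2016 en-de",
    "IWSLT 2016 en-de translation data (placeholder description)",
    "https://example.com/path/to/2016-01.tgz",  # placeholder URL
    "0000000000000000000000000000000000000000000000000000000000000000";  # placeholder sha256
    post_fetch_method = function (file)
        unpack(file)  # DataDeps.unpack extracts the downloaded archive
        # hypothetical step: keep only the en-de subfolder of the extracted archive
    end,
))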
@maj0e Moved the issue to #85.
Hi - I want to play around with some language translation tasks and saw that you've got Transformers.Datasets.IWSLT.IWSLT2016. How do I interact with this to get data that I can train a model on? I couldn't find anything in the documentation to help me out.