As mentioned in #72 and #85, the IWSLT2016 dataset moved to Google Drive, and all language pairs are now in a single archive.
I've rewritten the post_fetch_method to extract the nested archives for the requested language pair.
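For context, the nested extraction works roughly like this (a minimal sketch, not the actual implementation; the directory layout and helper name are assumptions based on how the combined archive is organized):

```julia
# Sketch: extract the per-pair archive nested inside the combined
# IWSLT2016 download. Paths below are illustrative assumptions.
function extract_language_pair(outer_archive::AbstractString,
                               src::Symbol, dst::Symbol,
                               dest_dir::AbstractString)
    pair = "$(src)-$(dst)"                     # e.g. "de-en"
    mkpath(dest_dir)
    # Unpack the outer archive, which contains one .tgz per language pair.
    run(`tar -xzf $outer_archive -C $dest_dir`)
    # Locate and unpack only the requested pair's nested archive
    # (assumed layout: texts/<src>/<dst>/<src>-<dst>.tgz).
    inner = joinpath(dest_dir, "texts", String(src), String(dst), "$pair.tgz")
    run(`tar -xzf $inner -C $dest_dir`)
    return joinpath(dest_dir, pair)
end
```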
I also found a bug in an error message in the tunefile function: the "fr-en" language pair was accidentally hardcoded there, which led to the "fr-en" pair being downloaded whenever the error was triggered.
I tested the changes with the following script:
```julia
# test_iwslt.jl
using Transformers
using Transformers.Datasets # utilities for datasets
using Transformers.Datasets: IWSLT # IWSLT datasets

# available languages for IWSLT2016: :en, :cs, :ar, :fr, :de
src_lang = :de
dst_lang = :en
iwslt2016 = IWSLT.IWSLT2016(src_lang, dst_lang) # create dataset

# get vocabulary from training data
vocab = get_vocab(iwslt2016)

# create dataset objects;
# each one is a 2-tuple of channels containing src and dst sentences
training_set = dataset(Train, iwslt2016)
dev_set = dataset(Dev, iwslt2016)
test_set = dataset(Test, iwslt2016) # test sets usually lack ground truth, but IWSLT2016 somehow has it

batch_size = 1
src_sent, dst_sent = get_batch(training_set, batch_size) # each one is a vector of sentences
```
...and it works for the language pairs I've tested ("en-de", "de-en", "fr-en", and "en-fr").
Fixes: #85
Regards, maj0e