chengchingwen / Transformers.jl

Julia Implementation of Transformer models
MIT License

Update IWSLT dataset link #86

Closed · maj0e closed this 2 years ago

maj0e commented 2 years ago

Fixes: #85

As mentioned in #72 and #85, the IWSLT2016 dataset moved to Google Drive and all language pairs are now in a single archive.

I've rewritten the post_fetch_method to extract the nested archives for the requested language pair.
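In short, the new logic has to unpack twice: once for the outer archive and once for the nested archive of the requested pair. A simplified sketch of the idea (the nested layout texts/<src>/<dst>/<src>-<dst>.tgz and the helper name below are illustrative; the actual implementation in the PR differs):

using DataDeps: unpack

# Simplified sketch: unpack the outer IWSLT2016 archive, then unpack the
# nested archive for the requested language pair. Paths are illustrative.
function iwslt2016_post_fetch(src::Symbol, dst::Symbol)
    pair = "$src-$dst"
    return function (archive)
        cd(dirname(archive)) do
            unpack(archive)                  # outer archive with all pairs
            inner = joinpath("texts", string(src), string(dst), "$pair.tgz")
            unpack(inner)                    # nested archive for this pair
        end
    end
end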

I also found a bug in an error message in the tunefile function: the "fr-en" pair was accidentally hardcoded there, which led to the "fr-en" language pair being downloaded whenever the error was triggered.
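The fix is simply to build the pair string from the arguments instead of hardcoding it, along these lines (illustrative only; the real tunefile helper has a different signature):

# Illustrative only; the real tunefile helper in Transformers.jl differs.
function tunefile_error(src::Symbol, ref::Symbol)
    pair = "$src-$ref"
    # before the fix: error("no tune file found for pair fr-en")
    error("no tune file found for pair $pair")   # after: the requested pair
end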

I tested the changes with the following script:

#test_iwslt.jl
using Transformers
using Transformers.Datasets # dataset utilities
using Transformers.Datasets: IWSLT # IWSLT datasets

# available languages for iwslt2016: :en, :cs, :ar, :fr, :de
src_lang = :de 
dst_lang = :en 

iwslt2016 = IWSLT.IWSLT2016(src_lang, dst_lang) # Create dataset

# get vocabulary from training data
vocab = get_vocab(iwslt2016)

# create dataset object
# each one is a 2-tuple of channels containing src sentence and dst sentence
training_set = dataset(Train, iwslt2016)
dev_set = dataset(Dev, iwslt2016)
test_set = dataset(Test, iwslt2016) # usually a test set won't contain ground truth, but iwslt2016 somehow does

batch_size = 1
src_sent, dst_sent = get_batch(training_set, batch_size) # each one is a vector of sentences

...and it works for the language pairs I've tested ("en-de", "de-en", "fr-en" and "en-fr").
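For reference, consuming the whole training set would look roughly like this; I'm assuming an empty result from get_batch signals that the channels are drained, which I haven't verified:

# Rough continuation of the script above; assumes an empty batch means the
# channels are drained (not verified).
while true
    batch = get_batch(training_set, batch_size)
    isempty(batch) && break
    src_sent, dst_sent = batch
    # preprocess with `vocab` and feed to a model here
end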

Regards, maj0e

chengchingwen commented 2 years ago

Looks great! Thanks!