seekingpeace opened this issue 4 years ago
1. PairedTextData directly reads from files. Please take a look at its hparams, where you can set file paths for both the source-side and target-side datasets.

2. (and 3.) I'm not sure what you want to achieve here. My guess is that you want to use the BERT and GPT2 tokenizers to tokenize the source and target datasets respectively, is this correct? PairedTextData has a number of processing options built in, such as the delimiter for tokenization ("delimiter") and the maximum sentence length ("max_seq_length"). If you want to do additional processing, you have two options:
   - add transformation functions via "other_transformations";
   - subclass PairedTextData and override the process function. Only do this if you understand how PairedTextData works.

However, in your case, I think the easier way would be to write the data loader yourself, since the tokenizers directly convert untokenized strings to token IDs; a rough sketch is below. @gpengzhi can you help write an example for this use case?
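Something like this (untested; it assumes texar-pytorch, i.e. texar.torch, and that the tokenizers expose map_text_to_id in addition to the map_text_to_token method shown later in this thread; adjust the method name to whatever your version provides):

```python
# Rough sketch only: a hand-rolled paired dataset built directly on the
# pretrained tokenizers, so no vocab files or special-token hparams are needed.
# `map_text_to_id` is assumed here; the thread only demonstrates `map_text_to_token`.
import texar.torch as tx
from torch.utils.data import Dataset

class PairedTokenizedDataset(Dataset):
    def __init__(self, src_file, tgt_file, max_seq_length=40):
        with open(src_file, encoding='utf-8') as f:
            self.src_lines = f.read().splitlines()
        with open(tgt_file, encoding='utf-8') as f:
            self.tgt_lines = f.read().splitlines()
        assert len(self.src_lines) == len(self.tgt_lines)
        self.src_tokenizer = tx.data.BERTTokenizer(
            pretrained_model_name='bert-base-uncased')
        self.tgt_tokenizer = tx.data.GPT2Tokenizer(
            pretrained_model_name='gpt2-small')
        self.max_seq_length = max_seq_length

    def __len__(self):
        return len(self.src_lines)

    def __getitem__(self, idx):
        # The tokenizers map raw strings straight to token IDs.
        src_ids = self.src_tokenizer.map_text_to_id(self.src_lines[idx])
        tgt_ids = self.tgt_tokenizer.map_text_to_id(self.tgt_lines[idx])
        return {'src_ids': src_ids[:self.max_seq_length],
                'tgt_ids': tgt_ids[:self.max_seq_length]}

# Padding/collation is left to a custom collate_fn, e.g. via
# torch.utils.data.DataLoader(dataset, batch_size=40, collate_fn=...).
```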
Thanks @huzecong for the reply. To make it work, I made separate vocab files for both GPT2 and BERT, stored them as text files, and passed part of each tokenizer as a transformation, as shown below:
```python
import texar.torch as tx  # assuming texar-pytorch; use `import texar as tx` for the TF version

# Pretrained tokenizers, used only for `map_text_to_token` and plugged into
# `other_transformations` below.
tokenizer_gpt2 = tx.data.GPT2Tokenizer(
    pretrained_model_name='gpt2-small')
tokenizer_bert = tx.data.BERTTokenizer(
    pretrained_model_name='bert-base-uncased')

def token_transform_bert(arr):
    arr_str = ' '.join(arr)
    ret_arr = tokenizer_bert.map_text_to_token(arr_str)
    return ret_arr

def token_transform_gpt2(arr):
    arr_str = ' '.join(arr)
    ret_arr = tokenizer_gpt2.map_text_to_token(arr_str)
    return ret_arr

data_hparams = {
    'train': {
        'source_dataset': {
            'files': 'exp/train_src.txt', 'vocab_file': 'exp/bert_vocab.txt',
            'max_seq_length': 40, 'bos_token': '[CLS]', 'eos_token': '[SEP]',
            'other_transformations': [token_transform_bert]},
        'target_dataset': {
            'files': 'exp/train_tgt.txt', 'vocab_file': 'exp/gpt2_vocab.txt',
            'max_seq_length': 40, 'bos_token': '<|endoftext|>',
            'eos_token': '<|endoftext|>',
            'other_transformations': [token_transform_gpt2]},
        'batch_size': 40,
        'allow_smaller_final_batch': True,
        'shuffle': True,
        'num_parallel_calls': 3
    },
    'test': {
        'source_dataset': {
            'files': 'exp/test_src.txt', 'vocab_file': 'exp/bert_vocab.txt',
            'max_seq_length': 40, 'bos_token': '[CLS]', 'eos_token': '[SEP]',
            'other_transformations': [token_transform_bert]},
        'target_dataset': {
            'files': 'exp/test_tgt.txt', 'vocab_file': 'exp/gpt2_vocab.txt',
            'max_seq_length': 40, 'bos_token': '<|endoftext|>',
            'eos_token': '<|endoftext|>',
            'other_transformations': [token_transform_gpt2]},
        'batch_size': 12
    },
    'valid': {
        'source_dataset': {
            'files': 'exp/valid_src.txt', 'vocab_file': 'exp/bert_vocab.txt',
            'max_seq_length': 40, 'bos_token': '[CLS]', 'eos_token': '[SEP]',
            'other_transformations': [token_transform_bert]},
        'target_dataset': {
            'files': 'exp/valid_tgt.txt', 'vocab_file': 'exp/gpt2_vocab.txt',
            'max_seq_length': 40, 'bos_token': '<|endoftext|>',
            'eos_token': '<|endoftext|>',
            'other_transformations': [token_transform_gpt2]},
        'batch_size': 12
    }
}
```
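(For reference, these hparams are consumed roughly as in the sketch below, assuming texar-pytorch; the batch field names are my understanding of PairedTextData's output and should be double-checked against the docs.)

```python
import torch
import texar.torch as tx

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_data = tx.data.PairedTextData(hparams=data_hparams['train'], device=device)
iterator = tx.data.DataIterator(train_data)

for batch in iterator:
    # Field names per my reading of PairedTextData; verify in your version.
    src_ids = batch['source_text_ids']
    tgt_ids = batch['target_text_ids']
```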
After this, an exception was raised saying that these special tokens already exist in the vocab, so I had to remove that check from the vocabulary.py class. I also monkey-patched paired_text_data.py, since there was no way to pass the pad and unk tokens to PairedTextData:
```python
self._src_vocab = Vocab(src_hparams.vocab_file,
                        bos_token=src_hparams.bos_token,
                        eos_token=src_hparams.eos_token,
                        pad_token='[PAD]',
                        unk_token='[UNK]')
self._tgt_vocab = Vocab(tgt_hparams["vocab_file"],
                        bos_token=tgt_bos_token,
                        eos_token=tgt_eos_token,
                        pad_token='<|endoftext|>',
                        unk_token='<|endoftext|>')
```
I think that because of this line in vocabulary.py:

```python
vocab = [self._pad_token, self._bos_token, self._eos_token, self._unk_token] + vocab
```

my whole vocab will shift to the right (every pretrained token ID gets offset by the four prepended special tokens), giving incorrect results.

Thank you for your feedback! These are all valuable suggestions and I think we could add them. We're actually discussing the possibility of deprecating the Vocab class and switching everything to tokenizer-based processing, or at least providing interfaces in the data modules to use tokenizers instead of a vocab. @gpengzhi Could you try working on this after the holidays?
Yes, I think we should support this feature. Since the pretrained tokenizers already take care of the corresponding vocabulary files and special tokens, it is unnecessary to require a vocabulary file and special tokens when people use PairedTextData. I will think about this enhancement to our data module.
So, @gpengzhi @huzecong, ideally would there be a tokenizer builder class that can accept either a pretrained tokenizer or a new tokenizer, with the resulting source and target tokenizers passed to the data modules for processing the data? This could be a cool new feature and would make many things seamless.
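Something along these lines, where TokenizerBuilder and the src_tokenizer/tgt_tokenizer arguments are entirely made up just to show the shape of the idea:

```python
import texar.torch as tx

class TokenizerBuilder:
    """Hypothetical builder (name and API invented purely to illustrate the
    request): wraps either a pretrained tokenizer or a user-provided one
    behind a single interface."""

    def __init__(self, pretrained_model_name=None, custom_tokenizer=None):
        self._name = pretrained_model_name
        self._custom = custom_tokenizer

    def build(self):
        if self._custom is not None:
            return self._custom
        # Crude dispatch for the sketch; a real builder would cover all tokenizers.
        if self._name.startswith('gpt2'):
            return tx.data.GPT2Tokenizer(pretrained_model_name=self._name)
        return tx.data.BERTTokenizer(pretrained_model_name=self._name)

src_tokenizer = TokenizerBuilder('bert-base-uncased').build()
tgt_tokenizer = TokenizerBuilder('gpt2-small').build()

# The requested feature: data modules would accept the tokenizers directly,
# e.g. (hypothetical, not a real signature):
#   tx.data.PairedTextData(hparams=..., src_tokenizer=src_tokenizer,
#                          tgt_tokenizer=tgt_tokenizer)
# so no vocab_file / bos_token / eos_token hparams would be needed.
```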