JayYip / m3tl

BERT for Multitask Learning
https://jayyip.github.io/m3tl/
Apache License 2.0

How to prepare two sequences as input for bert-multitask-learning? #33

Open rudra0713 opened 4 years ago

rudra0713 commented 4 years ago

Hi, I have a dataset that involves two sequences, and the task is to classify the sequence pair. I am not sure how to prepare the input in this case. So far, I have been working with only one sequence, where I used the following format:

["Everyone", "should", "be", "happy", "."]

How do I extend this to two sequences? Do I have to insert a [SEP] token myself?

JayYip commented 4 years ago

Now you reminded me... Sorry, it's not implemented.

https://github.com/JayYip/bert-multitask-learning/blob/9fe97739194f801e539efbadbaaf97a9c945eaaa/bert_multitask_learning/create_generators.py#L47

JayYip commented 4 years ago

Sorry, I misread your question. You can prepare something like:

@preprocessing_fn
def proc_fn(params, mode):
    # Return (features, labels): each feature dict carries the two sequences
    # under 'a' and 'b'; the second list holds one label per example.
    return [{'a': ["Everyone", "should", "be", "happy", "."], 'b': ["you're", "right"]}], ['true']
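For context on what the downstream tokenization is expected to produce from such a pair, here is a plain-Python sketch (an illustration of the conventional [CLS] a [SEP] b [SEP] packing, not bert-multitask-learning's actual code; WordPiece splitting is omitted):

```python
from typing import Dict, List, Tuple


def pack_pair(example: Dict[str, List[str]]) -> Tuple[List[str], List[int]]:
    """Illustration of the conventional BERT pair packing; not the library's code."""
    tokens = ['[CLS]'] + example['a'] + ['[SEP]'] + example['b'] + ['[SEP]']
    # Segment (token type) ids: 0 covers [CLS], sequence a, and the first [SEP];
    # 1 covers sequence b and the final [SEP].
    segment_ids = [0] * (len(example['a']) + 2) + [1] * (len(example['b']) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair(
    {'a': ["Everyone", "should", "be", "happy", "."], 'b': ["you're", "right"]})
print(tokens)       # ['[CLS]', 'Everyone', 'should', 'be', 'happy', '.', '[SEP]', "you're", 'right', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```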
rudra0713 commented 4 years ago

I prepared two sequences following your format. Here's an example:

{'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'], 'b': ["you're", 'right']}

After printing tokens in the add_special_tokens_with_seqs function in utils.py, I got this:

tokens -> ['[CLS]', 'a', 'b', '[SEP]']

I was expecting 'a' and 'b' to be replaced by the original sequences. Is this okay? For a single-sequence task, when I printed tokens, I got the desired output:

tokens -> ['[CLS]', 'marriage', 'came', 'from', 'religion', '.', '[SEP]']

JayYip commented 4 years ago

Maybe it's a bug. Could you confirm that the example argument of create_single_problem_single_instance is a tuple like below?

({'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'], 'b': ["you're", 'right']}, 'some label')
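A quick standalone way to check that shape (plain Python, independent of the library):

```python
# Standalone check of the expected (features, label) tuple shape; not library code.
example = ({'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'],
            'b': ["you're", 'right']},
           'some label')

features, label = example
assert isinstance(features, dict) and set(features) == {'a', 'b'}
assert all(isinstance(tok, str) for seq in features.values() for tok in seq)
print('valid two-sequence instance:', features, '->', label)
```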
rudra0713 commented 4 years ago

After adding this print, here is what I found. If the mode in the preprocessing function is 'train' or 'eval', the output of example aligns with what you mentioned:

example (from the create_single_problem_single_instance function) -> ({'a': ['we', 'Should', 'be', 'optimistic', 'about', 'the', 'future', '.'], 'b': ['Anything', 'that', 'improves', 'rush', 'hour', 'traffic', 'ca', "n't", 'be', 'all', 'that', 'bad', '.']}, 0)

tokens (from the add_special_tokens_with_seqs function) -> ['[CLS]', 'we', 'should', 'be', 'op', '##timi', '##stic', 'about', 'the', 'future', '.', '[SEP]', 'anything', 'that', 'improve', '##s', 'rus', '##h', 'hour', 'traffic', 'ca', 'n', "##'", '##t', 'be', 'all', 'that', 'bad', '.', '[SEP]']

But when the mode is 'infer', right before the accuracies for the particular task are printed, there is no print of 'example', and the tokens become:

tokens -> ['[CLS]', 'a', 'b', '[SEP]']

Also, for the same dataset and the same split, I previously got 76% accuracy with a standalone BERT model, but in the multitask setting, for that same task alone, I am getting only 48.71% accuracy.

JayYip commented 4 years ago

> But when the mode is 'infer', right before the accuracies for the particular task are printed, there is no print of 'example', and the tokens become: tokens -> ['[CLS]', 'a', 'b', '[SEP]']

This is a bug. I'll fix it later.

> Also, for the same dataset and the same split, I previously got 76% accuracy with a standalone BERT model, but in the multitask setting, for that same task alone, I am getting only 48.71% accuracy.

That's weird. Maybe it's caused by another bug. Could you provide more info?

JayYip commented 4 years ago

Sorry, accidentally closed. Reopening now.