I obtain the following results on the validation set:
sts-b_valid_prs : 89.1727
sts-b_valid_spr : 89.1524
but this is for a model trained on English exclusively that we will release soon, not the MLM + TLM one.
The released MLM+TLM model should be much better than 30% though. What hyper-parameters do you use for fine-tuning?
Thanks for the quick reply! The hyperparameter settings are the ones suggested in this documentation. My exact implementation is as follows:
python glue-xnli.py
--exp_name test_xnli_mlm_tlm # experiment name
--dump_path ./dumped/ # where to store the experiment
--model_path mlm_tlm_xnli15_1024.pth # model location
--data_path ./data/processed/XLM15 # data location
--transfer_tasks STS-B # transfer tasks (XNLI or GLUE tasks)
--optimizer adam,lr=0.000005 # optimizer
--batch_size 8 # batch size
--n_epochs 250 # number of epochs
--epoch_size 20000 # number of sentences per epoch
--max_len 256 # max number of words in sentences
--max_vocab 95000 # max number of words in vocab
I also tried using LR = 1e-4, 2e-5 but the score always converged to ~20-25%.
@glample By replacing the MLM+TLM model (mlm_tlm_xnli15_1024.pth) with the English-German MLM model (mlm_ende_1024.pth), I am able to get a score of around sts-b_valid_prs : 70%. I have also tried BERT (which is nearly the same as MLM on English alone) and was able to get sts-b_valid_prs : 88%.
Maybe the multi-language MLM training is somehow affecting the model's ability to understand the semantics of one particular language. I would like to hear your opinions @glample @aconneau
I don't think that another language can have such an effect. Probably the domain is the main reason. We also tried to train an English-only BERT on news (mlm_ende_1024.pth is trained on NewsCrawl) and the performance on the GLUE tasks was clearly lower than when training with Wikipedia + Toronto book corpus (I forgot by how much, but it was really significant). So I'm not surprised by all these results. I am a bit surprised by the score you get for mlm_tlm_xnli15_1024.pth though. I guess that with 15 languages the cross-lingual aspect can start becoming detrimental, but again I'm pretty sure the domain remains the most important issue.
@glample Can you mention the dataset used to train the English-exclusive MLM model (the one that achieves sts-b_valid_prs : 89.1727)?
It was Wikipedia + Toronto book corpus, like in the original BERT paper.
@glample Oh okay. Now it makes sense. Thanks for the clarification.
If Wikipedia + Toronto book corpus gives better performance on GLUE tasks, then why was the MLM+TLM model trained on the NewsCrawl corpus (for the English MLM task)? I am just asking this to avoid training an XLM from scratch (in case Wikipedia + Toronto book corpus does not work well in the MLM+TLM setting along with 14 other languages). My downstream task needs a cross-lingual language model that can capture high-level similarity in sentences across languages.
So the MLM+TLM was also trained on Wiki + Toronto book corpus for the monolingual English side. But for the other languages we had to find something else. We used Wikipedia for all languages, and for TLM (English-XX) we used parallel data from other domains: a mix of all the parallel data we could find, like OpenSubtitles, Europarl, MultiUN, etc., which altered the proportion of Wiki+TBC used to train the English side.
@glample So, if I understood correctly, can I expect a pure MLM model trained on Wiki alone for 14 languages and Wiki+toronto book for English to provide a considerable improvement over XLM+MLM model for cross-lingual sentence similarity tasks?
Over MLM + TLM you mean? Yes, I think it should be quite better. I'm very surprised by the 30% though, it seems surprisingly low. Will try to have a look soon.
@glample Since the parallel corpus used for the TLM training might result in domain issues for downstream tasks, I was wondering if training with MLM objective alone on Wiki of all the 15 languages (no parallel data) might be better for downstream language understanding tasks (monolingual or cross-lingual).
Hey, will you be releasing the MLM (Wiki 15 languages) model? I am eager to try it for STS-B task. This way we can know for sure if the problem with the performance on STS-B is a result of a domain issue or cross-lingual training.
I just uploaded the MLM only model for the 15 languages: https://github.com/facebookresearch/XLM#pretrained-models Can you try to see if this works better for you than the MLM + TLM?
@glample The MLM model for 15 languages gives sts-b_valid_prs : 50%.
That's interesting. I guess the domain is more important than I thought. I remember it was important to add the "Toronto book corpus" in English, maybe this is what is missing for the other languages. Or maybe the cross-lingual aspect is more detrimental than expected.
@glample Is there any empirical result that determines the quality of embeddings generated by the XLM model without estimation of additional parameters or finetuning? Like maybe some degree of correlation with a human annotated score or a standard evaluation metric.
I tried to check the performance of XLM for evaluating text generation by following the steps described in BertScore (https://arxiv.org/abs/1904.09675). Strangely, human annotated scores appear to be totally uncorrelated with XLM embeddings.
Pearson correlation coefficient (measures linear relationship) : -0.125
Spearman's rank-order correlation (measures monotonic relationship) : -0.129
The success of XLM at various tasks like XNLI, UNMT, etc might mean that there must exist some kind of non-linear relationship between human scores and XLM embeddings. What is your opinion on this and is there a way to use XLM as a scoring metric?
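For reference, the kind of correlation check described above can be sketched as follows. This is a minimal illustration, not code from the repo: the embedding arrays, score arrays, and function names are placeholders, assuming sentence embeddings from the pretrained encoder and the matching human annotations have already been collected.

import numpy as np
from scipy.stats import pearsonr, spearmanr

def embedding_scores(cand_embs, ref_embs):
    # one row per sentence pair; score each pair by cosine similarity
    cand = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    ref = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    return (cand * ref).sum(axis=1)

def correlation_with_humans(model_scores, human_scores):
    prs = pearsonr(model_scores, human_scores)[0]   # linear relationship
    spr = spearmanr(model_scores, human_scores)[0]  # monotonic relationship
    return prs, spr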
@Akella17 : There must be a bug in your fine-tuning. I ran the evaluation of XLM_15 for MLM and MLM+TLM on STS-B and I obtained 86.4/86.9 (pearson/spearman) for the latter (mBERT is at 86%, although it has more languages).
Your result might be due to a preprocessing issue? I would like to understand how you obtained that result as it might be related to a bug on the github.
@aconneau @glample My exact implementation is as follows:
python glue-xnli.py
--exp_name test_xnli_mlm_tlm # experiment name
--dump_path ./dumped/ # where to store the experiment
--model_path mlm_tlm_xnli15_1024.pth # model location
--data_path ./data/processed/XLM15 # data location
--transfer_tasks STS-B # transfer tasks (XNLI or GLUE tasks)
--optimizer adam,lr=0.000005 # optimizer
--batch_size 8 # batch size
--n_epochs 250 # number of epochs
--epoch_size 20000 # number of sentences per epoch
--max_len 256 # max number of words in sentences
--max_vocab 95000 # max number of words in vocab
I also tried using LR = 1e-4, 2e-5 but the score always converged to ~20-25%.
@Akella17 this is not expected. Can you send us your training log so we can have a look?
Thanks @Akella17 for the feedback, there seems to be a problem indeed, I will look at it and see if I find a problem. In the meantime, as Guillaume mentioned, could you provide us with your log? Thanks
@aconneau @glample Sorry for the late reply. There must have been some problem with the older version of the repository that led to the bad STS-B score. The latest version of the repository works fine on the STS-B task.
My target downstream task was to reproduce the BertScore paper on XLM15. While BertScore does n-gram matching over the output embeddings of the two sentences (mt_output and reference), I wanted to solve the same problem with regression (like STS-B) and finetuning. I was able to match the BERT n-gram score (mt_output and reference) using XLM15, but the regression score stayed nearly uncorrelated with the human annotated scores. Also, the cross-lingual (source and mt_output) n-gram score is not great either (this, however, can be a separate issue).
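For context, here is a minimal sketch of the greedy soft-matching idea from the BertScore paper, applied to pre-computed contextual token embeddings. The function and tensor names are illustrative assumptions, not code from the XLM repo.

import torch
import torch.nn.functional as F

def greedy_match_f1(cand_emb, ref_emb):
    """
    BertScore-style greedy soft matching between two sentences.
    cand_emb: (n_cand_tokens, dim) contextual token embeddings of the MT output
    ref_emb:  (n_ref_tokens, dim) contextual token embeddings of the reference
    """
    cand = F.normalize(cand_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)
    sim = cand @ ref.t()                   # (n_cand, n_ref) pairwise cosine similarities
    precision = sim.max(dim=1)[0].mean()   # best reference match for each candidate token
    recall = sim.max(dim=0)[0].mean()      # best candidate match for each reference token
    return 2 * precision * recall / (precision + recall)

The published BertScore additionally applies IDF weighting to the token matches, which is omitted in this sketch.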
My regression code is a slight modification of the glue-xnli.py and src/evaluation/glue.py files:
Main file:
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

import os
import argparse

from src.utils import bool_flag, initialize_exp
from src.evaluation.Metric import Metric
from src.model.embedder import SentenceEmbedder


# parse parameters
parser = argparse.ArgumentParser(description='Train on GLUE or XNLI')

# main parameters
parser.add_argument("--exp_name", type=str, default="test_Metric_task",
                    help="Experiment name")
parser.add_argument("--dump_path", type=str, default="./dumped/",
                    help="Experiment dump path")
parser.add_argument("--exp_id", type=str, default="",
                    help="Experiment ID")
parser.add_argument("--ref_flag", type=bool_flag, default=False,
                    help="Whether to train on reference or source sentences")

# float16
parser.add_argument("--fp16", type=bool_flag, default=False,
                    help="Run model with float16")

# evaluation task / pretrained model
parser.add_argument("--GPU_ids", type=str, default="0",
                    help="Which GPUs to run the code on")
parser.add_argument("--transfer_tasks", type=str, default="de-en",
                    help="Transfer tasks, example: 'MNLI-m,RTE,XNLI' ")
parser.add_argument("--model_path", type=str, default="mlm_en_2048.pth",
                    help="Model location")

# data
parser.add_argument("--data_path", type=str, default="",
                    help="Data path")
parser.add_argument("--max_vocab", type=int, default=95000,
                    help="Maximum vocabulary size (-1 to disable)")
parser.add_argument("--min_count", type=int, default=0,
                    help="Minimum vocabulary count")

# batch parameters
parser.add_argument("--max_len", type=int, default=256,
                    help="Maximum length of sentences (after BPE)")
parser.add_argument("--group_by_size", type=bool_flag, default=False,
                    help="Sort sentences by size during the training")
parser.add_argument("--batch_size", type=int, default=8,
                    help="Number of sentences per batch")
parser.add_argument("--max_batch_size", type=int, default=0,
                    help="Maximum number of sentences per batch (used in combination with tokens_per_batch, 0 to disable)")
parser.add_argument("--tokens_per_batch", type=int, default=-1,
                    help="Number of tokens per batch")

# model / optimization
parser.add_argument("--finetune_layers", type=str, default='0:_1',
                    help="Layers to finetune. 0 = embeddings, _1 = last encoder layer")
parser.add_argument("--weighted_training", type=bool_flag, default=False,
                    help="Use a weighted loss during training")
parser.add_argument("--dropout", type=float, default=0,
                    help="Fine-tuning dropout")
parser.add_argument("--optimizer", type=str, default="adam,lr=0.000005",
                    help="Optimizer")
parser.add_argument("--n_epochs", type=int, default=250,
                    help="Maximum number of epochs")
parser.add_argument("--epoch_size", type=int, default=20000,
                    help="Epoch size (-1 for full pass over the dataset)")

# debug
parser.add_argument("--debug_train", type=bool_flag, default=False,
                    help="Use valid sets for train sets (faster loading)")
parser.add_argument("--debug_slurm", type=bool_flag, default=False,
                    help="Debug multi-GPU / multi-node within a SLURM job")

# parse parameters
params = parser.parse_args()

# select GPUs before any CUDA initialization
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = params.GPU_ids

params.data_path = './data/processed/' + params.data_path
if params.tokens_per_batch > -1:
    params.group_by_size = True

# check parameters
assert os.path.isdir(params.data_path)
assert os.path.isfile(params.model_path)

# tasks
params.transfer_tasks = params.transfer_tasks.split(',')
assert len(params.transfer_tasks) > 0

# reload pretrained model
embedder = SentenceEmbedder.reload(params.model_path, params)

# reload langs from pretrained model
params.n_langs = embedder.pretrain_params['n_langs']
params.id2lang = embedder.pretrain_params['id2lang']
params.lang2id = embedder.pretrain_params['lang2id']

# initialize the experiment / build sentence embedder
logger = initialize_exp(params)
scores = {}

# prepare trainers / evaluators
# glue = GLUE(embedder, scores, params)
# xnli = XNLI(embedder, scores, params)
metric = Metric(embedder, scores, params)

# run
for task in params.transfer_tasks:
    # if task in GLUE_TASKS:
    #     glue.run(task)
    # if task in XNLI_TASKS:
    #     xnli.run()
    metric.run(task)
Evaluation file:
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

from logging import getLogger
import os
import copy
import time
import json
from collections import OrderedDict

import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import f1_score, matthews_corrcoef

from src.fp16 import network_to_half
from apex.fp16_utils import FP16_Optimizer

from ..utils import get_optimizer, concat_batches, truncate, to_cuda
from ..data.dataset import Dataset, ParallelDataset
from ..data.loader import load_binarized, set_dico_parameters


logger = getLogger()


class Metric:

    def __init__(self, embedder, scores, params):
        """
        Initialize the metric trainer / evaluator.
        Initial `embedder` should be on CPU to save memory.
        """
        self._embedder = embedder
        self.params = params
        self.scores = scores

    def get_iterator(self, splt):
        """
        Build data iterator.
        """
        return self.data[splt]['x'].get_iterator(
            shuffle=(splt == 'train'),
            return_indices=True,
            group_by_size=self.params.group_by_size
        )

    def run(self, task):
        """
        Run training / evaluation for one language pair.
        """
        params = self.params

        # task parameters
        self.task = task
        params.out_features = 1

        # load data
        self.data = self.load_data(task)
        if not self.data['dico'] == self._embedder.dico:
            raise Exception(("Dictionary in evaluation data (%i words) seems different than the one " +
                             "in the pretrained model (%i words). Please verify you used the same dictionary, " +
                             "and the same values for max_vocab and min_count.") % (len(self.data['dico']), len(self._embedder.dico)))

        # embedder
        self.embedder = copy.deepcopy(self._embedder)
        self.embedder.cuda()

        # projection layer
        self.proj = nn.Sequential(*[
            nn.Dropout(params.dropout),
            nn.Linear(self.embedder.out_dim, params.out_features)
        ]).cuda()

        # float16
        if params.fp16:
            assert torch.backends.cudnn.enabled
            self.embedder.model = network_to_half(self.embedder.model)
            self.proj = network_to_half(self.proj)

        # optimizer
        self.optimizer = get_optimizer(
            list(self.embedder.get_parameters(params.finetune_layers)) +
            list(self.proj.parameters()),
            params.optimizer
        )
        if params.fp16:
            self.optimizer = FP16_Optimizer(self.optimizer, dynamic_loss_scale=True)

        # train and evaluate the model
        for epoch in range(params.n_epochs):

            # update epoch
            self.epoch = epoch

            # training
            logger.info("Metric Task - %s - Training epoch %i ..." % (task, epoch))
            self.train()

            # evaluation
            logger.info("Metric Task - %s - Evaluating epoch %i ..." % (task, epoch))
            with torch.no_grad():
                scores = self.eval('valid')
            self.scores.update(scores)

    def train(self):
        """
        Finetune for one epoch on the training set.
        """
        params = self.params
        self.embedder.train()
        self.proj.train()

        # training variables
        losses = []
        ns = 0  # number of sentences
        nw = 0  # number of words
        t = time.time()
        iterator = self.get_iterator('train')
        lang_id_1 = params.lang2id[self.lang1]
        lang_id_2 = params.lang2id[self.lang2]

        while True:

            # batch
            try:
                batch = next(iterator)
            except StopIteration:
                break
            (sent1, len1), (sent2, len2), idx = batch
            sent1, len1 = truncate(sent1, len1, params.max_len, params.eos_index)
            sent2, len2 = truncate(sent2, len2, params.max_len, params.eos_index)
            x, lengths, positions, langs = concat_batches(sent1, len1, lang_id_1, sent2, len2, lang_id_2, params.pad_index, params.eos_index, reset_positions=False)
            y = self.data['train']['y'][idx]
            bs = len(lengths)

            # cuda
            x, y, lengths, positions, langs = to_cuda(x, y, lengths, positions, langs)

            # loss
            if params.ref_flag:
                output = self.proj(self.embedder.get_embeddings(x, lengths, positions))
            else:
                output = self.proj(self.embedder.get_embeddings(x, lengths, positions, langs))
            if params.fp16:
                loss = F.mse_loss(output.squeeze(1), y.type(torch.float16))
            else:
                loss = F.mse_loss(output.squeeze(1), y.float())

            # backward / optimization
            self.optimizer.zero_grad()
            if params.fp16:
                self.optimizer.backward(loss)
            else:
                loss.backward()
            self.optimizer.step()

            # update statistics
            ns += bs
            nw += lengths.sum().item()
            losses.append(loss.item())

            # log
            if ns != 0 and ns % (10 * bs) < bs:
                logger.info(
                    "Metric Task - %s - Epoch %s - Train iter %7i - %.1f words/s - %s Loss: %.4f"
                    % (self.task, self.epoch, ns, nw / (time.time() - t), 'MSE', sum(losses) / len(losses))
                )
                nw, t = 0, time.time()
                losses = []

            # epoch size
            if params.epoch_size != -1 and ns >= params.epoch_size:
                break

    def eval(self, splt='valid'):
        """
        Evaluate on the validation set.
        """
        params = self.params
        self.embedder.eval()
        self.proj.eval()

        scores = OrderedDict({'epoch': self.epoch})
        task = self.task.lower()

        pred = []  # predicted values
        gold = []  # real values
        lang_id_1 = params.lang2id[self.lang1]
        lang_id_2 = params.lang2id[self.lang2]

        for batch in self.get_iterator(splt):

            # batch
            (sent1, len1), (sent2, len2), idx = batch
            # sent1, len1 = truncate(sent1, len1, params.max_len, params.eos_index)
            # sent2, len2 = truncate(sent2, len2, params.max_len, params.eos_index)
            x, lengths, positions, langs = concat_batches(sent1, len1, lang_id_1, sent2, len2, lang_id_2, params.pad_index, params.eos_index, reset_positions=False)
            y = self.data[splt]['y'][idx]

            # cuda
            x, y, lengths, positions, langs = to_cuda(x, y, lengths, positions, langs)

            # prediction
            if params.ref_flag:
                output = self.proj(self.embedder.get_embeddings(x, lengths, positions))
            else:
                output = self.proj(self.embedder.get_embeddings(x, lengths, positions, langs))
            p = output.squeeze(1)
            pred.append(p.cpu().numpy())
            gold.append(y.cpu().numpy())

        gold = np.concatenate(gold)
        pred = np.concatenate(pred)

        scores['%s_valid_prs' % task] = 100. * pearsonr(pred, gold)[0]
        scores['%s_valid_spr' % task] = 100. * spearmanr(pred, gold)[0]

        logger.info("__log__:%s (percentage not fraction)" % json.dumps(scores))
        return scores

    def load_data(self, task):
        """
        Load pair regression/classification bi-sentence tasks.
        """
        params = self.params
        if params.ref_flag:
            self.lang1 = 'en'
            self.lang2 = task.split('-')[1]
        else:
            self.lang1 = task.split('-')[0]
            self.lang2 = task.split('-')[1]
        data = {splt: {} for splt in ['train', 'valid', 'test']}
        dpath = os.path.join(params.data_path, task)

        for splt in ['train', 'valid', 'test']:

            # load data and dictionary
            if params.ref_flag:
                data1 = load_binarized(os.path.join(dpath, '%s.rf.%s.pth' % (task, splt)), params)
                data2 = load_binarized(os.path.join(dpath, '%s.%s.%s.pth' % (task, self.lang2, splt)), params)
            else:
                data1 = load_binarized(os.path.join(dpath, '%s.%s.%s.pth' % (task, self.lang1, splt)), params)
                data2 = load_binarized(os.path.join(dpath, '%s.%s.%s.pth' % (task, self.lang2, splt)), params)
            assert data1['dico'] == data2['dico']
            data['dico'] = data.get('dico', data1['dico'])

            # set dictionary parameters
            set_dico_parameters(params, data, data1['dico'])
            set_dico_parameters(params, data, data2['dico'])

            # create dataset
            data[splt]['x'] = ParallelDataset(
                data1['sentences'], data1['positions'],
                data2['sentences'], data2['positions'],
                params
            )

            # load labels
            if splt != 'test':
                # read labels from file
                with open(os.path.join(dpath, '%s.label.%s' % (task, splt)), 'r') as f:
                    lines = [l.rstrip() for l in f]
                assert all(-2 <= float(x) <= 2 for x in lines)
                y = [float(l) for l in lines]
                # note: casting to LongTensor truncates the float regression targets to integers
                data[splt]['y'] = torch.LongTensor(y)
                assert len(data[splt]['x']) == len(data[splt]['y'])

        # compute weights for weighted training
        self.weights = None

        return data
@aconneau @glample The code for this issue can be found here. Run metric_nmt.py to train the cross-lingual regression model.
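For anyone following along, an invocation in the style of the glue-xnli.py command above might look like the following. The flags come from the argparse section of the main file; the model path and data folder name are placeholders.
python metric_nmt.py
--exp_name test_metric_de_en              # experiment name
--dump_path ./dumped/                     # where to store the experiment
--model_path mlm_tlm_xnli15_1024.pth      # model location
--data_path WMT_metric                    # placeholder folder, prefixed with ./data/processed/ by the script
--transfer_tasks de-en                    # language pair(s), each parsed as lang1-lang2
--ref_flag False                          # False: <source, mt_output> pairs, True: <reference, mt_output> pairs
--optimizer adam,lr=0.000005              # optimizer
--batch_size 8                            # batch size
--n_epochs 250                            # number of epochs
--epoch_size 20000                        # number of sentences per epoch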
@Akella17 : Is your issue with STS-B fixed?
@aconneau Yes. The issue with STS-B does not exist in the latest version of the XLM repo. Since my task is to perform regression finetuning on human-annotated mt_scores, I made minimal modifications to the STS-B code in glue-xnli.py and src/evaluation/glue.py, and renamed them metric_nmt.py and src/evaluation/Metric.py respectively to fit my requirements.
Training data consists of <source, machine output, reference, score labels> tuples. When I tried regression finetuning (same as STS-B) using BERT (input: <machine output, reference>, target: score labels), I get a Pearson correlation of ~60% on the English-German WMT metric task. However, when I tried using the above-mentioned modified XLM code, I get a Pearson correlation of <15%. This result is consistent for both the (input: <machine output, reference>, target: score labels) and (input: <machine output, source>, target: score labels) configurations.
Great if the problem of the STS-B task is solved. Not sure if you're expecting something, I'm not going to have time to review your code above. I'm closing the issue but feel free to re-open if needed.
The pretrained MLM+TLM model achieves < 30% Pearson correlation with human scores. With CBOW at 60% and BERT at 86%, this score seems low for the STS-B task.
I am not sure if there is a mistake in my implementation or if the MLM+TLM model does not work for the STS-B task. Can someone confirm this? @aconneau @glample