I obtain the following results on the validation set:
sts-b_valid_prs : 89.1727
sts-b_valid_spr : 89.1524
but this is for a model trained on English exclusively that we will release soon, not the MLM + TLM one.
The released MLM+TLM model should be much better than 30% though. What hyper-parameters do you use for fine-tuning?
Thanks for the quick reply! The hyperparameter settings are the ones suggested in this documentation. My exact implementation is as follows:
python glue-xnli.py
--exp_name test_xnli_mlm_tlm # experiment name
--dump_path ./dumped/ # where to store the experiment
--model_path mlm_tlm_xnli15_1024.pth # model location
--data_path ./data/processed/XLM15 # data location
--transfer_tasks STS-B # transfer tasks (XNLI or GLUE tasks)
--optimizer adam,lr=0.000005 # optimizer
--batch_size 8 # batch size
--n_epochs 250 # number of epochs
--epoch_size 20000 # number of sentences per epoch
--max_len 256 # max number of words in sentences
--max_vocab 95000 # max number of words in vocab
I also tried using LR = 1e-4, 2e-5 but the score always converged to ~20-25%.
@glample By replacing the MLM+TLM model (mlm_tlm_xnli15_1024.pth) with the English-German MLM model (mlm_ende_1024.pth), I am able to get a score of around sts-b_valid_prs : 70%. I have also tried BERT (which is nearly the same as MLM on English alone) and was able to get sts-b_valid_prs : 88%.
Maybe the multi-language MLM training is somehow affecting the model's ability to understand the semantics of one particular language. I would like to hear your opinions @glample @aconneau
I don't think that another language can have such an effect. Probably the domain is the main reason. We also tried to train an English-only BERT on news (mlm_ende_1024.pth is trained on NewsCrawl) and the performance on the GLUE tasks was clearly lower than when training with Wikipedia + Toronto book corpus (I forgot by how much, but it was really significant). So I'm not surprised by all these results. I am a bit surprised by the score you get for mlm_tlm_xnli15_1024.pth though. I guess that with 15 languages the cross-lingual aspect can start becoming detrimental, but again I'm pretty sure the domain remains the most important issue.
@glample Can you mention the dataset used to train the English-exclusive MLM model (the one that achieves sts-b_valid_prs : 89.1727)?
It was Wikipedia + Toronto book corpus, like in the original BERT paper.
@glample Oh okay. Now it makes sense. Thanks for the clarification.
If Wikipedia + Toronto book corpus gives better performance on GLUE tasks, then why was the MLM+TLM model trained on the NewsCrawl corpus (for the English MLM task)? I am just asking this to avoid training an XLM from scratch (in case Wikipedia + Toronto book corpus does not work well in the MLM+TLM setting along with 14 other languages). My downstream task needs a cross-lingual language model that can capture high-level similarity in sentences across languages.
So the MLM+TLM was also trained on Wiki + Toronto book corpus for the monolingual English side. But for the other languages we had to find something else. We used Wikipedia for all languages, and for TLM (English-XX) we used parallel data from other domains: a mix of all the parallel data we could find, like OpenSubtitles, Europarl, MultiUN, etc., which altered the proportion of Wiki+TBC used to train the English side.
@glample So, if I understood correctly, can I expect a pure MLM model trained on Wiki alone for 14 languages and Wiki+toronto book for English to provide a considerable improvement over XLM+MLM model for cross-lingual sentence similarity tasks?
Over MLM + TLM you mean? Yes, I think it should be quite better. I'm very surprised by the 30% though, it seems surprisingly low. Will try to have a look soon.
@glample Since the parallel corpus used for the TLM training might result in domain issues for downstream tasks, I was wondering if training with MLM objective alone on Wiki of all the 15 languages (no parallel data) might be better for downstream language understanding tasks (monolingual or cross-lingual).
Hey, will you be releasing the MLM (Wiki 15 languages) model? I am eager to try it for STS-B task. This way we can know for sure if the problem with the performance on STS-B is a result of a domain issue or cross-lingual training.
I just uploaded the MLM only model for the 15 languages: https://github.com/facebookresearch/XLM#pretrained-models Can you try to see if this works better for you than the MLM + TLM?
@glample The MLM model for 15 languages gives sts-b_valid_prs : 50%.
That's interesting. I guess the domain is more important than I thought. I remember it was important to add the "Toronto book corpus" in English, maybe this is what is missing for the other languages. Or maybe the cross-lingual aspect is more detrimental than expected.
@glample Is there any empirical result that determines the quality of embeddings generated by the XLM model without estimation of additional parameters or finetuning? Like maybe some degree of correlation with a human annotated score or a standard evaluation metric.
I tried to check the performance of XLM for evaluating text generation by following the steps described in BertScore (https://arxiv.org/abs/1904.09675). Strangely, human annotated scores appear to be totally uncorrelated with XLM embeddings.
Pearson correlation coefficient (measures linear relationship) : -0.125
Spearman's rank-order correlation (measures monotonic relationship) : -0.129
The success of XLM at various tasks like XNLI, UNMT, etc might mean that there must exist some kind of non-linear relationship between human scores and XLM embeddings. What is your opinion on this and is there a way to use XLM as a scoring metric?
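For reference, the kind of correlation check described above can be sketched as follows. This is a minimal illustration, not code from the repo: the embedding arrays, score arrays, and function names are placeholders, assuming sentence embeddings from the pretrained encoder and the matching human annotations have already been collected.

import numpy as np
from scipy.stats import pearsonr, spearmanr

def embedding_scores(cand_embs, ref_embs):
    # one row per sentence pair; score each pair by cosine similarity
    cand = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    ref = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    return (cand * ref).sum(axis=1)

def correlation_with_humans(model_scores, human_scores):
    prs = pearsonr(model_scores, human_scores)[0]   # linear relationship
    spr = spearmanr(model_scores, human_scores)[0]  # monotonic relationship
    return prs, spr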
@Akella17 : There must be a bug in your fine-tuning. I ran the evaluation of XLM_15 for MLM and MLM+TLM on STS-B and I obtained 86.4/86.9 (pearson/spearman) for the latter (mBERT is at 86%, although it has more languages).
Your result might be due to a preprocessing issue? I would like to understand how you obtained that result as it might be related to a bug on the github.
@aconneau @glample My exact implementation is as follows:
python glue-xnli.py
--exp_name test_xnli_mlm_tlm # experiment name
--dump_path ./dumped/ # where to store the experiment
--model_path mlm_tlm_xnli15_1024.pth # model location
--data_path ./data/processed/XLM15 # data location
--transfer_tasks STS-B # transfer tasks (XNLI or GLUE tasks)
--optimizer adam,lr=0.000005 # optimizer
--batch_size 8 # batch size
--n_epochs 250 # number of epochs
--epoch_size 20000 # number of sentences per epoch
--max_len 256 # max number of words in sentences
--max_vocab 95000 # max number of words in vocab
I also tried using LR = 1e-4, 2e-5 but the score always converged to ~20-25%.
@Akella17 this is not expected. Can you send us your training log so we can have a look?
Thanks @Akella17 for the feedback, there seems to be a problem indeed, I will look at it and see if I find a problem. In the meantime, as Guillaume mentioned, could you provide us with your log? Thanks
@aconneau @glample Sorry for the late reply. There must have been some problem with the older version of the repository that led to the bad STS-B score. The latest version of the repository works fine on the STS-B task.
My target downstream task was to reproduce the BertScore paper on XLM15. While BertScore does n-gram matching over the output embeddings of the two sentences (mt_output and reference), I wanted to solve the same problem with regression (like STS-B) and finetuning. I was able to match the BERT n-gram score (mt_output and reference) using XLM15, but the regression score stayed nearly uncorrelated with the human annotated scores. Also, the cross-lingual (source and mt_output) n-gram score is not great either (this, however, can be a separate issue).
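For context, here is a minimal sketch of the greedy soft-matching idea from the BertScore paper, applied to pre-computed contextual token embeddings. The function and tensor names are illustrative assumptions, not code from the XLM repo.

import torch
import torch.nn.functional as F

def greedy_match_f1(cand_emb, ref_emb):
    """
    BertScore-style greedy soft matching between two sentences.
    cand_emb: (n_cand_tokens, dim) contextual token embeddings of the MT output
    ref_emb:  (n_ref_tokens, dim) contextual token embeddings of the reference
    """
    cand = F.normalize(cand_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)
    sim = cand @ ref.t()                   # (n_cand, n_ref) pairwise cosine similarities
    precision = sim.max(dim=1)[0].mean()   # best reference match for each candidate token
    recall = sim.max(dim=0)[0].mean()      # best candidate match for each reference token
    return 2 * precision * recall / (precision + recall)

The published BertScore additionally applies IDF weighting to the token matches, which is omitted in this sketch.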
My regression code is a slight modification of the glue-xnli.py and src/evaluation/glue.py files:
Main file:
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

import os
import argparse

from src.utils import bool_flag, initialize_exp
from src.evaluation.Metric import Metric
from src.model.embedder import SentenceEmbedder


# parse parameters
parser = argparse.ArgumentParser(description='Train on GLUE or XNLI')

# main parameters
parser.add_argument("--exp_name", type=str, default="test_Metric_task",
                    help="Experiment name")
parser.add_argument("--dump_path", type=str, default="./dumped/",
                    help="Experiment dump path")
parser.add_argument("--exp_id", type=str, default="",
                    help="Experiment ID")
parser.add_argument("--ref_flag", type=bool_flag, default=False,
                    help="Whether to train on reference or source sentences")

# float16
parser.add_argument("--fp16", type=bool_flag, default=False,
                    help="Run model with float16")

# evaluation task / pretrained model
parser.add_argument("--GPU_ids", type=str, default="0",
                    help="Which GPUs to run the code on")
parser.add_argument("--transfer_tasks", type=str, default="de-en",
                    help="Transfer tasks, example: 'MNLI-m,RTE,XNLI' ")
parser.add_argument("--model_path", type=str, default="mlm_en_2048.pth",
                    help="Model location")

# data
parser.add_argument("--data_path", type=str, default="",
                    help="Data path")
parser.add_argument("--max_vocab", type=int, default=95000,
                    help="Maximum vocabulary size (-1 to disable)")
parser.add_argument("--min_count", type=int, default=0,
                    help="Minimum vocabulary count")

# batch parameters
parser.add_argument("--max_len", type=int, default=256,
                    help="Maximum length of sentences (after BPE)")
parser.add_argument("--group_by_size", type=bool_flag, default=False,
                    help="Sort sentences by size during the training")
parser.add_argument("--batch_size", type=int, default=8,
                    help="Number of sentences per batch")
parser.add_argument("--max_batch_size", type=int, default=0,
                    help="Maximum number of sentences per batch (used in combination with tokens_per_batch, 0 to disable)")
parser.add_argument("--tokens_per_batch", type=int, default=-1,
                    help="Number of tokens per batch")

# model / optimization
parser.add_argument("--finetune_layers", type=str, default='0:_1',
                    help="Layers to finetune. 0 = embeddings, _1 = last encoder layer")
parser.add_argument("--weighted_training", type=bool_flag, default=False,
                    help="Use a weighted loss during training")
parser.add_argument("--dropout", type=float, default=0,
                    help="Fine-tuning dropout")
parser.add_argument("--optimizer", type=str, default="adam,lr=0.000005",
                    help="Optimizer")
parser.add_argument("--n_epochs", type=int, default=250,
                    help="Maximum number of epochs")
parser.add_argument("--epoch_size", type=int, default=20000,
                    help="Epoch size (-1 for full pass over the dataset)")

# debug
parser.add_argument("--debug_train", type=bool_flag, default=False,
                    help="Use valid sets for train sets (faster loading)")
parser.add_argument("--debug_slurm", type=bool_flag, default=False,
                    help="Debug multi-GPU / multi-node within a SLURM job")

# parse parameters
params = parser.parse_args()

# select GPUs before any CUDA initialization
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = params.GPU_ids

params.data_path = './data/processed/' + params.data_path
if params.tokens_per_batch > -1:
    params.group_by_size = True

# check parameters
assert os.path.isdir(params.data_path)
assert os.path.isfile(params.model_path)

# tasks
params.transfer_tasks = params.transfer_tasks.split(',')
assert len(params.transfer_tasks) > 0

# reload pretrained model
embedder = SentenceEmbedder.reload(params.model_path, params)

# reload langs from pretrained model
params.n_langs = embedder.pretrain_params['n_langs']
params.id2lang = embedder.pretrain_params['id2lang']
params.lang2id = embedder.pretrain_params['lang2id']

# initialize the experiment / build sentence embedder
logger = initialize_exp(params)
scores = {}

# prepare trainers / evaluators
# glue = GLUE(embedder, scores, params)
# xnli = XNLI(embedder, scores, params)
metric = Metric(embedder, scores, params)

# run
for task in params.transfer_tasks:
    # if task in GLUE_TASKS:
    #     glue.run(task)
    # if task in XNLI_TASKS:
    #     xnli.run()
    metric.run(task)
Evaluation file:
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#

from logging import getLogger
import os
import copy
import time
import json
from collections import OrderedDict

import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import f1_score, matthews_corrcoef

from src.fp16 import network_to_half
from apex.fp16_utils import FP16_Optimizer

from ..utils import get_optimizer, concat_batches, truncate, to_cuda
from ..data.dataset import Dataset, ParallelDataset
from ..data.loader import load_binarized, set_dico_parameters


logger = getLogger()


class Metric:

    def __init__(self, embedder, scores, params):
        """
        Initialize the metric trainer / evaluator.
        Initial `embedder` should be on CPU to save memory.
        """
        self._embedder = embedder
        self.params = params
        self.scores = scores

    def get_iterator(self, splt):
        """
        Build data iterator.
        """
        return self.data[splt]['x'].get_iterator(
            shuffle=(splt == 'train'),
            return_indices=True,
            group_by_size=self.params.group_by_size
        )

    def run(self, task):
        """
        Run training / evaluation for one language pair.
        """
        params = self.params

        # task parameters
        self.task = task
        params.out_features = 1

        # load data
        self.data = self.load_data(task)
        if not self.data['dico'] == self._embedder.dico:
            raise Exception(("Dictionary in evaluation data (%i words) seems different than the one " +
                             "in the pretrained model (%i words). Please verify you used the same dictionary, " +
                             "and the same values for max_vocab and min_count.") % (len(self.data['dico']), len(self._embedder.dico)))

        # embedder
        self.embedder = copy.deepcopy(self._embedder)
        self.embedder.cuda()

        # projection layer
        self.proj = nn.Sequential(*[
            nn.Dropout(params.dropout),
            nn.Linear(self.embedder.out_dim, params.out_features)
        ]).cuda()

        # float16
        if params.fp16:
            assert torch.backends.cudnn.enabled
            self.embedder.model = network_to_half(self.embedder.model)
            self.proj = network_to_half(self.proj)

        # optimizer
        self.optimizer = get_optimizer(
            list(self.embedder.get_parameters(params.finetune_layers)) +
            list(self.proj.parameters()),
            params.optimizer
        )
        if params.fp16:
            self.optimizer = FP16_Optimizer(self.optimizer, dynamic_loss_scale=True)

        # train and evaluate the model
        for epoch in range(params.n_epochs):

            # update epoch
            self.epoch = epoch

            # training
            logger.info("Metric Task - %s - Training epoch %i ..." % (task, epoch))
            self.train()

            # evaluation
            logger.info("Metric Task - %s - Evaluating epoch %i ..." % (task, epoch))
            with torch.no_grad():
                scores = self.eval('valid')
            self.scores.update(scores)

    def train(self):
        """
        Finetune for one epoch on the training set.
        """
        params = self.params
        self.embedder.train()
        self.proj.train()

        # training variables
        losses = []
        ns = 0  # number of sentences
        nw = 0  # number of words
        t = time.time()
        iterator = self.get_iterator('train')
        lang_id_1 = params.lang2id[self.lang1]
        lang_id_2 = params.lang2id[self.lang2]

        while True:

            # batch
            try:
                batch = next(iterator)
            except StopIteration:
                break
            (sent1, len1), (sent2, len2), idx = batch
            sent1, len1 = truncate(sent1, len1, params.max_len, params.eos_index)
            sent2, len2 = truncate(sent2, len2, params.max_len, params.eos_index)
            x, lengths, positions, langs = concat_batches(sent1, len1, lang_id_1, sent2, len2, lang_id_2, params.pad_index, params.eos_index, reset_positions=False)
            y = self.data['train']['y'][idx]
            bs = len(lengths)

            # cuda
            x, y, lengths, positions, langs = to_cuda(x, y, lengths, positions, langs)

            # loss
            if params.ref_flag:
                output = self.proj(self.embedder.get_embeddings(x, lengths, positions))
            else:
                output = self.proj(self.embedder.get_embeddings(x, lengths, positions, langs))
            if params.fp16:
                loss = F.mse_loss(output.squeeze(1), y.type(torch.float16))
            else:
                loss = F.mse_loss(output.squeeze(1), y.float())

            # backward / optimization
            self.optimizer.zero_grad()
            if params.fp16:
                self.optimizer.backward(loss)
            else:
                loss.backward()
            self.optimizer.step()

            # update statistics
            ns += bs
            nw += lengths.sum().item()
            losses.append(loss.item())

            # log
            if ns != 0 and ns % (10 * bs) < bs:
                logger.info(
                    "Metric Task - %s - Epoch %s - Train iter %7i - %.1f words/s - %s Loss: %.4f"
                    % (self.task, self.epoch, ns, nw / (time.time() - t), 'MSE', sum(losses) / len(losses))
                )
                nw, t = 0, time.time()
                losses = []

            # epoch size
            if params.epoch_size != -1 and ns >= params.epoch_size:
                break

    def eval(self, splt='valid'):
        """
        Evaluate on the validation set.
        """
        params = self.params
        self.embedder.eval()
        self.proj.eval()

        scores = OrderedDict({'epoch': self.epoch})
        task = self.task.lower()

        pred = []  # predicted values
        gold = []  # real values
        lang_id_1 = params.lang2id[self.lang1]
        lang_id_2 = params.lang2id[self.lang2]

        for batch in self.get_iterator(splt):

            # batch
            (sent1, len1), (sent2, len2), idx = batch
            # sent1, len1 = truncate(sent1, len1, params.max_len, params.eos_index)
            # sent2, len2 = truncate(sent2, len2, params.max_len, params.eos_index)
            x, lengths, positions, langs = concat_batches(sent1, len1, lang_id_1, sent2, len2, lang_id_2, params.pad_index, params.eos_index, reset_positions=False)
            y = self.data[splt]['y'][idx]

            # cuda
            x, y, lengths, positions, langs = to_cuda(x, y, lengths, positions, langs)

            # prediction
            if params.ref_flag:
                output = self.proj(self.embedder.get_embeddings(x, lengths, positions))
            else:
                output = self.proj(self.embedder.get_embeddings(x, lengths, positions, langs))
            p = output.squeeze(1)
            pred.append(p.cpu().numpy())
            gold.append(y.cpu().numpy())

        gold = np.concatenate(gold)
        pred = np.concatenate(pred)

        scores['%s_valid_prs' % task] = 100. * pearsonr(pred, gold)[0]
        scores['%s_valid_spr' % task] = 100. * spearmanr(pred, gold)[0]

        logger.info("__log__:%s (percentage not fraction)" % json.dumps(scores))
        return scores

    def load_data(self, task):
        """
        Load pair regression/classification bi-sentence tasks.
        """
        params = self.params
        if params.ref_flag:
            self.lang1 = 'en'
            self.lang2 = task.split('-')[1]
        else:
            self.lang1 = task.split('-')[0]
            self.lang2 = task.split('-')[1]
        data = {splt: {} for splt in ['train', 'valid', 'test']}
        dpath = os.path.join(params.data_path, task)

        for splt in ['train', 'valid', 'test']:

            # load data and dictionary
            if params.ref_flag:
                data1 = load_binarized(os.path.join(dpath, '%s.rf.%s.pth' % (task, splt)), params)
                data2 = load_binarized(os.path.join(dpath, '%s.%s.%s.pth' % (task, self.lang2, splt)), params)
            else:
                data1 = load_binarized(os.path.join(dpath, '%s.%s.%s.pth' % (task, self.lang1, splt)), params)
                data2 = load_binarized(os.path.join(dpath, '%s.%s.%s.pth' % (task, self.lang2, splt)), params)
            assert data1['dico'] == data2['dico']
            data['dico'] = data.get('dico', data1['dico'])

            # set dictionary parameters
            set_dico_parameters(params, data, data1['dico'])
            set_dico_parameters(params, data, data2['dico'])

            # create dataset
            data[splt]['x'] = ParallelDataset(
                data1['sentences'], data1['positions'],
                data2['sentences'], data2['positions'],
                params
            )

            # load labels
            if splt != 'test':
                # read labels from file
                with open(os.path.join(dpath, '%s.label.%s' % (task, splt)), 'r') as f:
                    lines = [l.rstrip() for l in f]
                assert all(-2 <= float(x) <= 2 for x in lines)
                y = [float(l) for l in lines]
                # note: casting to LongTensor truncates the float regression targets to integers
                data[splt]['y'] = torch.LongTensor(y)
                assert len(data[splt]['x']) == len(data[splt]['y'])

        # compute weights for weighted training
        self.weights = None

        return data
@aconneau @glample The code for this issue can be found here. Run metric_nmt.py to train the cross-lingual regression model.
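For anyone following along, an invocation in the style of the glue-xnli.py command above might look like the following. The flags come from the argparse section of the main file; the model path and data folder name are placeholders.
python metric_nmt.py
--exp_name test_metric_de_en              # experiment name
--dump_path ./dumped/                     # where to store the experiment
--model_path mlm_tlm_xnli15_1024.pth      # model location
--data_path WMT_metric                    # placeholder folder, prefixed with ./data/processed/ by the script
--transfer_tasks de-en                    # language pair(s), each parsed as lang1-lang2
--ref_flag False                          # False: <source, mt_output> pairs, True: <reference, mt_output> pairs
--optimizer adam,lr=0.000005              # optimizer
--batch_size 8                            # batch size
--n_epochs 250                            # number of epochs
--epoch_size 20000                        # number of sentences per epoch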
@Akella17 : Is your issue with STS-B fixed?
@aconneau Yes. The issue with STS-B does not exist in the latest version of the XLM repo. Since my task is to perform regression finetuning on human-annotated mt_scores, I made minimal modifications to the STS-B code in glue-xnli.py and src/evaluation/glue.py, and renamed them metric_nmt.py and src/evaluation/Metric.py respectively to fit my requirements.
Training data consists of <source, machine output, reference, score labels> tuples. When I tried regression finetuning (same as STS-B) using BERT (input: <machine output, reference>, target: score labels), I get a Pearson correlation of ~60% on the English-German WMT metric task. However, when I tried using the above-mentioned modified XLM code, I get a Pearson correlation of <15%. This result is consistent for both the (input: <machine output, reference>, target: score labels) and (input: <machine output, source>, target: score labels) configurations.
Great if the problem of the STS-B task is solved. Not sure if you're expecting something, I'm not going to have time to review your code above. I'm closing the issue but feel free to re-open if needed.
The pretrained MLM+TLM model achieves < 30% Pearson correlation with human scores. With CBOW at 60% and BERT at 86%, this score seems low for the STS-B task.
I am not sure if there is a mistake in my implementation or if the MLM+TLM model does not work for the STS-B task. Can someone confirm this? @aconneau @glample