Portuguese BERT and XLM-R models fine-tuned for semantic role labeling.

Semantic Role Labeling in Portuguese: Improving the State of the Art with Transfer Learning and BERT-based Models

This work was developed in the context of my Master's thesis in Data Science. The code is based on the AllenNLP package, and the pre-trained models come from 🤗 Transformers and neuralmind-ai's BERTimbau - Portuguese BERT.

There are three branches in this repository, corresponding to three different versions of the AllenNLP package. The branch v1.0.0rc3 contains the code used to train the models reported in the article. The branch v1.0.0 contains the code used to test them. The models were trained and tested under different versions because a bug in AllenNLP 1.0.0rc3 prevented testing some models. The main branch contains the code needed to make predictions with the trained models.

Models

The trained models can be obtained using the get_model.py script in the main branch.

python get_model.py [model name]

This is necessary because each trained SRL model is split into two parts: the transformer portion is stored with the 🤗 Transformers community models, and the linear layer is stored in this repository (in the Models folder). The get_model.py script joins the two portions and saves the complete model in a folder named [model name].
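Conceptually, the reassembly amounts to something like the sketch below. This is only an illustration of the idea; the hub namespace, file names, and folder layout are assumptions, not taken from the actual get_model.py.

import os
import torch
from transformers import AutoModel

# Conceptual sketch only; names and file layout are assumptions.
model_name = "srl-pt_bertimbau-base"

# 1. Fetch the transformer portion from the 🤗 Transformers community
#    models ("liaad" is assumed as the hub namespace).
transformer = AutoModel.from_pretrained(f"liaad/{model_name}")

# 2. Load the linear-layer weights stored in this repository's Models folder
#    (hypothetical file name).
linear_state = torch.load(f"Models/{model_name}.pt", map_location="cpu")

# 3. Merge the two state dicts and save the complete model.
os.makedirs(model_name, exist_ok=True)
full_state = {**transformer.state_dict(), **linear_state}
torch.save(full_state, os.path.join(model_name, "model.pt"))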

The following table lists every available model name, a short description of the model, the average F1 on the PropBank.Br cross-validation folds (in domain), and the average F1 on the Buscapé set (out of domain). For more information, please refer to the article.

| Model Name | F1 CV PropBank.Br (in domain) | F1 Buscapé (out of domain) | Explanation |
|---|---|---|---|
| srl-pt_bertimbau-base | 76.30 | 73.33 | The (monolingual) BERTimbau base model trained on Portuguese SRL data |
| srl-pt_bertimbau-large | 77.42 | 74.85 | The (monolingual) BERTimbau large model trained on Portuguese SRL data |
| srl-pt_xlmr-base | 75.22 | 72.82 | The (multilingual) XLM-R base model trained on Portuguese SRL data |
| srl-pt_xlmr-large | 77.59 | 73.84 | The (multilingual) XLM-R large model trained on Portuguese SRL data |
| srl-pt_mbert-base | 72.76 | 66.89 | The multilingual cased BERT model trained on Portuguese SRL data |
| srl-en_xlmr-base | 66.59 | 65.24 | The (multilingual) XLM-R base model trained on English SRL data (specifically a pre-processed CoNLL-2012 data set) and tested on Portuguese SRL data |
| srl-en_xlmr-large | 67.60 | 64.94 | The (multilingual) XLM-R large model trained on English SRL data (specifically a pre-processed CoNLL-2012 data set) and tested on Portuguese SRL data |
| srl-en_mbert-base | 63.07 | 58.56 | The multilingual cased BERT model trained on English SRL data (specifically a pre-processed CoNLL-2012 data set) and tested on Portuguese SRL data |
| srl-enpt_xlmr-base | 76.50 | 73.74 | The (multilingual) XLM-R base model trained on English SRL data (specifically a pre-processed CoNLL-2012 data set) and then on Portuguese SRL data |
| srl-enpt_xlmr-large | 78.22 | 74.55 | The (multilingual) XLM-R large model trained on English SRL data (specifically a pre-processed CoNLL-2012 data set) and then on Portuguese SRL data |
| srl-enpt_mbert-base | 74.88 | 69.19 | The multilingual cased BERT model trained on English SRL data (specifically a pre-processed CoNLL-2012 data set) and then on Portuguese SRL data |
| ud_srl-pt_bertimbau-large | 77.53 | 74.49 | The (monolingual) BERTimbau large model trained first on dependency parsing with the Universal Dependencies Portuguese data set and then on Portuguese SRL data |
| ud_srl-pt_xlmr-large | 77.69 | 74.91 | The (multilingual) XLM-R large model trained first on dependency parsing with the Universal Dependencies Portuguese data set and then on Portuguese SRL data |
| ud_srl-enpt_xlmr-large | 77.97 | 75.05 | The (multilingual) XLM-R large model trained first on dependency parsing with the Universal Dependencies Portuguese data set, then on English SRL data (specifically a pre-processed CoNLL-2012 data set) and finally on Portuguese SRL data |

To Predict

To use the trained models for SRL prediction, first install allennlp and allennlp_models v1.2.2. With pip:

pip install allennlp==1.2.2 allennlp_models==1.2.2

Download the main branch of this repository. From the list of available models (see the table above), choose the one best suited to your application (see Choosing the best model below for help) and download it using:

python get_model.py [model name]

Then run the my_predict.py script with the chosen model and the text you want to predict SRL labels for.

python my_predict.py [model name] [text/to/predict] ([lang])

[text/to/predict] can be either a string or the path to a text file containing the text to label. [lang] is an optional argument for predicting English sentences: for Portuguese, write nothing after [text/to/predict]; for English, write anything there.

The results are written to output.txt. Note that if you run this several times, the results are appended to the existing file; it is not overwritten.

Example

python get_model.py srl-pt_bertimbau-large

python my_predict.py srl-pt_bertimbau-large "Só precisa ganhar experiência"
# or
python my_predict.py srl-pt_bertimbau-large pred.txt  # where pred.txt contains "Só precisa ganhar experiência"

# For English:
python my_predict.py srl-pt_bertimbau-large "John read a book." english
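If you prefer calling a model from Python instead of my_predict.py, something along these lines should work with AllenNLP 1.2.2. The archive path below is an assumption based on standard AllenNLP conventions, not this repository's documented layout:

from allennlp.predictors.predictor import Predictor
import allennlp_models.structured_prediction  # registers the SRL predictor

# Assumption: the downloaded folder contains a standard AllenNLP archive.
predictor = Predictor.from_path(
    "srl-pt_bertimbau-large/model.tar.gz",  # hypothetical archive location
    predictor_name="semantic_role_labeling",
)
result = predictor.predict(sentence="Só precisa ganhar experiência")
for verb in result["verbs"]:
    print(verb["description"])  # the tagged frame for each predicate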

Choosing the best model

We provide an implementation of the heuristic mentioned in the article, described by the following figure (taken from the article listed under Citation).

[Figure: heuristic for choosing the best model]

To run the Choose Best Model/tool.py script, you must install streamlit.

pip install streamlit

streamlit run "Choose Best Model/tool.py"

In this app, you choose the semantic roles relevant to your application (by removing the ones that do not interest you) and the type of data you have. The output is the best model, along with plots of the total F1 and the per-role F1 achieved by each model.
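The core of such a tool can be sketched in a few lines of streamlit. Everything below (file name, column layout) is a hypothetical illustration of the idea, not the repository's actual tool.py:

import pandas as pd
import streamlit as st

# Hypothetical input: one row per model, one column per semantic role,
# cell values are per-role F1 scores.
scores = pd.read_csv("per_role_f1.csv", index_col="model")

roles = st.multiselect(
    "Semantic roles of interest",
    options=list(scores.columns),
    default=list(scores.columns),
)

if roles:
    # Average F1 over the selected roles only, then rank the models.
    mean_f1 = scores[roles].mean(axis=1).sort_values(ascending=False)
    st.write(f"Best model: {mean_f1.index[0]} (mean F1 = {mean_f1.iloc[0]:.2f})")
    st.bar_chart(mean_f1)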

Branch v1.0.0rc3

To reproduce the results, it is first necessary to train the models. To do so, install the pytorch package v1.5.0 with the command from the PyTorch website matching your machine's CUDA version, and then allennlp, allennlp_models, iterative-stratification and pandas:

pip install allennlp==1.0.0rc3 allennlp_models==1.0.0rc3 iterative-stratification pandas

Next, clone or download the v1.0.0rc3 branch of this repository.

The data must be added manually. The code expects a data folder inside this repository's folder; within it, there must be 4 folders:

Transforming XML to CoNLL data

python xml_to_conll.py
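As a rough illustration of this kind of conversion (the PropBank.Br XML schema is not shown here, so every file, element, and attribute name below is a made-up placeholder):

import xml.etree.ElementTree as ET

# All file/element/attribute names here are hypothetical placeholders.
tree = ET.parse("data/example.xml")
with open("example.conll", "w", encoding="utf-8") as out:
    for sentence in tree.iterfind(".//sentence"):
        for i, token in enumerate(sentence.iterfind(".//token"), start=1):
            # CoNLL-style layout: one token per line, tab-separated columns.
            out.write(f"{i}\t{token.get('form')}\t{token.get('role', 'O')}\n")
        out.write("\n")  # blank line separates sentences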

Create folds

python create_folds.py
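Since iterative-stratification is among the dependencies, fold creation plausibly relies on its MultilabelStratifiedKFold, which keeps the distribution of role labels similar across folds. A minimal sketch of that idea, with made-up data:

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

# Made-up example: 8 sentences, multi-hot vectors marking which of
# 4 hypothetical role labels each sentence contains.
X = np.arange(8).reshape(-1, 1)           # sentence indices
y = np.random.randint(0, 2, size=(8, 4))  # multi-hot role indicators

mskf = MultilabelStratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(mskf.split(X, y)):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")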

Train all the models

python train.py

Branch v1.0.0

To reproduce the results, it is then necessary to test the models. To do so, install the pytorch package v1.6.0 with the command from the PyTorch website matching your machine's CUDA version, and then allennlp, allennlp_models, iterative-stratification and pandas:

pip install allennlp==1.0.0 allennlp_models==1.0.0 iterative-stratification pandas

Next, clone or download the v1.0.0 branch of this repository.

The data must be added manually: simply copy the data folder obtained previously.

Test all the models

python train.py

Besides the metrics for each test fold and for Buscapé, the program also outputs, for each tested (model, dataset) pair, a file with the predicted and gold tags.
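Those per-pair files make it easy to recompute statistics offline. For instance, assuming a hypothetical two-column format (gold tag, predicted tag, tab-separated, one token per line, blank line between sentences), a quick token-level check could look like:

from collections import Counter

# File name and two-column format are assumptions for illustration.
correct = total = 0
confusions = Counter()
with open("predictions_fold0.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # sentence boundary
        gold, pred = line.split("\t")
        total += 1
        if gold == pred:
            correct += 1
        else:
            confusions[(gold, pred)] += 1

print(f"token accuracy: {correct / total:.3f}")
print("most common confusions:", confusions.most_common(5))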

Citation

@INPROCEEDINGS{9564238,
  author={Oliveira, Sofia and Loureiro, Daniel and Jorge, Alípio},
  booktitle={2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)},
  title={Improving Portuguese Semantic Role Labeling with Transformers and Transfer Learning},
  year={2021},
  pages={1-9},
  doi={10.1109/DSAA53316.2021.9564238}
}