SpanBERTa is a transformer language model for Spanish. It is the same size as BERT-Base and was trained on 18 GB of the Spanish portion of the OSCAR corpus, following the pretraining approach described in RoBERTa: A Robustly Optimized BERT Pretraining Approach.
| Module | Download |
|---|---|
| SpanBERTa | `config.json`, `pytorch_model.bin` |
| Tokenizer | `merges.txt`, `vocab.json` |
The model uses a byte-level BPE tokenizer with a vocabulary of 50,265 sub-word tokens and was trained for 200K steps over 8 days on 4 Tesla V100 GPUs.
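To see the tokenizer in action, the snippet below (a minimal sketch; the example sentence is arbitrary) splits a Spanish sentence into sub-word pieces and maps them to vocabulary ids:

```python
from transformers import AutoTokenizer

# Byte-level BPE tokenizer with a 50,265-token vocabulary.
tokenizer = AutoTokenizer.from_pretrained("skimai/spanberta-base-cased")

# Common words stay whole; rarer words are split into sub-word pieces.
tokens = tokenizer.tokenize("Lavarse frecuentemente las manos.")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```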
Training loss: TensorBoard
The model weights can be loaded with Hugging Face's `transformers` library:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("skimai/spanberta-base-cased")
model = AutoModelWithLMHead.from_pretrained("skimai/spanberta-base-cased")
```
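With the model and tokenizer in hand, a `<mask>` token can also be filled without the pipeline helper. The following is a minimal sketch (the scoring logic here is illustrative, not part of the model card):

```python
import torch

text = "Lavarse frecuentemente las manos con agua y <mask>."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the highest-scoring token.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_token_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(top_token_id))
```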
Example using `pipeline`:
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="skimai/spanberta-base-cased",
    tokenizer="skimai/spanberta-base-cased"
)

fill_mask("Lavarse frecuentemente las manos con agua y <mask>.")
```
```
[{'score': 0.6469631195068359,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y jabón.</s>',
  'token': 18493},
 {'score': 0.06074320897459984,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y sal.</s>',
  'token': 619},
 {'score': 0.029787985607981682,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y vapor.</s>',
  'token': 11079},
 {'score': 0.026410052552819252,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y limón.</s>',
  'token': 12788},
 {'score': 0.017029203474521637,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y vinagre.</s>',
  'token': 18424}]
```
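Each entry carries the raw `token` id, so (as a small sketch, reusing the tokenizer loaded above) the top completion can be recovered programmatically:

```python
predictions = fill_mask("Lavarse frecuentemente las manos con agua y <mask>.")
best = predictions[0]

# Decode the predicted token id back to its surface form.
print(round(best["score"], 3), tokenizer.decode([best["token"]]))
```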