BERT-ita-xxl - Question about corpus

dbmdz / berts

DBMDZ BERT, DistilBERT, ELECTRA, GPT-2 and ConvBERT models

MIT License

155 stars 12 forks source link

BERT-ita-xxl - Question about corpus #43

Closed NeuroinformaticaFBF closed 2 years ago

NeuroinformaticaFBF commented 2 years ago

Greetings dear dbmdz team

One question that I would like to ask: as I see from HuggingFace page of the model, "The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the OPUS corpora collection... For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the OSCAR corpus". Were these dataset originally written in Italian, or were they english text translated?

I'm asking this because my medical corpus was originally written in English, and I used the Google translated API to translate it. So I would like to estimate the bias introduced by this operation.

Many thanks, Cheers

stefan-it commented 2 years ago

Hi @NeuroinformaticaFBF ,

I'm not sure how the OPUS corpora were created, but here are the relevant files/corpora that were used from OPUS - with the relevant info pages:

it_emea.txt - https://opus.nlpl.eu/EMEA-v3.php
it_eubookshop.txt - https://opus.nlpl.eu/EUbookshop-v2.php
it_europarl.txt - https://opus.nlpl.eu/Europarl-v8.php
it_jrc.txt - https://opus.nlpl.eu/JRC-Acquis-v3.0.php
it_jw300.txt - ?
it_opensubtitles.txt - https://opus.nlpl.eu/OpenSubtitles-v2018.php
it_opusdgt.txt - https://opus.nlpl.eu/DGT-v2019.php
it_paracrawl.txt - https://opus.nlpl.eu/ParaCrawl-v8.php
it_tildemodel.txt - https://opus.nlpl.eu/TildeMODEL-v2018.php
it_wikipedia_opus.txt - https://opus.nlpl.eu/Wikipedia-v1.0.php

So you may check the info on these info pages if you can find some translated datasets :)

NeuroinformaticaFBF commented 2 years ago

Many thanks, I'm looking into that. Just to be clear: was the Wikipedia dump from italian Wikipedia?

stefan-it commented 2 years ago

Hi @NeuroinformaticaFBF , yes it was a dump from official Wikimedia page. Basically, it was downloaded and processed like this:

wget http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2

# Use old version of WikiExtractor
wget https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py

python3 WikiExtractor.py -c -b 25M -o extracted itwiki-latest-pages-articles.xml.bz2
find extracted -name '*bz2' \! -exec bzip2 -k -c -d {} \; > itwiki.xml

# Postprocessing
sed -i 's/<[^>]*>//g' itwiki.xml
sed -i '/^\s*$/d' itwiki.xml
rm -rf extracted
mv itwiki.xml itwiki.txt

NeuroinformaticaFBF commented 2 years ago

Thanks for your answer