Hi @NeuroinformaticaFBF ,
I'm not sure how the OPUS corpora were created, but here are the relevant files/corpora from OPUS that were used, with their info pages:
So you can check those info pages to see whether any of the datasets are translated :)
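If it helps, here is a minimal sketch of how an OPUS corpus could be inspected from the command line, assuming the OpusTools package and its opus_read tool; the corpus name OpenSubtitles, the package name, and the exact flags are assumptions, so please double-check them against the Helsinki-NLP/OpusTools documentation:
# Hypothetical inspection of an OPUS corpus - not the pipeline used for this model.
# Package name and flags are assumptions; verify against the OpusTools README.
pip install opustools
# Print a few aligned en-it sentence pairs to eyeball whether the Italian side reads like a translation
opus_read -d OpenSubtitles -s en -t it -m 10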
Many thanks, I'm looking into that. Just to be clear: was the Wikipedia dump from the Italian Wikipedia?
Hi @NeuroinformaticaFBF , yes, it was a dump from the official Wikimedia download site. Basically, it was downloaded and processed like this:
wget http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
# Use old version of WikiExtractor
wget https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py
python3 WikiExtractor.py -c -b 25M -o extracted itwiki-latest-pages-articles.xml.bz2
# Decompress all extracted .bz2 chunks into a single file
find extracted -name '*bz2' \! -exec bzip2 -k -c -d {} \; > itwiki.xml
# Postprocessing: strip remaining markup tags and blank lines
sed -i 's/<[^>]*>//g' itwiki.xml
sed -i '/^\s*$/d' itwiki.xml
rm -rf extracted
mv itwiki.xml itwiki.txt
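A quick sanity check on the resulting itwiki.txt could look like this (the file name comes from the commands above; this check was not part of the original run):
# Optional sanity check on the extracted plain-text corpus
wc -l itwiki.txt       # line count after removing blank lines
wc -w itwiki.txt       # rough token count
head -n 5 itwiki.txt   # spot-check that the markup tags were stripped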
Thanks for your answer
Greetings dear dbmdz team
One question that I would like to ask: as I see on the Hugging Face page of the model, "The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the OPUS corpora collection... For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the OSCAR corpus". Were these datasets originally written in Italian, or were they English texts that were translated?
I'm asking because my medical corpus was originally written in English, and I used the Google Translate API to translate it, so I would like to estimate the bias introduced by this operation (a rough way to check this is sketched below).
Many thanks, Cheers
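As a rough way to gauge such translation bias, one could compare simple surface statistics (average sentence length, vocabulary size) of the machine-translated corpus against a natively written Italian sample, since machine-translated text tends to be lexically flatter. A minimal sketch, assuming one sentence per line and the hypothetical file names translated_medical.txt and native_italian.txt:
# Rough translationese check - file names are hypothetical, one sentence per line assumed
for f in translated_medical.txt native_italian.txt; do
  echo "== $f =="
  awk '{ words += NF } END { printf "avg words/line: %.2f\n", words / NR }' "$f"
  # distinct lowercase word types as a crude vocabulary-size measure
  tr ' ' '\n' < "$f" | tr '[:upper:]' '[:lower:]' | sort -u | wc -l
done
A markedly smaller vocabulary or a flatter sentence-length profile in the translated corpus would hint at noticeable translationese; for a stronger signal, an evaluation on the downstream task itself would still be needed.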