certainlyio / nordic_bert

Pre-trained Nordic models for BERT
Creative Commons Attribution 4.0 International

Next language #2

Open mollerhoj opened 4 years ago

mollerhoj commented 4 years ago

What language BERT model would you like to be released next?

ViktorAlm commented 4 years ago

Nice work!

Swedish: https://github.com/af-ai-center/bert

Finnish: https://github.com/TurkuNLP/FinBERT

emillykkejensen commented 4 years ago

ALBERT in Danish ;-)

grofte commented 4 years ago

> ALBERT in Danish ;-)

Or you could do knowledge distillation on this model. Here's a ton of synopses and links: https://blog.inten.to/speeding-up-bert-5528e18bb4ea
The Huawei TinyBERT and the single-layer model are both about an order of magnitude faster at inference than the base model.
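
For context, the core of most of those distillation recipes is just a soft-label loss along these lines (a minimal PyTorch sketch of the Hinton-style distillation loss, not code from any of the linked projects):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-label knowledge distillation: KL divergence between the
    # temperature-softened teacher and student distributions.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction='batchmean') * temperature ** 2

In practice this gets mixed with the ordinary cross-entropy on the hard labels.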

jbingel commented 4 years ago

Agree that a compressed Danish BERT (ALBERT/DistilBERT/TinyBERT/...) would be great!

grofte commented 4 years ago

> Agree that a compressed Danish BERT (ALBERT/DistilBERT/TinyBERT/...) would be great!

Joachim, I was suggesting that @emillykkejensen make one, not BotXO =]
But you could do it too! You're at a uni - you have lots of time.

ALBERT requires the source material but any kind of knowledge distillation method can run from the BERT weights posted here.

What would be really nice, though, is if BotXO posted their text-preprocessing code here. I can see that they lower-case everything but don't reduce repeated characters. That's a bit sad in my mind. I don't need a token for '=======' or whatever it was I saw in there. This function should do it:

import re

def squeeze_strings(s):
    # Pass non-strings (e.g. NaN floats from a pandas column) through untouched.
    if isinstance(s, float):
        return s
    # Collapse any single character repeated 3+ times down to two ('=======' -> '==').
    s = re.sub(r'(?P<rep>.)(?P=rep){2,}', r'\g<rep>\g<rep>', s)
    # Collapse any two-character sequence repeated 3+ times down to two ('ererer' -> 'erer').
    s = re.sub(r'(?P<rep>..)(?P=rep){2,}', r'\g<rep>\g<rep>', s)
    return s
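
For example, on the '=======' string mentioned above:

>>> squeeze_strings('=======')
'=='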

There are only 25 words in the Danish language that should be affected by squeeze_strings(), and in all cases it would just change the conjugation (e.g. from "bortopererer" to "bortoperer", that is, from present tense to imperative).
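
If anyone wants to verify that count, something like this against a one-word-per-line Danish word list would do (the file name below is just a placeholder, not a real resource):

# 'danish_wordlist.txt' is a placeholder for any one-word-per-line Danish word list.
with open('danish_wordlist.txt', encoding='utf-8') as f:
    words = [line.strip() for line in f if line.strip()]

# Words that squeeze_strings() would change, e.g. 'bortopererer'.
affected = [w for w in words if squeeze_strings(w) != w]
print(len(affected), affected[:10])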

mollerhoj commented 4 years ago

I would be surprised if the optimisation described above has any measurable impact on performance. Is it something that's been mentioned in the literature? If so, I've missed it.

I don't have permission to open-source the data fetching/preprocessing code (yet), but it's currently quite hacky, cleaning up bad stuff from the internet (there is a surprising amount of NSFW content on Common Crawl).

I'm currently in dialogue with a Norwegian professor about training ALBERT models :-)

Edit: Oh, and thank you so much for the interest and participation guys, very happy to see that 👍

jbingel commented 4 years ago

The blog post that @grofte links to, and the ALBERT paper linked in there (https://openreview.net/pdf?id=H1eA7AEtvS) do state those speedups quite clearly. Fewer parameters of course also means a smaller burden on memory.

(And answering @grofte -- you're wrong, I'm not at a uni at this very point in time, and even if I was, that wouldn't mean I had lots of time. :) )

mollerhoj commented 4 years ago

@jbingel sorry for being unclear: I was referring to the proposed squeeze_strings function. I'm well aware of the other improvements made to BERT (RoBERTa, ALBERT, DistilBERT, etc.).

grofte commented 4 years ago

@mollerhoj Oh, I don't think you would see a difference in performance. But you would probably train faster. The function is just stripping out nonsense filler with no semantic content.

But I'm guessing that you guys do normalization through decomposition, sentence splitting (how though?), lower-casing, and stripping of everything not in ASCII plus æøå. The BERT tokenizer should take care of everything else once it has the vocab file. The other stuff you do to prepare the data for pre-training shouldn't be that interesting for anyone employing the model.
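
In code, that guess would look roughly like this (just a sketch under my own assumptions; the Unicode normalization form and the exact character filter are guesses, and sentence splitting is left out since that's the unclear part):

import re
import unicodedata

def clean_line(text):
    # Sketch of the guessed pipeline: Unicode-normalize, lower-case, and drop
    # every character outside ASCII plus æ, ø, å. NFC is used here so 'å'
    # stays a single code point and survives the character filter.
    text = unicodedata.normalize('NFC', text).lower()
    return re.sub(r'[^\x00-\x7fæøå]', '', text)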

VildMedPap commented 4 years ago

Danish cased model!