brightmart / albert_zh

A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS; large-scale Chinese pre-trained ALBERT models
https://arxiv.org/pdf/1909.11942.pdf
3.93k stars 753 forks

English pre-trained release? #11

Closed jpcorb20 closed 5 years ago

jpcorb20 commented 5 years ago

Hello,

Are you planning on releasing English pre-trained versions of Albert in the future?

Thank you,

brightmart commented 5 years ago

As ALBERT has already achieved state-of-the-art performance on the main English benchmarks, I think the authors of the paper will release an English version in the near future.

jpcorb20 commented 5 years ago

Yes, it is true. I haven't found any info on a release yet. Maybe, I should train one myself. Thank you very much.

brightmart commented 5 years ago

Thank you. Where can I find the English corpus that was used in the paper?

jpcorb20 commented 5 years ago

If I am not mistaken, it is the same as BERT: BookCorpus and English Wikipedia (pre-processed with https://github.com/attardi/wikiextractor).

Awesome work by the way to reproduce ALBERT.
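For reference, wikiextractor's `--json` mode emits one JSON object per line with a `"text"` field per article. A minimal sketch of turning that output into the one-sentence-per-line, blank-line-between-documents format that BERT-style pretraining scripts expect (the function name is mine, and the naive period split is a placeholder for a real sentence tokenizer):

```python
import json


def wiki_json_to_pretrain_text(json_lines):
    """Convert wikiextractor --json output (one article per line, each
    a JSON object with a "text" field) into pretraining-style text:
    one sentence per line, blank line between documents.

    The sentence splitting below is a deliberately naive ". " split;
    a real pipeline would use a proper sentence tokenizer.
    """
    docs = []
    for line in json_lines:
        article = json.loads(line)
        sentences = [s.strip() for s in article["text"].split(". ") if s.strip()]
        docs.append("\n".join(sentences))
    # Blank line separates documents, as BERT-style data builders expect.
    return "\n\n".join(docs)


if __name__ == "__main__":
    sample = [
        json.dumps({"text": "First sentence. Second sentence."}),
        json.dumps({"text": "Another document."}),
    ]
    print(wiki_json_to_pretrain_text(sample))
```

This only covers the formatting step; deduplication, filtering, and tokenization would still be needed before actual pretraining.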

brightmart commented 5 years ago

Great, thanks.

pohanchi commented 5 years ago

Hi, I am also interested in this idea. By the way, do you have the BookCorpus data?

brightmart commented 5 years ago

Were you able to find BookCorpus using jpcorb20's link above?

pohanchi commented 5 years ago

No, but I'm curious about it. I tried to find it, but I couldn't get a working link.

jpcorb20 commented 5 years ago

You are right, sorry; the link inside is down. At the moment, the only thing I have found is this library for crawling the original website: https://github.com/soskek/bookcorpus ...

pohanchi commented 5 years ago

OK, thanks. I just want that dataset XD