-
Hi. I am trying to convert corpora from HF to their IPA form with the following snippet, but I am getting really slow speeds: only a couple of examples per second. Do you know how it can be sped up? …
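For context, a minimal sketch of the kind of pipeline I mean (the actual snippet is elided above; I am assuming the G2P step uses `phonemizer`). The usual speed-ups with `datasets` are `batched=True`, so one backend call handles many examples, plus `num_proc` for multiprocessing:

```python
from datasets import load_dataset
from phonemizer import phonemize  # assumed G2P backend; swap in your own

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def to_ipa(batch):
    # phonemize() accepts a list of strings, so one call converts the
    # whole batch instead of paying per-example startup cost.
    batch["ipa"] = phonemize(batch["text"], language="en-us", backend="espeak")
    return batch

# batched=True amortizes per-call overhead; num_proc parallelizes across CPUs.
ipa_dataset = dataset.map(to_ipa, batched=True, batch_size=256, num_proc=4)
```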
-
ML.NET 2.0 comes with Text classification.
The Text classification scenario is based on NAS-BERT, which is trained on _English Wikipedia plus BookCorpus_.
When using text classification for a not …
-
I know BERT has achieved SOTA on many NLP tasks, such as SQuAD and SWAG.
But note that the data (both training and test) of SQuAD is from **Wikipedia**, and that of SWAG is from the **BookCorpus**, a…
-
There is a paragraph like this on page 4 of your technical report:
> The text pairs selected from the web and other public sources are not guaranteed to be closely related. Therefore, dat…
-
## Description
In the function [`scripts.pretraining.bert.create_pretraining_data.tokenize_lines()`](https://github.com/dmlc/gluon-nlp/blob/5ff0519aa5a89e7e2a2c0afab164e17de55231b4/scripts/pretrainin…
-
Has anyone else run into this problem?
I know it is caused by the dataset being loaded, and it may be a version issue or a gzip-archive issue.
But I don't know how to fix it.
`UnicodeD…
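Since the truncated traceback looks like a decoding error, here is a minimal debugging sketch, assuming the failure happens while reading a gzip-compressed text file (the path is hypothetical; substitute the file named in your traceback):

```python
import gzip

path = "books_large_p1.txt.gz"  # hypothetical; use the file from the traceback

# Read with an explicit encoding and replace undecodable bytes instead of
# raising, to see whether the archive is mis-encoded or simply corrupt.
with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
    for i, line in enumerate(f):
        if "\ufffd" in line:  # the replacement char marks bad bytes
            print(i, line[:80])
            break
```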
-
Hi, _datasets_ is really amazing! I am following [run_mlm_no_trainer.py](url) to pre-train BERT, and it uses `tokenized_datasets = raw_datasets.map(tokenize_function, batche…
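For reference, the surrounding call in the script typically looks like the sketch below; the exact argument values are my assumptions (check your version of the script), and `tokenizer` is assumed to be defined already:

```python
def tokenize_function(examples):
    # Tokenize raw text; examples are later grouped into fixed-length blocks.
    return tokenizer(examples["text"], return_special_tokens_mask=True)

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,             # pass lists of examples to the tokenizer at once
    num_proc=4,               # parallel workers; tune to your machine
    remove_columns=["text"],  # drop the raw text once it is tokenized
    desc="Running tokenizer on dataset",
)
```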
-
It will only extract the files. Basically, extend the function to accept `url=None`.
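A minimal sketch of the requested behavior; the `download` and `extract` helpers are hypothetical stand-ins for whatever the function currently calls:

```python
# Hypothetical sketch: url=None means "skip the download, only extract".
def download_and_extract(url=None, archive_path=None):
    if url is not None:
        archive_path = download(url)   # assumed existing download helper
    if archive_path is None:
        raise ValueError("need either url or archive_path")
    return extract(archive_path)       # assumed existing extract helper
```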
-
I tried to download the bookscorpus data. So far I have only downloaded around 5000 books. Can anyone get all the books? I keep hitting `HTTP Error: 403 Forbidden`. How can I fix this? Or can I get all the…
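A common workaround for scripted downloads that return 403 (no guarantee the host allows it) is to send a browser-like User-Agent and retry with backoff; a minimal sketch with `requests`:

```python
import time
import requests

def fetch(url, retries=3):
    # Some hosts reject clients that do not send a browser-like User-Agent.
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
    for attempt in range(retries):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 200:
            return resp.content
        time.sleep(2 ** attempt)  # simple exponential backoff
    resp.raise_for_status()  # surface the final HTTP error
```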
-
When training BERT I used BookCorpus + Wikipedia-en data, with batch_size=5120, warmup=0.1, learning_rate=4e-4, deep_init enabled, no mixed precision, and steps (computed from 40 epochs) = 240k. But at around step 127k the loss suddenly increased and performance dropped, and the model stopped learning afterwards. What could be causing this? (Log shown below)
![B…