-
I computed the sentence embedding of each sentence of the bookcorpus data using BERT base and saved them to disk. I used 20M sentences, and the resulting Arrow file is about 59GB, while the original text fil…
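For context, a minimal sketch of how such an embedding pass could look, assuming the HuggingFace `transformers` and `datasets` libraries; the model name, mean pooling, batch size, and output path are illustrative assumptions, not necessarily the poster's exact setup:

```python
import torch
from datasets import load_dataset
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval().to("cuda")

def embed(batch):
    enc = tokenizer(batch["text"], padding=True, truncation=True,
                    max_length=128, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model(**enc)
    # Mean-pool the last hidden state into one vector per sentence
    # (a pooling choice made for this sketch).
    batch["embedding"] = out.last_hidden_state.mean(dim=1).cpu().numpy()
    return batch

dset = load_dataset("bookcorpus", split="train")
dset = dset.select(range(20_000_000))              # the 20M sentences mentioned above
dset = dset.map(embed, batched=True, batch_size=256)
dset.save_to_disk("/data/bookcorpus_embeddings")   # hypothetical path; stored as Arrow
```

For scale, 20M sentences times 768 float32 values per embedding is roughly 61GB of raw vectors, which is in line with the ~59GB Arrow file mentioned above.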
-
I think I finished step 5, `/workspace/bert/data/create_datasets_from_start.sh`, in the quick start guide. It took a whole day or so.
Now I am trying to run `bash scripts/run_pretraining.sh benchmark`.
…
-
Errors occur when running `preprocessor.py`. The data is bookcorpus.
```
$ python3 -m preprocess --dataset=bookcorpus --shards=2048 --processes=64 --cache_dir=/data/bert_train/bookcorpus --tfrecor…
-
The `prepare_bookcorpus.py` file is missing from this README.md: https://github.com/dmlc/gluon-nlp/tree/master/scripts/datasets/pretrain_corpus
(It should have been renamed `prepared_gutenberg.py`, and…
-
When I try to download the bookcorpus dataset, my connection keeps getting closed, and it eventually gives up:
```
Connecting to battle.shawwn.com (battle.shawwn.com)|2606:4700:3033::681b:80c6|:44…
-
## Describe the bug
When I try to concatenate 2 datasets (10GB each), the entire dataset is loaded into memory instead of being written directly to disk.
Interestingly, this happens when trying to s…
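For reference, a minimal reproduction sketch of the concatenation step, assuming the HuggingFace `datasets` API (`concatenate_datasets` and `save_to_disk`); the paths are placeholders:

```python
from datasets import load_from_disk, concatenate_datasets

# Two hypothetical on-disk datasets of ~10GB each.
dset_a = load_from_disk("/data/part_a")
dset_b = load_from_disk("/data/part_b")

combined = concatenate_datasets([dset_a, dset_b])
# The expectation is that this writes Arrow files to disk rather than
# materializing the full concatenation in RAM.
combined.save_to_disk("/data/combined")
```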
-
With your dataset, CUDA runs out of memory as soon as the trainer begins;
however, without changing any other element/parameter, just switching the dataset to `LineByLineTextDataset` makes everything OK.
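For comparison, a minimal sketch of the `LineByLineTextDataset` setup being switched to, assuming the `transformers` implementation; the tokenizer, file path, and block size are illustrative placeholders:

```python
from transformers import BertTokenizerFast, LineByLineTextDataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",   # hypothetical file with one sentence per line
    block_size=128,
)
```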
-
Here's the code I'm trying to run:
```python
dset_wikipedia = nlp.load_dataset("wikipedia", "20200501.en", split="train", cache_dir=args.cache_dir)
dset_wikipedia.drop(columns=["title"])
dset_wi…
-
I tried running the `pretrain.py` script and got this error:
```
process id: 76202
{'device': 'cuda:0', 'base_run_name': 'vanilla', 'seed': 11081, 'adam_bias_correction': False, 'schedule': 'origin…