Open xingfeng01 opened 3 years ago
Maybe you can try to change this line? https://github.com/bytedance/lightseq/blob/caa86a2f25d766b10f6a3865fd969933d17a697d/examples/training/huggingface/run_ner.sh#L26
I already downloaded the dataset to disk using https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh. I am wondering how to use that data? Thanks!
I got the following error with the "wikicorpus" dataset.
CMD:

```shell
python3.7 -m torch.distributed.launch \
    --nproc_per_node=1 \
    $THIS_DIR/run_ner.py \
    --model_name_or_path bert-large-uncased \
    --dataset_name wikicorpus \
    --dataset_config_name raw_en \
    --output_dir ./test-ner-no-wikicorpus \
    --cache_dir ./cache-wikicorpus \
    --do_train \
    --do_eval \
    --num_train_epochs 1
```
Error message:

```
Downloading and preparing dataset wikicorpus/raw_en (download: 1.25 GiB, generated: 3.16 GiB, post-processed: Unknown size, total: 4.41 GiB) to ./cache-wikicorpus/wikicorpus/raw_en/0.0.0/8665d716c08f102e87fdbb711326cbdf12c7ce810962819f1c71ca294d722774...
Downloading: 100%|████████████████████████████████| 1.35G/1.35G [15:12<00:00, 1.48MB/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1103, in _prepare_split
    writer.write(example, key)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 342, in write
    self.check_duplicate_keys()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 353, in check_duplicate_keys
    raise DuplicatedKeysError(key)
datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 519
Keys should be unique and deterministic in nature

During handling of the above exception, another exception occurred:
```
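For context, the uniqueness check that raises this error can be sketched roughly as follows (a simplified illustration, not the actual `datasets.arrow_writer` implementation): each generated example carries a key, and the writer rejects the whole dataset build if any key repeats.

```python
class DuplicatedKeysError(Exception):
    """Simplified stand-in for datasets.keyhash.DuplicatedKeysError."""
    pass


def check_duplicate_keys(examples):
    """Collect examples, raising if any key appears more than once
    (a sketch of the uniqueness check the arrow writer performs)."""
    seen = set()
    out = []
    for key, example in examples:
        if key in seen:
            raise DuplicatedKeysError(f"Found duplicate Key: {key}")
        seen.add(key)
        out.append(example)
    return out
```

With the wikicorpus bug, two examples were generated with the same key (519 in the traceback above), so the check fails and the dataset is never written.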
It seems to be a dataset error, not a lightseq error. Anyway, you can try deleting the cache.
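Deleting the cache just means removing the local cache directory so the next run re-downloads and regenerates the dataset. A minimal sketch, assuming the `--cache_dir` from the command above (the helper name is mine, not a `datasets` API):

```python
import pathlib
import shutil


def clear_dataset_cache(cache_dir: str) -> bool:
    """Remove a local dataset cache directory so the next run rebuilds
    the dataset from scratch. Returns True if anything was deleted."""
    path = pathlib.Path(cache_dir)
    if path.exists():
        shutil.rmtree(path)
        return True
    return False


# e.g. the --cache_dir passed on the command line above:
clear_dataset_cache("./cache-wikicorpus")
```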
I tried several times, but the error is same, could you have a look ?
You can check https://github.com/huggingface/datasets/issues/2552
Hi! Thanks for reporting this issue with wikicorpus. We implemented a fix in https://github.com/huggingface/datasets/pull/2844
Hi! I ran the example:

```shell
sh examples/training/huggingface/run_ner.sh
```

but got the error:
RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 15.78 GiB total capacity; 7.21 GiB already allocated; 45.75 MiB free; 7.53 GiB reserved in total by PyTorch)
How can I solve it?
When I set per_device_train_batch_size=1, I get the error:
```
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/lightseq/training/ops/pytorch/transformer_encoder_layer.py", line 288, in forward
    assert bs == encoder_padding_mask.size(0) and sl == encoder_padding_mask.size(1)
AssertionError
```
This error is caused by a padding mask of the wrong shape; in any case, batch_size=1 is unusual. If you encounter GPU out-of-memory errors, then besides decreasing your batch_size, remember to set a smaller max_batch_tokens in the lightseq layer config, since it also influences GPU memory usage.
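The limit that max_batch_tokens enforces can be sketched like this (a simplified illustration of the batch-token check quoted in the next comment, not lightseq's actual code): the batch size times the sequence length must not exceed the configured cap.

```python
def check_batch_tokens(bs: int, sl: int, max_batch_tokens: int) -> None:
    """Reject a batch whose total token count (batch size x sequence
    length) exceeds the configured max_batch_tokens limit."""
    if bs * sl > max_batch_tokens:
        raise ValueError(
            f"Batch token numbers {bs * sl} exceeds the limit {max_batch_tokens}."
        )
```

Lowering max_batch_tokens therefore saves memory, but only as long as no real batch crosses the new limit.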
When I set max_batch_tokens=1024 in ls_hf_transformer_encoder_layer.py, I still get the following error:
f"Batch token numbers {bs * sl} exceeds the limit {self.config.max_batch_tokens}."
ValueError: Batch token numbers 1344 exceeds the limit 1024.
Is this the parameter I should change?
Yes, that is the parameter you should change according to your training data size, but be careful: raising it may increase GPU memory usage.
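As a rough rule of thumb (my sketch, not a lightseq API), max_batch_tokens should be at least batch_size times the longest sequence your dataloader can produce:

```python
def required_max_batch_tokens(batch_size: int, max_seq_length: int) -> int:
    """Smallest max_batch_tokens that admits every batch, assuming no batch
    holds more than batch_size sequences of max_seq_length tokens each."""
    return batch_size * max_seq_length


# The failing batch above contained 1344 tokens, e.g. 8 sequences of
# 168 tokens, which exceeds the previous limit of 1024:
required_max_batch_tokens(8, 168)
```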
Hi, I want to run the examples/pytorch/token-classification training example with the wiki dataset (already downloaded), and I am wondering what parameters should be used. Thanks.