bytedance / lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

How to run examples/pytorch/token-classification with dataset #99

Open xingfeng01 opened 3 years ago

xingfeng01 commented 3 years ago

Hi, I want to run the examples/pytorch/token-classification training example with the wiki dataset (already downloaded). I am wondering what parameters should be used. Thanks.

Taka152 commented 3 years ago

Maybe you can try to change this line? https://github.com/bytedance/lightseq/blob/caa86a2f25d766b10f6a3865fd969933d17a697d/examples/training/huggingface/run_ner.sh#L26
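
For reference, swapping in a different dataset ends up looking roughly like this in the launch command (a sketch only; run_ner.py is assumed to accept the standard HuggingFace --dataset_name / --dataset_config_name arguments, and the dataset and output paths are illustrative):

    # sketch: the dataset is selected by the flags forwarded to run_ner.py
    python3 -m torch.distributed.launch --nproc_per_node=1 \
        run_ner.py \
        --model_name_or_path bert-large-uncased \
        --dataset_name wikicorpus \
        --dataset_config_name raw_en \
        --output_dir ./test-ner-wikicorpus \
        --do_train \
        --do_eval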

xingfeng01 commented 3 years ago

I already downloaded the dataset to disk using https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh. I am wondering how to use that data. Thanks!

xingfeng01 commented 3 years ago

I got the following error with the dataset "wikicorpus".

CMD:

    python3.7 -m torch.distributed.launch \
        --nproc_per_node=1 \
        $THIS_DIR/run_ner.py \
        --model_name_or_path bert-large-uncased \
        --dataset_name wikicorpus \
        --dataset_config_name raw_en \
        --output_dir ./test-ner-no-wikicorpus \
        --cache_dir ./cache-wikicorpus \
        --do_train \
        --do_eval \
        --num_train_epochs 1

Error message:

    Downloading and preparing dataset wikicorpus/raw_en (download: 1.25 GiB, generated: 3.16 GiB, post-processed: Unknown size, total: 4.41 GiB) to ./cache-wikicorpus/wikicorpus/raw_en/0.0.0/8665d716c08f102e87fdbb711326cbdf12c7ce810962819f1c71ca294d722774...
    Downloading: 100% 1.35G/1.35G [15:12<00:00, 1.48MB/s]
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1103, in _prepare_split
        writer.write(example, key)
      File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 342, in write
        self.check_duplicate_keys()
      File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 353, in check_duplicate_keys
        raise DuplicatedKeysError(key)
    datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
    Found duplicate Key: 519
    Keys should be unique and deterministic in nature

    During handling of the above exception, another exception occurred:

Taka152 commented 3 years ago

This seems to be a dataset error, not a LightSeq error. In any case, you can try deleting the cache.
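
For example, with the cache directory used in the command above, clearing the partially generated dataset would look something like this (the path is the one from this thread; adjust it to your own --cache_dir):

    # remove the partially built wikicorpus cache so datasets regenerates it
    rm -rf ./cache-wikicorpus/wikicorpus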

xingfeng01 commented 3 years ago

I tried several times, but the error is the same. Could you have a look?

Taka152 commented 3 years ago

> I tried several times, but the error is the same. Could you have a look?

You can check https://github.com/huggingface/datasets/issues/2552

lhoestq commented 3 years ago

Hi! Thanks for reporting this issue with wikicorpus; we implemented a fix in https://github.com/huggingface/datasets/pull/2844
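
Once a datasets release containing that fix is installed, regenerating the dataset from a clean cache should be enough (a sketch; the cache path is the one used earlier in this thread):

    # upgrade datasets to pick up the wikicorpus fix, then rebuild from a clean cache
    pip install -U datasets
    rm -rf ./cache-wikicorpus/wikicorpus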

zkh2016 commented 3 years ago

Hi! I ran the example:

    sh examples/training/huggingface/run_ner.sh

but got this error:

    RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 15.78 GiB total capacity; 7.21 GiB already allocated; 45.75 MiB free; 7.53 GiB reserved in total by PyTorch)

How can I solve it?

zkh2016 commented 3 years ago

When I set per_device_train_batch_size=1, I get this error:

 File "/usr/local/python3.7.0/lib/python3.7/site-packages/lightseq/training/ops/pytorch/transformer_encoder_layer.py", line 288, in forward
    assert bs == encoder_padding_mask.size(0) and sl == encoder_padding_mask.size(1)
AssertionError
Taka152 commented 2 years ago

This error is caused by a wrong padding mask shape; in any case, batch_size=1 is unusual. If you run out of GPU memory, besides decreasing your batch size, remember to set a smaller max_batch_tokens in the LightSeq layer config, since it also affects GPU memory usage.
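
As a rough illustration, trimming memory usage from the command line could look like this (a sketch; --per_device_train_batch_size and --max_seq_length are standard arguments of the HuggingFace run_ner.py script, the dataset name is illustrative, and max_batch_tokens itself is set inside the LightSeq layer config in ls_hf_transformer_encoder_layer.py rather than on the command line):

    # sketch: reduce per-GPU memory for the NER example
    python3 -m torch.distributed.launch --nproc_per_node=1 \
        run_ner.py \
        --model_name_or_path bert-large-uncased \
        --dataset_name conll2003 \
        --output_dir ./test-ner \
        --do_train \
        --do_eval \
        --per_device_train_batch_size 8 \
        --max_seq_length 128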

zkh2016 commented 2 years ago

When I set max_batch_tokens=1024 in ls_hf_transformer_encoder_layer.py (https://github.com/bytedance/lightseq/blob/master/examples/training/huggingface/ls_hf_transformer_encoder_layer.py#L21), I still get the following error:

   f"Batch token numbers {bs * sl} exceeds the limit {self.config.max_batch_tokens}."
ValueError: Batch token numbers 1344 exceeds the limit 1024.

Is this the right value to change?

Taka152 commented 2 years ago

Yes, that is the parameter you should change according to your training data size, but be careful that raising it may increase GPU memory usage.
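
Concretely, max_batch_tokens has to be at least the per-device batch size times the padded sequence length of the largest batch; a quick check with illustrative numbers (chosen so the product matches the 1344 in the error above) could look like:

    # sanity-check the token budget against the batch shape (illustrative values)
    BATCH_SIZE=8     # per_device_train_batch_size
    SEQ_LEN=168      # padded sequence length of the largest batch
    echo "need max_batch_tokens >= $((BATCH_SIZE * SEQ_LEN))"   # 1344 > 1024, hence the error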
