
[Performance] Speed comparison between GluonNLP and other packages #1436

Open sxjscience opened 3 years ago

sxjscience commented 3 years ago

Description

Similar to the efforts in our recently added benchmarking script (https://github.com/dmlc/gluon-nlp/tree/master/scripts/benchmarks), we are also interested in comparing the end-to-end training speed of NLP models between GluonNLP and other packages. This helps us track the performance of GluonNLP and measure the out-of-the-box speed of different toolkits. (It also serves the goal of democratizing NLP for everyone.) @ZheyuYe has helped try out huggingface/transformers to see its performance on a g4dn.12xlarge instance.

Huggingface command:

export SQUAD_DIR=/home/ubuntu/squad
python3 -m torch.distributed.launch --nproc_per_node=4 ./examples/question-answering/run_squad.py \
    --model_type albert \
    --model_name_or_path albert-base-v2 \
    --do_train \
    --do_eval \
    --version_2_with_negative \
    --train_file $SQUAD_DIR/train-v2.0.json \
    --predict_file $SQUAD_DIR/dev-v2.0.json \
    --learning_rate 3e-5 \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --num_train_epochs 3 \
    --warmup_ratio 0.1 \
    --max_seq_length 512 \
    --doc_stride 128 \
    --output_dir ./examples/models/albert-base-v2_finetuned_squad2.0/ \
    --per_gpu_eval_batch_size=24   \
    --per_gpu_train_batch_size=12   \
    --gradient_accumulation_steps=1 \
    --overwrite_cache \
    --threads 8 \
    --overwrite_output_dir

The whole log is available at https://gist.github.com/sxjscience/9ef5c957bb4447e8fd35ccd4d96328f0. From the log, the Huggingface training and evaluation start at 11/15/2020 14:36:09 and finish at 11/15/2020 17:36:20, which is roughly 3 hours.
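As a sanity check, the durations quoted in this thread can be recomputed directly from the log timestamps, e.g. with GNU date for the Huggingface run:

# Elapsed wall-clock time of the Huggingface run, from the log timestamps above (GNU date).
start=$(date -d "2020-11-15 14:36:09" +%s)
end=$(date -d "2020-11-15 17:36:20" +%s)
echo "$(( (end - start) / 60 )) minutes"   # prints: 180 minutes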

The training log of albert-base in GluonNLP is attached here (also see the question answering examples in https://github.com/dmlc/gluon-nlp/tree/master/scripts/question_answering). It started at 2020-11-05 15:45:33,862 and finished at 2020-11-05 17:57:04,681, which is roughly 2 hours and 12 minutes. Thus, the QA implementation in GluonNLP is somewhat faster than the one in Huggingface. However, the comparison is not entirely fair, since the two implementations preprocess the training samples differently.
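For reproducibility, the GluonNLP run is driven by scripts/question_answering/run_squad.py. A command along the following lines matches the flags documented in that folder's README; treat the exact values as indicative rather than a verbatim copy of the run behind the log above:

# Indicative GluonNLP ALBERT-base SQuAD 2.0 finetuning command (flags per the
# scripts/question_answering README; values are illustrative, not the logged run).
python3 run_squad.py \
    --model_name google_albert_base_v2 \
    --data_dir squad \
    --output_dir finetune_albert_base_squad_2.0 \
    --version 2.0 \
    --do_eval \
    --do_train \
    --batch_size 4 \
    --num_accumulated 3 \
    --gpus 0,1,2,3 \
    --epochs 3 \
    --lr 2e-5 \
    --warmup_ratio 0.1 \
    --wd 0.01 \
    --max_seq_length 512 \
    --max_grad_norm 0.1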

We may try to maintain our own benchmark of end-to-end training performance and extend the comparison to other packages like DeepSpeed. I opened this issue to track the status.

@dmlc/gluon-nlp-team

zheyuye commented 3 years ago

With the following command, I re-ran the above experiment with fp16:

export SQUAD_DIR=/home/ubuntu/squad
python3 -m torch.distributed.launch --nproc_per_node=4 ./examples/question-answering/run_squad.py \
    --model_type albert \
    --model_name_or_path albert-base-v2 \
    --do_train \
    --do_eval \
    --version_2_with_negative \
    --train_file $SQUAD_DIR/train-v2.0.json \
    --predict_file $SQUAD_DIR/dev-v2.0.json \
    --learning_rate 3e-5 \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --num_train_epochs 3 \
    --warmup_ratio 0.1 \
    --max_seq_length 512 \
    --doc_stride 128 \
    --output_dir ./examples/models/albert-base-v2_finetuned_squad2.0-fp16/ \
    --per_gpu_eval_batch_size=24   \
    --per_gpu_train_batch_size=12   \
    --gradient_accumulation_steps=1 \
    --overwrite_cache \
    --threads 8 \
    --overwrite_output_dir \
    --fp16

The training process took roughly 100 minutes, from 11/25/2020 17:09:34 to 11/25/2020 18:50:03. Compared to the original 3 hours this saves a lot of time, but it also loses some accuracy: the final evaluation result is EM/F1 = 75.93/79.42.

See the whole log for the details.

szha commented 3 years ago

@ZheyuYe thanks for the update. So with fp16 the training is about 80 minutes faster (180m -> 100m), while the evaluation performance drops by roughly 3-4 points (79.7/82.6 -> 75.9/79.4). The loss is somewhat higher than expected for fp16 training.

sxjscience commented 3 years ago

Note that these are the Huggingface results. Maybe we are not calling Huggingface correctly.
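One knob worth double-checking on the Huggingface side is the apex opt level: the legacy run_squad.py also exposes --fp16_opt_level (default "O1"), and pinning it explicitly would rule out an overly aggressive mixed-precision configuration. A minimal sketch, reusing the flags from the fp16 command above (the output directory name is just a placeholder):

export SQUAD_DIR=/home/ubuntu/squad
# Identical to the fp16 command above, with the apex opt level pinned explicitly.
python3 -m torch.distributed.launch --nproc_per_node=4 ./examples/question-answering/run_squad.py \
    --model_type albert --model_name_or_path albert-base-v2 \
    --do_train --do_eval --version_2_with_negative \
    --train_file $SQUAD_DIR/train-v2.0.json --predict_file $SQUAD_DIR/dev-v2.0.json \
    --learning_rate 3e-5 --weight_decay 0.01 --max_grad_norm 1.0 \
    --num_train_epochs 3 --warmup_ratio 0.1 --max_seq_length 512 --doc_stride 128 \
    --per_gpu_train_batch_size=12 --per_gpu_eval_batch_size=24 \
    --threads 8 --overwrite_cache --overwrite_output_dir \
    --output_dir ./examples/models/albert-base-v2_finetuned_squad2.0-fp16-O1/ \
    --fp16 --fp16_opt_level O1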


sxjscience commented 3 years ago

Regarding the comparison with DeepSpeed, I recently created this markdown document; we can follow the same setup to run DeepSpeed-accelerated BERT-large:

https://github.com/sxjscience/DeepSpeedExamples/tree/master/BingBertSquad
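From memory of that repo's README, the DeepSpeed fine-tuning is launched through a wrapper script with four positional arguments (GPU count, checkpoint, SQuAD directory, output directory); treat the script name and paths below as placeholders to be verified against the README:

# Hedged sketch of the BingBertSquad DeepSpeed launch; verify against the README.
bash run_squad_deepspeed.sh 4 \
    /path/to/bert_large_checkpoint.pt \
    /home/ubuntu/squad \
    ./outputs/bert-large-squad-deepspeed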

DOUDOU0314 commented 3 years ago

I compared the SQuAD 1.1 finetuning speed of GluonNLP vs. DeepSpeed:

https://gist.github.com/DOUDOU0314/01ea3e74b255d302705ff4b77744a72d