QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.

Compared to qwen1, the training speed of qwen1.5 is too slow #251

Closed · baisechundu closed this issue 3 months ago

baisechundu commented 6 months ago

For models of the same size, the training time of Qwen1.5 is about 1.5 to 2 times that of Qwen1, even when Qwen2FlashAttention2 is enabled for Qwen1.5. Can you provide any suggestions or solutions?
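For context, here is a minimal sketch of how the flash-attention-2 backend is usually selected when loading Qwen1.5 through transformers 4.37; the checkpoint path and dtype below are illustrative assumptions, not details from this issue:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path; any Qwen1.5 checkpoint would work the same way.
model_path = "Qwen/Qwen1.5-1.8B"

# flash-attention-2 requires fp16/bf16 weights; "attn_implementation" is the
# transformers 4.37 way to pick the attention backend (flash_attention_2 / sdpa / eager).
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Sanity check: which attention implementation the loaded config reports.
print(model.config._attn_implementation)
```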

JustinLin610 commented 5 months ago

Which model size and which finetuning codebase are you using?

baisechundu commented 5 months ago

> Which model size and which finetuning codebase are you using?

Model: Qwen-1.8B and Qwen1.5-1.8B
Code:

Due to hardware resource constraints, I conducted two comparison tests. Specifically, I used Qwen-1.8B as the baseline for continued pre-training, with the same Docker image (Ubuntu 20.04.6 LTS) in both cases:

  1. The same code running different model series: I made slight changes to the data-processing part of Qwen1's SFT code so that it runs both Qwen-1.8B and Qwen1.5-1.8B, but the forward and backward time of each step for Qwen1.5-1.8B is significantly longer than for Qwen-1.8B.

Here Qwen-1.8B's attention does not use any optimization; it uses https://huggingface.co/Qwen/Qwen-1_8B/blob/main/modeling_qwen.py#:~:text=class%20QWenAttention(nn.Module) . Qwen1.5-1.8B uses Qwen2FlashAttention2, but it is still slower than Qwen1.

  2. Different code running the same model: for Qwen1.5-1.8B, the time taken for one step is as follows (a timing sketch follows this list):
    • qwen1_code(QWenAttention): 0.7 seconds
    • Qwen1.5_code(flash-attn2): 0.95 seconds
    • Qwen1.5_code(sdpa): 1.12 seconds
    • Qwen1.5_code(eager): 2.6 seconds; GPU memory usage is also much higher, about 1.5 times that of sdpa
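A small timing harness along the following lines can measure per-step forward+backward time; the batch shape, sequence length, and bfloat16 dtype are assumed for illustration, not values reported in this issue:

```python
import time
import torch
from transformers import AutoModelForCausalLM

def time_one_step(model, input_ids, n_warmup=3, n_iters=10):
    """Average forward+backward time per step, in seconds."""
    labels = input_ids.clone()
    for i in range(n_warmup + n_iters):
        if i == n_warmup:               # start timing only after warm-up iterations
            torch.cuda.synchronize()
            start = time.perf_counter()
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

# Hypothetical setup: batch of 4 sequences of length 2048 on one GPU.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-1.8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # swap for "sdpa" or "eager" to compare
).cuda()
input_ids = torch.randint(0, model.config.vocab_size, (4, 2048), device="cuda")
print(f"avg step time: {time_one_step(model, input_ids):.2f} s")
```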

Based on the above results, although the model architectures of Qwen1 and Qwen1.5 are consistent, a 1.5x to 2x increase in training time is still unacceptable.

Here is some information about the installed packages:

accelerate 0.27.2
aiohttp 3.9.3
aioprometheus 23.3.0
aiosignal 1.3.1
anyio 3.7.1
asttokens 2.0.5
astunparse 1.6.3
async-timeout 4.0.3
attrs 23.1.0
auto_gptq 0.7.0
backcall 0.2.0
beautifulsoup4 4.12.2
boltons 23.0.0
brotlipy 0.7.0
certifi 2023.7.22
cffi 1.15.1
chardet 4.0.0
charset-normalizer 2.0.4
click 8.0.4
coloredlogs 15.0.1
conda 23.9.0
conda-build 3.27.0
conda-content-trust 0.2.0
conda_index 0.3.0
conda-libmamba-solver 23.7.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
cryptography 41.0.3
datasets 2.17.1
decorator 5.1.1
deepspeed 0.13.4
dill 0.3.8
dnspython 2.4.2
dropout-layer-norm 0.1
einops 0.7.0
exceptiongroup 1.0.4
executing 0.8.3
expecttest 0.1.6
fastapi 0.104.1
filelock 3.9.0
flash-attn 2.5.5
frozenlist 1.4.0
fsspec 2023.9.2
gekko 1.0.6
gmpy2 2.1.2
h11 0.14.0
hjson 3.1.0
httptools 0.6.1
huggingface-hub 0.19.4
humanfriendly 10.0
hypothesis 6.87.2
idna 3.4
ipython 8.15.0
jedi 0.18.1
Jinja2 3.1.2
jsonpatch 1.32
jsonpointer 2.1
jsonschema 4.20.0
jsonschema-specifications 2023.11.2
libarchive-c 2.9
libmambapy 1.4.1
MarkupSafe 2.1.1
matplotlib-inline 0.1.6
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
more-itertools 8.12.0
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.26.0
safetensors 0.4.1
sentencepiece 0.1.99
setuptools 68.0.0
six 1.16.0
sniffio 1.3.0
sortedcontainers 2.4.0
soupsieve 2.5
stack-data 0.2.0
starlette 0.27.0
sympy 1.11.1
tiktoken 0.6.0
tokenizers 0.15.0
tomli 2.0.1
toolz 0.12.0
torch 2.1.1
torchaudio 2.1.0
torchelastic 0.2.2
torchvision 0.16.0
tqdm 4.65.0
traitlets 5.7.1
transformers 4.37.2
transformers-stream-generator 0.0.4
triton 2.1.0
truststore 0.8.0
types-dataclasses 0.6.6
typing_extensions 4.8.0
tzdata 2023.3
urllib3 1.26.16
uvicorn 0.24.0.post1
uvloop 0.19.0
watchfiles 0.21.0
wcwidth 0.2.5
websockets 12.0
wheel 0.41.2
xformers 0.0.23
xxhash 3.4.1
yarl 1.9.4
jklj077 commented 5 months ago

Hi, the Qwen1.0 code will automatically use flash-attention if it is installed, and it appears that flash-attn, dropout-layer-norm (for rms_norm), and triton (for apply_rotary_emb_func) all exist in your environment. The latter two optimizations are missing from the transformers implementation.
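Roughly, the Qwen1 modeling code imports these fused kernels opportunistically and falls back to plain PyTorch when the extensions are absent. The sketch below illustrates that pattern; it is a simplified illustration (including a simplified RMSNorm), not the exact Qwen1 source:

```python
import torch

# Opportunistic imports: prefer the fused CUDA kernels shipped with flash-attn
# when they are installed, otherwise fall back to plain PyTorch.
try:
    from flash_attn.layers.rotary import apply_rotary_emb_func  # fused rotary kernel
except ImportError:
    apply_rotary_emb_func = None

try:
    from flash_attn.ops.rms_norm import rms_norm  # backed by the dropout-layer-norm extension
except ImportError:
    rms_norm = None

class RMSNorm(torch.nn.Module):
    """Simplified RMSNorm that uses the fused kernel when available."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if rms_norm is not None and x.is_cuda:
            return rms_norm(x, self.weight, self.eps)
        # Pure-PyTorch fallback: normalize in fp32, then cast back.
        output = x.float() * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * output.to(x.dtype)
```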

github-actions[bot] commented 3 months ago

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.