Closed baisechundu closed 3 months ago
Which size and which finetuning codebase are you using?
Models: Qwen-1.8B and Qwen1.5-1.8B
Code:
https://github.com/QwenLM/Qwen1.5/blob/main/examples/sft/finetune.py
https://github.com/QwenLM/Qwen/blob/main/finetune.py
Due to hardware resource constraints, I conducted two comparison tests. Specifically, I used Qwen-1.8B as the baseline for continued pre-training, with both runs using the same Docker image (Ubuntu 20.04.6 LTS):
Here Qwen-1.8B's attention does not use any optimization; it uses the QWenAttention class from https://huggingface.co/Qwen/Qwen-1_8B/blob/main/modeling_qwen.py. Qwen1.5-1.8B uses Qwen2FlashAttention2, but it is still slower than Qwen1.
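For reference, this is how FlashAttention-2 can be requested explicitly when loading Qwen1.5 — a minimal sketch, not the exact setup used here; the model id and dtype are assumptions, and it requires transformers >= 4.36 plus a CUDA build of flash-attn:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed model id; flash-attn must be installed with CUDA support,
# and flash-attn only supports fp16/bf16 weights.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-1.8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

Note that even with this set, transformers only swaps the attention kernel; the fused RMSNorm and rotary-embedding kernels discussed below are not used by the transformers implementation.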
Based on the above results: although the model architectures of Qwen1 and Qwen1.5 are identical, a 1.5x to 2x increase in training time is still unacceptable.
Here is some information about the installed packages:
accelerate 0.27.2
aiohttp 3.9.3
aioprometheus 23.3.0
aiosignal 1.3.1
anyio 3.7.1
asttokens 2.0.5
astunparse 1.6.3
async-timeout 4.0.3
attrs 23.1.0
auto_gptq 0.7.0
backcall 0.2.0
beautifulsoup4 4.12.2
boltons 23.0.0
brotlipy 0.7.0
certifi 2023.7.22
cffi 1.15.1
chardet 4.0.0
charset-normalizer 2.0.4
click 8.0.4
coloredlogs 15.0.1
conda 23.9.0
conda-build 3.27.0
conda-content-trust 0.2.0
conda_index 0.3.0
conda-libmamba-solver 23.7.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
cryptography 41.0.3
datasets 2.17.1
decorator 5.1.1
deepspeed 0.13.4
dill 0.3.8
dnspython 2.4.2
dropout-layer-norm 0.1
einops 0.7.0
exceptiongroup 1.0.4
executing 0.8.3
expecttest 0.1.6
fastapi 0.104.1
filelock 3.9.0
flash-attn 2.5.5
frozenlist 1.4.0
fsspec 2023.9.2
gekko 1.0.6
gmpy2 2.1.2
h11 0.14.0
hjson 3.1.0
httptools 0.6.1
huggingface-hub 0.19.4
humanfriendly 10.0
hypothesis 6.87.2
idna 3.4
ipython 8.15.0
jedi 0.18.1
Jinja2 3.1.2
jsonpatch 1.32
jsonpointer 2.1
jsonschema 4.20.0
jsonschema-specifications 2023.11.2
libarchive-c 2.9
libmambapy 1.4.1
MarkupSafe 2.1.1
matplotlib-inline 0.1.6
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
more-itertools 8.12.0
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.26.0
safetensors 0.4.1
sentencepiece 0.1.99
setuptools 68.0.0
six 1.16.0
sniffio 1.3.0
sortedcontainers 2.4.0
soupsieve 2.5
stack-data 0.2.0
starlette 0.27.0
sympy 1.11.1
tiktoken 0.6.0
tokenizers 0.15.0
tomli 2.0.1
toolz 0.12.0
torch 2.1.1
torchaudio 2.1.0
torchelastic 0.2.2
torchvision 0.16.0
tqdm 4.65.0
traitlets 5.7.1
transformers 4.37.2
transformers-stream-generator 0.0.4
triton 2.1.0
truststore 0.8.0
types-dataclasses 0.6.6
typing_extensions 4.8.0
tzdata 2023.3
urllib3 1.26.16
uvicorn 0.24.0.post1
uvloop 0.19.0
watchfiles 0.21.0
wcwidth 0.2.5
websockets 12.0
wheel 0.41.2
xformers 0.0.23
xxhash 3.4.1
yarl 1.9.4
Hi, the Qwen1.0 code will automatically use flash-attention if it is installed, and it appears that flash-attn, dropout-layer-norm (for rms_norm), and triton (for apply_rotary_emb_func) all exist in your environment. The latter two optimizations are missing from the transformers implementation, which would explain the gap.
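Whether those fused-kernel packages are importable can be checked directly. A minimal sketch; the module names are assumed to match the pip packages in the list above (flash-attn, dropout-layer-norm, triton):

```python
import importlib.util

def kernel_available(module_name: str) -> bool:
    """Return True if the module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# Module names assumed from the pip package names listed above.
for name in ("flash_attn", "dropout_layer_norm", "triton"):
    print(f"{name}: {'found' if kernel_available(name) else 'missing'}")
```

If any of these print "missing", Qwen1 silently falls back to the unfused path, so this is a quick way to confirm both environments are actually comparable.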
This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
For models of the same size, the training time of Qwen1.5 is about 1.5 to 2 times that of Qwen1, even with Qwen2FlashAttention2 enabled in Qwen1.5. Can you provide any suggestions or solutions?