OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone
Apache License 2.0

[BUG] LoRA fine-tuning: grad_norm is nan, loss drops to 0 #212

Closed — tayton42 closed this 1 month ago

tayton42 commented 1 month ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

Training on the same dataset, the regular fine-tuning code works fine, but the LoRA fine-tuning code reports grad_norm as nan, and after the first step the loss drops to 0:

{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.313247675185968e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.0814483193289963e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.626495350371936e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.0492666725598114e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.394695994514964e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.6867523248140324e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.9397430255579027e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.1628966386579925e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.36251434774578e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.5430905306402606e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.707943669700932e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.859593857014607e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.130714991888808e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.252990700743872e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.367851073681039e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.47614431384396e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.578580931388297e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.675762022931747e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.768200644143028e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.856338205826227e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.940557222351561e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6.0211913448869e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6.098533345119623e-07, 'epoch': 0.0}
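As an aside (not part of the original script), a small `transformers` Trainer callback can abort the run as soon as a nan appears in the logs, instead of letting every subsequent step report loss 0.0. This is a sketch against the standard callback API, not something shipped with the repo:

```python
import math
from transformers import TrainerCallback

class NanGuard(TrainerCallback):
    """Stop training as soon as a logged loss or grad_norm is nan."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        for key in ("loss", "grad_norm"):
            value = (logs or {}).get(key)
            # Flag the run for early stopping on the first nan metric.
            if value is not None and math.isnan(value):
                control.should_training_stop = True
        return control
```

Passed via `Trainer(callbacks=[NanGuard()])`, this turns a silently broken run into an early stop that is easy to spot.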

Expected Behavior

No response

Steps To Reproduce

#!/bin/bash

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="/localpath/MiniCPM-Llama3-V-2_5" # or openbmb/MiniCPM-V-2
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/opt/cv/tianyutong/prismatic-vlms/data/download/llava-v1.5-instruct/llava_v1_5_mix665k_minicpm_allwithimage.json"
LLM_TYPE="llama3" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py  \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 true \
    --bf16_full_eval true \
    --do_train \
    --tune_vision true \
    --tune_llm false \
    --use_lora true \
    --lora_target_modules "llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj)" \
    --model_max_length 2048 \
    --max_slice_nums 9 \
    --scale_resolution 448 \
    --num_train_epochs  1 \
    --eval_steps 1000 \
    --output_dir output/output_minicpmv2_lora \
    --logging_dir output/output_minicpmv2_lora \
    --logging_strategy "steps" \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 10000 \
    --save_total_limit 10 \
    --learning_rate 1e-6 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing true \
    --deepspeed ds_config_zero2.json \
    --report_to "tensorboard" # wandb

Environment

absl-py                       2.1.0
accelerate                    0.30.1
addict                        2.4.0
aiofiles                      23.2.1
aiohttp                       3.9.5
aiosignal                     1.3.1
aliyun-python-sdk-core        2.15.1
aliyun-python-sdk-kms         2.16.2
altair                        5.3.0
annotated-types               0.6.0
antlr4-python3-runtime        4.9.3
anyio                         4.3.0
appdirs                       1.4.4
archspec                      0.2.1
ascii-magic                   2.3.0
asttokens                     2.0.5
astunparse                    1.6.3
async-timeout                 4.0.3
attrs                         23.1.0
azure-core                    1.30.1
azure-identity                1.16.0
azure-storage-blob            12.19.1
azure-storage-file-datalake   12.14.0
backcall                      0.2.0
bcrypt                        4.1.2
beautifulsoup4                4.12.2
bitsandbytes                  0.43.1
bleach                        6.1.0
blinker                       1.7.0
blis                          0.7.11
boltons                       23.0.0
boto3                         1.34.86
botocore                      1.34.86
braceexpand                   0.1.7
Brotli                        1.0.9
cachetools                    5.3.3
catalogue                     2.0.10
certifi                       2023.11.17
cffi                          1.16.0
cfgv                          3.4.0
chardet                       4.0.0
charset-normalizer            2.0.4
circuitbreaker                1.4.0
click                         8.1.7
cloudpathlib                  0.16.0
colorama                      0.4.6
conda                         23.9.0
conda-build                   3.28.1
conda-content-trust           0.2.0
conda_index                   0.3.0
conda-libmamba-solver         23.7.0
conda-package-handling        2.2.0
conda_package_streaming       0.9.0
confection                    0.1.4
contexttimer                  0.3.3
contourpy                     1.2.1
cramjam                       2.8.3
crcmod                        1.7
cryptography                  41.0.7
cycler                        0.12.1
cymem                         2.0.8
decorator                     5.1.1
decord                        0.6.0
deepspeed                     0.14.2
diffusers                     0.16.0
distlib                       0.3.8
distro                        1.8.0
dnspython                     2.4.2
docker-pycreds                0.4.0
draccus                       0.7.2
dropout-layer-norm            0.1
einops                        0.7.0
einops-exts                   0.0.4
exceptiongroup                1.0.4
executing                     0.8.3
expecttest                    0.1.6
fairscale                     0.4.4
fastapi                       0.110.1
ffmpy                         0.3.2
filelock                      3.13.1
flash-attn                    2.3.3
fonttools                     4.50.0
frozenlist                    1.4.1
fsspec                        2023.12.2
ftfy                          6.2.0
gitdb                         4.0.11
GitPython                     3.1.40
gmpy2                         2.1.2
google-api-core               2.18.0
google-auth                   2.29.0
google-cloud-core             2.4.1
google-cloud-storage          2.10.0
google-crc32c                 1.5.0
google-resumable-media        2.7.0
googleapis-common-protos      1.63.0
gradio                        3.35.2
gradio_client                 0.2.9
grpcio                        1.64.1
h11                           0.14.0
hjson                         3.1.0
httpcore                      0.17.3
httpx                         0.24.0
huggingface-hub               0.23.0
hypothesis                    6.92.0
identify                      2.5.35
idna                          3.4
imageio                       2.33.1
importlib_metadata            7.1.0
iopath                        0.1.10
ipython                       8.15.0
isodate                       0.6.1
jedi                          0.18.1
Jinja2                        3.1.2
jmespath                      0.10.0
joblib                        1.4.0
jsonlines                     4.0.0
jsonpatch                     1.32
jsonpointer                   2.1
jsonschema                    4.19.2
jsonschema-specifications     2023.7.1
kaggle                        1.6.12
kiwisolver                    1.4.5
langcodes                     3.3.0
lazy_loader                   0.4
libarchive-c                  2.9
libmambapy                    1.5.3
linkify-it-py                 2.0.3
llava                         1.1.1
lxml                          5.2.1
Markdown                      3.6
markdown-it-py                2.2.0
markdown2                     2.4.13
MarkupSafe                    2.1.1
matplotlib                    3.8.3
matplotlib-inline             0.1.6
mdit-py-plugins               0.3.3
mdurl                         0.1.2
menuinst                      2.0.1
mergedeep                     1.3.4
mkl-fft                       1.3.8
mkl-random                    1.2.4
mkl-service                   2.4.0
mmcv                          1.7.0
mmdet                         2.25.2
model-index                   0.1.11
more-itertools                10.1.0
mosaicml-streaming            0.7.5
mpmath                        1.3.0
msal                          1.28.0
msal-extensions               1.1.0
multidict                     6.0.5
murmurhash                    1.0.10
mypy-extensions               1.0.0
networkx                      3.1
ninja                         1.11.1.1
nodeenv                       1.8.0
numpy                         1.26.2
oci                           2.125.3
omegaconf                     2.3.0
openai                        1.21.2
opencv-python                 4.7.0.72
opencv-python-headless        4.5.5.64
opendatalab                   0.0.10
opendatasets                  0.1.22
openmim                       0.3.9
openxlab                      0.0.38
ordered-set                   4.1.0
orjson                        3.10.1
oss2                          2.17.0
packaging                     23.1
pandas                        2.1.4
paramiko                      3.4.0
parso                         0.8.3
peft                          0.10.0
pexpect                       4.8.0
pickleshare                   0.7.5
Pillow                        10.0.1
pip                           23.3.1
pkginfo                       1.9.6
platformdirs                  3.10.0
plotly                        5.21.0
pluggy                        1.0.0
portalocker                   2.8.2
pre-commit                    3.7.0
preshed                       3.0.9
prismatic                     0.0.1
prompt-toolkit                3.0.36
proto-plus                    1.23.0
protobuf                      4.25.1
psutil                        5.9.0
ptyprocess                    0.7.0
pure-eval                     0.2.2
py-cpuinfo                    9.0.0
pyarrow                       15.0.2
pyasn1                        0.6.0
pyasn1_modules                0.4.0
pycocoevalcap                 1.2
pycocotools                   2.0.7
pycosat                       0.6.6
pycparser                     2.21
pycryptodome                  3.20.0
pydantic                      1.10.14
pydantic_core                 2.18.1
pydeck                        0.8.1b0
pydub                         0.25.1
Pygments                      2.15.1
PyJWT                         2.8.0
pymongo                       4.6.3
PyNaCl                        1.5.0
pynvml                        11.5.0
pyOpenSSL                     23.2.0
pyparsing                     3.1.2
PySocks                       1.7.1
python-dateutil               2.8.2
python-etcd                   0.4.5
python-magic                  0.4.27
python-multipart              0.0.9
python-slugify                8.0.4
python-snappy                 0.7.1
pytz                          2023.3.post1
PyYAML                        6.0.1
pyyaml-include                1.4.1
referencing                   0.30.2
regex                         2023.12.25
requests                      2.28.2
rich                          13.4.2
rpds-py                       0.10.6
rsa                           4.9
ruamel.yaml                   0.17.21
ruamel.yaml.clib              0.2.6
s3transfer                    0.10.1
safetensors                   0.4.1
salesforce-lavis              1.0.1
scikit-image                  0.23.1
scikit-learn                  1.4.2
scipy                         1.13.0
semantic-version              2.10.0
sentencepiece                 0.2.0
sentry-sdk                    1.39.1
setproctitle                  1.3.3
setuptools                    60.2.0
shortuuid                     1.0.13
six                           1.16.0
smart-open                    6.4.0
smmap                         5.0.1
sniffio                       1.3.1
sortedcontainers              2.4.0
soupsieve                     2.5
spacy                         3.7.3
spacy-legacy                  3.0.12
spacy-loggers                 1.0.5
srsly                         2.4.8
stack-data                    0.2.0
starlette                     0.37.2
streamlit                     1.33.0
svgwrite                      1.4.3
sympy                         1.12
tabulate                      0.9.0
tenacity                      8.2.3
tensorboard                   2.16.2
tensorboard-data-server       0.7.2
terminaltables                3.1.10
text-unidecode                1.3
thinc                         8.2.3
threadpoolctl                 3.4.0
tifffile                      2024.4.18
tiktoken                      0.5.2
timm                          0.9.16
tokenizers                    0.19.1
toml                          0.10.2
tomli                         2.0.1
toolz                         0.12.0
torch                         2.1.2
torchaudio                    2.1.2
torchelastic                  0.2.2
torchvision                   0.16.2
tornado                       6.4
tqdm                          4.65.0
traitlets                     5.7.1
transformers                  4.41.0
transformers-stream-generator 0.0.4
triton                        2.1.0
truststore                    0.8.0
typer                         0.9.4
types-dataclasses             0.6.6
typing_extensions             4.11.0
typing-inspect                0.9.0
tzdata                        2023.4
uc-micro-py                   1.0.3
urllib3                       1.26.18
uvicorn                       0.29.0
virtualenv                    20.25.3
vlm_eval                      0.0.1
wandb                         0.16.6
wasabi                        1.1.2
watchdog                      4.0.0
wavedrom                      2.0.3.post3
wcwidth                       0.2.13
weasel                        0.3.4
webdataset                    0.2.86
webencodings                  0.5.1
websockets                    12.0
Werkzeug                      3.0.3
wheel                         0.41.2
xxhash                        3.4.1
yapf                          0.40.2
yarl                          1.9.4
zipp                          3.18.1
zstandard                     0.19.0
zstd                          1.5.5.1

Anything else?

No response

tayton42 commented 1 month ago

After testing: if I set tune_vision to false, training works normally.

qyc-98 commented 1 month ago

This is because LoRA's get_peft_model automatically sets requires_grad to False on every part of the model except the LoRA adapters, so those parts cannot participate in training. tune_vision makes the resampler and vpm trainable, which is likely helpful for your fine-tuning. We will update the code to train the resampler by default; based on our experiments, we recommend training the resampler in most cases, while the vpm can be enabled or disabled as needed.

qyc-98 commented 1 month ago

Hello. Regarding LoRA fine-tuning: we are about to release an updated version of the code that fixes several existing issues, and we suggest re-running LoRA fine-tuning after updating. The main fix is that after LoRA fine-tuning, the model's vision parameters were not saved correctly, which invalidated your training; we sincerely apologize. Please refer to the latest LoRA loading instructions in the readme.md under the finetune directory. Thank you for your support.

LongIslandWithoutIceTea commented 1 month ago

> After testing: if I set tune_vision to false, training works normally.

With the latest code, tune_vision=true still seems to hit the same nan problem. Have you managed to resolve it on your end?