OOM at the fifth training step:
Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 9.28 GiB. GPU 3 has a total capacity of 63.98 GiB of which 7.13 GiB is free. Of the allocated memory 45.88 GiB is allocated by PyTorch, and 7.92 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF.
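For reference, a minimal sketch of the allocator hint from the error message, assuming a ROCm build of PyTorch (hence the `_HIP_` variable; CUDA builds use `PYTORCH_CUDA_ALLOC_CONF`). The 128 value is an illustrative assumption, not a verified fix:

```bash
# Blocks larger than max_split_size_mb are never split by the caching
# allocator, so a small value such as 128 keeps large contiguous blocks
# intact for big requests like the 9.28 GiB allocation in the traceback.
# The 4096 used in the repro still lets blocks of up to 4 GiB be split,
# which does little against fragmentation.
export PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:128
```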
Hi, have you solved this? I hit OOM as well with 8× 4090 GPUs, LoRA, and an int4 model.
Hi, swift is not maintained by us directly; we recommend using the official fine-tuning code instead.
Is there an existing issue / discussion for this?
Is there an existing answer for this in the FAQ?
Current Behavior
When fine-tuning with the swift framework, total GPU memory usage reached 245 GB. Training arguments: --sft_type lora --gradient_accumulation_steps 4 --tuner_backend peft --target_modules DEFAULT. Launch environment: HIP_VISIBLE_DEVICES=0,1,2,4 PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:4096
Expected Behavior
No response
Steps To Reproduce
HIP_VISIBLE_DEVICES=0,1,2,4 PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:4096 swift sft --model_type minicpm-v-v2_6-chat --dataset data1.jsonl --dataset_test_ratio 0.1 --sft_type lora --learning_rate 1e-4 --num_train_epochs 5 --model_id_or_path MiniCPM-V-2_6/ --gradient_accumulation_steps 4 --tuner_backend peft
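If memory is the bottleneck, a lower-memory variant of the same launch might look like the sketch below. Flags copied from the repro are unchanged; --batch_size, --max_length, and --gradient_checkpointing are assumptions about the ms-swift CLI and should be checked against the installed version:

```bash
# Sketch only: smaller allocator split threshold, explicit batch size of 1,
# a capped sequence length, and gradient checkpointing to trade compute
# for activation memory. Verify flag names with `swift sft --help`.
HIP_VISIBLE_DEVICES=0,1,2,4 \
PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:128 \
swift sft \
  --model_type minicpm-v-v2_6-chat \
  --model_id_or_path MiniCPM-V-2_6/ \
  --dataset data1.jsonl \
  --dataset_test_ratio 0.1 \
  --sft_type lora \
  --learning_rate 1e-4 \
  --num_train_epochs 5 \
  --gradient_accumulation_steps 4 \
  --tuner_backend peft \
  --batch_size 1 \
  --max_length 2048 \
  --gradient_checkpointing true
```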
Environment
Anything else?
No response