alibaba / EasyNLP

EasyNLP: A Comprehensive and Easy-to-use NLP Toolkit
Apache License 2.0

fashionklip: GPU memory leak in the distributed training code #359

Closed fangxin2github closed 4 months ago

fangxin2github commented 4 months ago

[image] I am training with the training script provided by fashionklip:

export PYTHONPATH="$PYTHONPATH:$PWD/src"
DATAPATH=./tmp
PRETRAINED_MODEL=./tmp/pretrained_models

MASTER_ADDR=tcp://127.0.0.1:12345

python3 -u training/main_all_concept.py \
        --save-most-recent \
        --save-frequency 1 \
        --report-to tensorboard \
        --train-data="${DATAPATH}/fashiongen/train/fashion-gen_train_queries_phrases.jsonl"  \
        --train-img="${DATAPATH}/fashiongen/train/train.224.npz" \
        --txt-id-filename="${DATAPATH}/fashiongen/train/fashion-gen_concepts_queries_filtered.jsonl" \
        --kb-txt-id-filename="${DATAPATH}/fashion_kb/concepts_queries.jsonl" \
        --val-data="${DATAPATH}/fashiongen/val/fashion-gen_val_queries.jsonl"  \
        --val-img="${DATAPATH}/fashiongen/val/val.224.npz" \
        --img-data-sets="${DATAPATH}/fashion_kb/concepts_images_sample.224.npz" \
        --concept-data="${DATAPATH}/fashiongen/train/fashion-gen_concepts_queries_filtered.jsonl" \
        --kb-concept-data="${DATAPATH}/fashion_kb/concepts_queries.jsonl" \
        --resume="${PRETRAINED_MODEL}/pai-clip-commercial-base-en/pai-clip-commercial-base-en.pt" \
        --is-concept \
        --is-data-concept \
        --is-update \
        --dist-url=$MASTER_ADDR \
        --dataset-type jsonl \
        --warmup 500 \
        --batch-size=32 \
        --eval-batch-size=64 \
        --lr=1e-5 \
        --wd=0.001 \
        --epochs=20 \
        --workers=0 \
        --model ViT-B/32

GPU memory usage grows steadily with every training batch. I cannot find where the leak is myself, so any help would be appreciated.
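One way to confirm this kind of per-batch growth is to record the allocated GPU memory after each step and check whether it only ever goes up. The sketch below is a minimal illustration, not part of the EasyNLP code: `is_steadily_growing` and `current_gpu_bytes` are hypothetical helper names, and the only library call assumed is PyTorch's `torch.cuda.memory_allocated()`.

```python
def is_steadily_growing(readings, min_increase_bytes=0):
    """Return True if each reading exceeds the previous one by more than
    min_increase_bytes. `readings` is a list of byte counts, one per step."""
    if len(readings) < 2:
        return False
    return all(b - a > min_increase_bytes
               for a, b in zip(readings, readings[1:]))


def current_gpu_bytes():
    """Return the bytes currently allocated by tensors on the default GPU,
    or None if PyTorch or CUDA is not available."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.memory_allocated()


# Hypothetical usage inside a training loop:
#     history.append(current_gpu_bytes())   # after each optimizer step
#     if is_steadily_growing(history[-20:]):
#         print("allocated memory rose on each of the last 20 steps")
```

Memory normally rises during the first few steps while buffers are created, so it is the sustained growth over many steps that points to a leak.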

fangxin2github commented 4 months ago

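While debugging, I found that the memory leak occurs inside the get_loss function; this should be the spot. [screenshot 1714293467669] After switching the training mode to dp, the GPU memory leak no longer occurred. However, several variables are never defined in the else branch. [screenshot 1714293555383]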

fangxin2github commented 4 months ago

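[image] The problem is solved. I found where the GPU memory leak was and fixed it; the file I modified is ./examples/fashionklip/training/train_all_concept.py. However, I am not sure why the original code leaks memory. Here, LA comes from `from torch import linalg as LA`.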

fangxin2github commented 4 months ago

I hope someone more experienced can explain this.
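The code in the screenshots is not reproduced in this thread, so the actual cause cannot be confirmed here. One common pattern that produces this symptom in PyTorch is keeping a tensor that still carries its autograd graph alive across iterations, for example by accumulating a norm or loss into a running value. The sketch below only illustrates that general pattern on CPU; the function names are hypothetical and it is not the fashionklip code.

```python
import torch
from torch import linalg as LA


def leaky_accumulate(steps=3):
    """Accumulate a norm as a tensor: every step's graph stays reachable."""
    w = torch.ones(4, requires_grad=True)
    total = torch.zeros(())
    for _ in range(steps):
        loss = LA.norm(w * 2.0)
        # `total` now references this step's graph and all earlier ones,
        # so none of their intermediate buffers can be freed.
        total = total + loss
    return total


def safe_accumulate(steps=3):
    """Accumulate a plain Python float: each step's graph can be freed."""
    w = torch.ones(4, requires_grad=True)
    total = 0.0
    for _ in range(steps):
        loss = LA.norm(w * 2.0)
        # .item() copies the value out of the tensor and drops the graph.
        total += loss.item()
    return total
```

If a value is only needed for logging or bookkeeping, calling `.item()` or `.detach()` before storing it lets the graph from each step be released.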