Closed great-wind closed 6 days ago
OS: CentOS 7. Python version: Python 3.11.9. CUDA version installed on the system: NVIDIA-SMI 525.147.05, Driver Version: 525.147.05, CUDA Version: 12.0
Other installed packages:
Package Version --------------------------------- --------------- accelerate 0.30.1 aiofiles 23.2.1 aiohttp 3.9.5 aiosignal 1.3.1 altair 5.3.0 annotated-types 0.7.0 anyio 4.4.0 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 arrow 1.3.0 arxiv 2.1.0 asttokens 2.4.1 async-lru 2.0.4 attrs 23.2.0 Babel 2.15.0 beautifulsoup4 4.12.3 bleach 6.1.0 blinker 1.8.2 cachetools 5.3.3 certifi 2024.6.2 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 cmake 3.29.3 comm 0.2.2 contourpy 1.2.1 cpm-kernels 1.0.11 cycler 0.12.1 dataclasses-json 0.6.6 datasets 2.19.2 debugpy 1.8.1 decorator 5.1.1 deepspeed 0.13.1 defusedxml 0.7.1 dill 0.3.8 diskcache 5.6.3 distro 1.9.0 dnspython 2.6.1 email_validator 2.1.1 executing 2.0.1 fastapi 0.111.0 fastapi-cli 0.0.4 fastjsonschema 2.19.1 feedparser 6.0.10 ffmpy 0.3.2 filelock 3.14.0 fonttools 4.53.0 fqdn 1.5.1 frozenlist 1.4.1 fsspec 2024.3.1 gitdb 4.0.11 GitPython 3.1.43 gradio 4.32.2 gradio_client 0.17.0 greenlet 3.0.3 h11 0.14.0 hjson 3.1.0 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.23.2 idna 3.7 importlib_resources 6.4.0 interegular 0.3.3 ipykernel 6.29.4 ipython 8.25.0 ipywidgets 8.1.3 isoduration 20.11.0 jedi 0.19.1 jieba 0.42.1 Jinja2 3.1.4 joblib 1.4.2 json5 0.9.25 jsonpatch 1.33 jsonpointer 2.4 jsonschema 4.22.0 jsonschema-specifications 2023.12.1 jupyter 1.0.0 jupyter_client 8.6.2 jupyter-console 6.6.3 jupyter_core 5.7.2 jupyter-events 0.10.0 jupyter-lsp 2.2.5 jupyter_server 2.14.1 jupyter_server_terminals 0.5.3 jupyterlab 4.2.1 jupyterlab_pygments 0.3.0 jupyterlab_server 2.27.2 jupyterlab_widgets 3.0.11 kiwisolver 1.4.5 langchain 0.2.1 langchain-community 0.2.1 langchain-core 0.2.3 langchain-text-splitters 0.2.0 langchainhub 0.1.17 langsmith 0.1.68 lark 1.1.9 latex2mathml 3.77.0 llvmlite 0.42.0 lm-format-enforcer 0.10.1 loguru 0.7.2 Markdown 3.6 markdown-it-py 3.0.0 MarkupSafe 2.1.5 marshmallow 3.21.2 matplotlib 3.9.0 matplotlib-inline 0.1.7 mdtex2html 1.3.0 mdurl 0.1.2 mistune 3.0.2 mpmath 1.3.0 
msgpack 1.0.8 multidict 6.0.5 multiprocess 0.70.16 mypy-extensions 1.0.0 nbclient 0.10.0 nbconvert 7.16.4 nbformat 5.10.4 nest-asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 nltk 3.8.1 notebook 7.2.0 notebook_shim 0.2.4 numba 0.59.1 numpy 1.26.4 nvidia-cublas-cu11 11.11.3.6 nvidia-cuda-cupti-cu11 11.8.87 nvidia-cuda-nvrtc-cu11 11.8.89 nvidia-cuda-runtime-cu11 11.8.89 nvidia-cudnn-cu11 8.7.0.84 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.3.0.86 nvidia-cusolver-cu11 11.4.1.48 nvidia-cusparse-cu11 11.7.5.86 nvidia-ml-py 12.555.43 nvidia-nccl-cu11 2.20.5 nvidia-nvtx-cu11 11.8.86 openai 1.31.0 orjson 3.10.3 outlines 0.0.34 overrides 7.7.0 packaging 23.2 pandas 2.2.2 pandocfilters 1.5.1 parso 0.8.4 peft 0.11.1 pexpect 4.9.0 pillow 10.3.0 pip 24.0 platformdirs 4.2.2 prometheus_client 0.20.0 prometheus-fastapi-instrumentator 7.0.0 prompt_toolkit 3.0.45 protobuf 4.25.3 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 16.1.0 pyarrow-hotfix 0.6 pycparser 2.22 pydantic 2.7.3 pydantic_core 2.18.4 pydeck 0.9.1 pydub 0.25.1 Pygments 2.18.0 pynvml 11.5.0 pyparsing 3.1.2 python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-json-logger 2.0.7 python-multipart 0.0.9 pytz 2024.1 PyYAML 6.0.1 pyzmq 26.0.3 qtconsole 5.5.2 QtPy 2.4.1 ray 2.23.0 referencing 0.35.1 regex 2024.5.15 requests 2.32.3 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.7.1 rouge-chinese 1.0.3 rpds-py 0.18.1 ruamel.yaml 0.18.6 ruamel.yaml.clib 0.2.8 ruff 0.4.7 safetensors 0.4.3 scikit-learn 1.5.0 scipy 1.13.1 semantic-version 2.10.0 Send2Trash 1.8.3 sentence-transformers 3.0.0 sentencepiece 0.2.0 setuptools 69.5.1 sgmllib3k 1.0.0 shellingham 1.5.4 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 soupsieve 2.5 SQLAlchemy 2.0.30 sse-starlette 2.1.0 stack-data 0.6.3 starlette 0.37.2 streamlit 1.35.0 sympy 1.12.1 tenacity 8.3.0 terminado 0.18.1 threadpoolctl 3.5.0 tiktoken 0.7.0 timm 1.0.3 tinycss2 1.3.0 tokenizers 0.19.1 toml 0.10.2 tomlkit 0.12.0 toolz 0.12.1 torch 2.3.0+cu118 torchaudio 
2.3.0+cu118 torchvision 0.18.0+cu118 tornado 6.4 tqdm 4.66.4 traitlets 5.14.3 transformers 4.40.0 triton 2.3.0 typer 0.12.3 types-python-dateutil 2.9.0.20240316 types-requests 2.32.0.20240602 typing_extensions 4.12.1 typing-inspect 0.9.0 tzdata 2024.1 ujson 5.10.0 uri-template 1.3.0 urllib3 2.2.1 uvicorn 0.30.1 uvloop 0.19.0 vllm 0.4.3 vllm-flash-attn 2.5.8.post2 watchdog 4.0.1 watchfiles 0.22.0 wcwidth 0.2.13 webcolors 1.13 webencodings 0.5.1 websocket-client 1.8.0 websockets 11.0.3 wheel 0.43.0 widgetsnbextension 4.0.11 xformers 0.0.26.post1 xxhash 3.4.1 yarl 1.9.4
Who can help? No response
    data_config:
      train_file: train.json
      val_file: dev.json
      test_file: dev.json
      num_proc: 8 #16
    max_input_length: 256
    max_output_length: 512
    training_args:
      # see `transformers.Seq2SeqTrainingArguments`
      output_dir: ./output #./output
      max_steps: 1000 #3000
      # needed to be fit for the dataset
      learning_rate: 5e-5
      # settings for data loading
      per_device_train_batch_size: 8 #4
      dataloader_num_workers: 8 #16
      remove_unused_columns: false
      # settings for saving checkpoints
      save_strategy: steps
      save_steps: 20 #500
      # settings for logging
      log_level: info
      logging_strategy: steps
      logging_steps: 10
      # settings for evaluation
      per_device_eval_batch_size: 16 #16
      evaluation_strategy: steps
      eval_steps: 20 # 500
      # settings for optimizer
      # adam_epsilon: 1e-6
      # uncomment the following line to detect nan or inf values
      # debug: underflow_overflow
      predict_with_generate: true
      # see `transformers.GenerationConfig`
      generation_config:
        max_new_tokens: 512
      # set your absolute deepspeed path here
      #deepspeed: ds_zero_2.json
      # set to true if train with cpu.
      use_cpu: false
    peft_config:
      peft_type: LORA
      task_type: CAUSAL_LM
      r: 8
      lora_alpha: 32
      lora_dropout: 0.1
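With `save_strategy: steps`, the Hugging Face Trainer names each saved directory `checkpoint-<step>`, so the numbers in the checkpoint paths are training steps. A minimal sketch of which directories this config should produce, assuming no `save_total_limit` is set (none appears in the config); `expected_checkpoints` is a hypothetical helper, not part of ChatGLM3:

```python
def expected_checkpoints(max_steps: int, save_steps: int) -> list[str]:
    """Directory names written when a checkpoint is saved every `save_steps` steps."""
    return [
        f"checkpoint-{step}"
        for step in range(save_steps, max_steps + 1, save_steps)
    ]


# Values taken from the config above: max_steps: 1000, save_steps: 20.
names = expected_checkpoints(max_steps=1000, save_steps=20)
print(len(names), names[0], names[-1])  # 50 checkpoint-20 checkpoint-1000
```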
2. When running inference with finetune_demo/inference_hf.py, weights from early checkpoints work, but weights from later checkpoints do not. The test runs are recorded below. In every command the prompt is Chinese for: "You are a database developer proficient in writing SQL code; write SQL code based on the user's input. User input: what voltage levels are there in the project basic information table?"

Weights saved at checkpoint-100: inference runs normally.

```shell
# python inference_hf.py /home/ChatGLM3-20240530/finetune_demo/output/checkpoint-100/ --prompt "你是一名数据库开发人员,你精通数据库的sql代码编写,根据用户输入编写sql代码。用户输入:项目基本信息表中电压等级都有哪些"
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:05<00:00, 1.36it/s]
为了回答您的问题,我需要了解您所提到的“项目基本信息表”的结构和字段。但是,根据您提供的信息,我可以为您提供一个通用的SQL查询语句,用于检索电压等级的列名。

假设您的表名为`project_basic_info`,电压等级的列名称为`voltage_level`,您可以使用以下SQL查询语句来获取该列的所有值:

SELECT DISTINCT voltage_level FROM project_basic_info;

这条SQL查询语句将返回`project_basic_info`表中所有不同的电压等级。如果您的表结构或列名与这些不同,请提供详细信息,以便我可以为您提供更精确的查询语句。
```

(The reply, in Chinese, asks for the table schema and offers a generic query on an assumed table `project_basic_info` with a column `voltage_level`.)

Weights saved at checkpoint-500: inference runs normally.

```shell
# python inference_hf.py /home/ChatGLM3-20240530/finetune_demo/output/checkpoint-500/ --prompt "你是一名数据.开发人员,你精通数据库的sql代码编写,根据用户输入编写sql代码。用户输入:项目基本信息表中电压等级都有哪些"
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:05<00:00, 1.34it/s]
SELECT DISTINCT zjmc FROM un_szhsj.dwd_prj_sjtxjs_project_baseinfo;
```

Weights saved at checkpoint-700: inference runs normally.

```shell
# python inference_hf.py /home/GitHub/ChatGLM3-20240530/finetune_demo/output/checkpoint-700/ --prompt "你是一名数据.开发人员,你精通数据库的sql代码编写,根据用户输入编写sql代码。用户输入:项目基本信息表中电压等级都有哪些"
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:05<00:00, 1.35it/s]
SELECT COUNT(prj_tx_ztm_tx_jndztm_tx_mjx FROM un_szhsj.jwd_prj_wdjtxtx_prj_jsj_wdjdwdjwd_jxjs_bxmjtxjj WHERE zxjsj.sjddwdjsj.dxjs_djdwdbdnd== '=';
```

Weights saved at checkpoint-900: inference does not work. After the weights load, the process hangs with no further output.

```shell
# python inference_hf.py /home/ChatGLM3-20240530/finetune_demo/output/checkpoint-900/ --prompt "你是一名数据.开发人员,你精通数据库的sql代码编写,根据用户输入编写sql代码。用户输入:项目基本信息表中电压等级都有哪些"
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:05<00:00, 1.36it/s]
```
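Because the checkpoint-900 run stalls after the shards load, wrapping each run in a timeout would let several checkpoints be tested unattended. A minimal sketch using only the standard library; `run_with_timeout` is a hypothetical helper, not part of ChatGLM3, and the demo uses trivial Python child processes as stand-ins for the real `inference_hf.py` command:

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float) -> tuple[str, str]:
    """Run `cmd` and return (status, stdout), where status is "ok", "error" or "timeout"."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        # The child is killed on timeout; partial output may be None or bytes.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return "timeout", out
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stdout


# Stand-in commands: one finishes at once, one sleeps past the timeout.
quick = [sys.executable, "-c", "print('hi')"]
stuck = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(quick, timeout_s=10)[0])
print(run_with_timeout(stuck, timeout_s=0.5)[0])
```

For a real run, `cmd` would be the `python inference_hf.py <checkpoint dir> --prompt ...` invocation shown above, with a timeout comfortably longer than a healthy checkpoint needs.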
Expected behavior: inference runs normally.
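Is this the only checkpoint that misbehaves?
No, it is not only this one. It affects the weights from checkpoints near the end of training: I picked several at random and none of them ran properly. I have not pinned down which checkpoint is the first one that fails.
Hmm, this error is rare. I worked on both GLM4 and GLM3 today and did not hit it, so I cannot reproduce it.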
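Since the first failing checkpoint has not been located, a binary search over the saved steps could narrow it down in a few runs. A minimal sketch, assuming that once a checkpoint fails every later one fails too (the thread does not confirm this); `first_failing` and the predicate are hypothetical, and the threshold 820 is an arbitrary stand-in for actually running inference:

```python
from typing import Callable, Optional, Sequence


def first_failing(
    steps: Sequence[int], works: Callable[[int], bool]
) -> Optional[int]:
    """Return the smallest step in sorted `steps` for which `works` is False.

    Assumes every step after the first failing one also fails.
    Returns None when all steps work.
    """
    lo, hi = 0, len(steps)
    while lo < hi:
        mid = (lo + hi) // 2
        if works(steps[mid]):
            lo = mid + 1  # failure, if any, lies to the right
        else:
            hi = mid  # steps[mid] fails; keep looking to the left
    return steps[lo] if lo < len(steps) else None


# checkpoint-700 produced output and checkpoint-900 hung, so search between them.
candidates = range(720, 901, 20)
# Stand-in predicate: pretend checkpoints below step 820 still work.
print(first_failing(candidates, lambda step: step < 820))  # 820
```

With ten candidates this needs at most four inference runs instead of ten.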