THUDM / ChatGLM3

ChatGLM3 series: Open Bilingual Chat LLMs
Apache License 2.0

RuntimeError: Tensors must be CUDA and dense #808

Closed: sevenandseven closed this issue 9 months ago

sevenandseven commented 9 months ago

System Info

absl-py==2.1.0 accelerate==0.25.0 aiofiles==23.2.1 aiohttp==3.9.1 aiosignal==1.3.1 altair==5.2.0 annotated-types==0.6.0 antlr4-python3-runtime==4.9.3 anyio==3.7.1 appdirs==1.4.4 argon2-cffi==23.1.0 argon2-cffi-bindings==21.2.0 arxiv==2.1.0 async-timeout==4.0.3 attrs==23.2.0 backoff==2.2.1 beautifulsoup4==4.12.2 blinker==1.7.0 blis==0.7.11 Brotli==1.1.0 cachetools==5.3.2 catalogue==2.0.10 certifi==2023.11.17 cffi==1.16.0 chardet==5.2.0 charset-normalizer==2.0.12 click==8.0.1 cloudpathlib==0.16.0 cmake==3.28.1 colorama==0.4.6 coloredlogs==15.0.1 confection==0.1.4 contourpy==1.2.0 cpm-kernels==1.0.11 cryptography==41.0.7 cuda-python==12.2.0 curl-cffi==0.5.10 cycler==0.12.1 cymem==2.0.8 Cython==3.0.8 dataclasses-json==0.6.3 datasets==2.14.7 deepspeed==0.13.1 Deprecated==1.2.14 deprecation==2.1.0 dill==0.3.7 distro==1.9.0 docker-pycreds==0.4.0 duckduckgo_search==4.1.1 effdet==0.4.1 einops==0.7.0 elastic-transport==8.11.0 elasticsearch==8.11.1 elasticsearch-dsl==8.11.0 emoji==2.9.0 environs==9.5.0 et-xmlfile==1.1.0 exceptiongroup==1.2.0 faiss-cpu==1.7.4 fastapi==0.106.0 feedparser==6.0.10 ffmpy==0.3.1 filelock==3.13.1 filetype==1.2.0

flatbuffers==23.5.26 fonttools==4.47.0 frozenlist==1.4.1 fschat==0.2.34 fsspec==2023.10.0 gitdb==4.0.11 GitPython==3.1.40 google-auth==2.27.0 google-auth-oauthlib==1.2.0 gradio==4.16.0 gradio_client==0.8.1 greenlet==3.0.3 grpcio==1.53.0 grpcio-tools==1.47.5 h11==0.14.0 h2==4.1.0 hjson==3.1.0 hpack==4.0.0 httpcore==1.0.2 httptools==0.6.1 httpx==0.26.0 huggingface-hub==0.20.3 humanfriendly==10.0 hyperframe==6.0.1 idna==3.6 importlib-metadata==6.11.0 importlib-resources==6.1.1 iniconfig==2.0.0 iopath==0.1.10 jieba==0.42.1 Jinja2==3.1.2 joblib==1.3.2 jsonlines==4.0.0 jsonpatch==1.33 jsonpointer==2.4 jsonschema==4.20.0 jsonschema-specifications==2023.12.1 kiwisolver==1.4.5 langchain==0.0.344 langchain-core==0.0.13 langchain-experimental==0.0.43 langcodes==3.3.0 langdetect==1.0.9 langsmith==0.0.77 latex2mathml==3.77.0 layoutparser==0.3.4 lit==17.0.6 llama-index==0.9.24 lxml==5.0.0 Markdown==3.5.1 markdown-it-py==3.0.0 markdown2==2.4.12 markdownify==0.11.6 MarkupSafe==2.1.3 marshmallow==3.20.1 matplotlib==3.8.2 mdtex2html==1.3.0 mdurl==0.1.2 metaphor-python==0.1.23 milvus-cli==0.3.3 minio==7.2.3 mmh3==3.0.0 mpmath==1.3.0 msg-parser==1.2.0 msgpack==1.0.7 multidict==6.0.4 multiprocess==0.70.15 murmurhash==1.0.10 mypy-extensions==1.0.0 nest-asyncio==1.5.8 networkx==3.2.1 nh3==0.2.15 ninja==1.11.1.1 nltk==3.8.1 numexpr==2.8.8 numpy==1.24.4 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.2.10.91 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusparse-cu11==11.7.4.91 nvidia-nccl-cu11==2.14.3 nvidia-nvtx-cu11==11.7.91 oauthlib==3.2.2 olefile==0.47 omegaconf==2.3.0 onnx==1.15.0 onnxruntime==1.15.1 openai==1.6.1 opencv-python==4.9.0.80 openpyxl==3.1.2 optimum==1.16.2 orjson==3.9.12 packaging==23.2 pandas==2.0.3 pathlib==1.0.1 pdf2image==1.16.3 pdfminer.six==20221105 pdfplumber==0.10.3 peft==0.7.1 pgvector==0.2.4 Pillow==10.0.1 
pluggy==1.3.0 portalocker==2.8.2 preshed==3.0.9 prompt-toolkit==3.0.43 protobuf==3.20.3 psutil==5.9.7 psycopg2==2.9.9 py-cpuinfo==9.0.0 pyarrow==14.0.2 pyarrow-hotfix==0.6 pyasn1==0.5.1 pyasn1-modules==0.3.0 pyclipper==1.3.0.post5 pycocotools==2.0.7 pycparser==2.21 pycryptodome==3.20.0 pydantic==2.6.0 pydantic_core==2.16.1 pydeck==0.8.1b0 pydub==0.25.1 Pygments==2.17.2 pymilvus==2.2.6 PyMuPDF==1.23.8 PyMuPDFb==1.23.7 pynvml==11.5.0 pypandoc==1.12 pyparsing==3.1.1 pypdf==3.17.4 PyPDF2==3.0.1 pypdfium2==4.25.0 pytesseract==0.3.10 pytest==7.4.4 python-dateutil==2.8.2 python-decouple==3.8 python-docx==1.1.0 python-dotenv==1.0.0 python-iso639==2024.1.2 python-magic==0.4.27 python-multipart==0.0.6 python-pptx==0.6.23 pytz==2023.3.post1 PyYAML==6.0.1 rank-bm25==0.2.2 rapidfuzz==3.6.1 rapidocr-onnxruntime==1.3.9 ray==2.9.0 referencing==0.32.0 regex==2023.12.25 requests==2.31.0 requests-oauthlib==1.3.1 rich==13.7.0 rouge-chinese==1.0.3 rpds-py==0.16.2 rsa==4.9 ruamel.yaml==0.18.5 ruamel.yaml.clib==0.2.8 ruff==0.1.15 safetensors==0.4.1 scikit-learn==1.3.2 scipy==1.11.4 semantic-version==2.10.0 sentence-transformers==2.2.2 sentencepiece==0.1.99 sentry-sdk==1.39.2 setproctitle==1.3.3 sgmllib3k==1.0.0 shapely==2.0.2 shellingham==1.5.4 shortuuid==1.0.11 simplejson==3.19.2 six==1.16.0 smart-open==6.4.0 smmap==5.0.1 sniffio==1.3.0 socksio==1.0.0 soupsieve==2.5 spacy==3.7.2 spacy-legacy==3.0.12 spacy-loggers==1.0.5 SQLAlchemy==2.0.19 srsly==2.4.8 sse-starlette==1.8.2 starlette==0.27.0 streamlit==1.29.0 streamlit-aggrid==0.3.4.post3 streamlit-antd-components==0.2.5 streamlit-chatbox==1.1.11 streamlit-feedback==0.1.3 streamlit-modal==0.1.0 streamlit-option-menu==0.3.6 strsimpy==0.2.1 svgwrite==1.4.3 sympy==1.12 tabulate==0.8.9 tenacity==8.2.3 tensorboard==2.15.1 tensorboard-data-server==0.7.2 tensorboardX==2.6.2.2 thinc==8.2.2 threadpoolctl==3.2.0 tiktoken==0.5.2 timm==0.9.12 tokenizers==0.13.3 toml==0.10.2 tomli==2.0.1 tomlkit==0.12.0 toolz==0.12.0 torch==2.0.1 torchaudio==2.0.2 
torchvision==0.15.2 tornado==6.4 tqdm==4.66.1 transformers==4.30.2 transformers-stream-generator==0.0.4 triton==2.0.0 typer==0.9.0 typing-inspect==0.9.0 typing_extensions==4.9.0 tzdata==2023.4 tzlocal==5.2 ujson==5.4.0 unstructured==0.11.0 unstructured-inference==0.7.15 unstructured.pytesseract==0.3.12 urllib3==1.26.18 utils==1.0.2 uvicorn==0.25.0 uvloop==0.19.0 validators==0.22.0 vllm==0.2.1.post1 wandb==0.16.2 wasabi==1.1.2 watchdog==3.0.0 watchfiles==0.21.0 wavedrom==2.0.3.post3 wcwidth==0.2.12 weasel==0.3.4 websockets==11.0.3 Werkzeug==3.0.1 wrapt==1.16.0 xformers==0.0.22 xlrd==2.0.1 XlsxWriter==3.1.9 xxhash==3.4.1 yarl==1.9.4 youtube-search==2.1.2 zipp==3.17.0

Who can help?

Btlmd

I am encountering an error during LoRA fine-tuning: after 7 rounds of training, the log shows “Starting evaluation,” and then the error in the issue title occurs. How can I solve this?


Information

Reproduction

I am running the officially provided example script, finetune_hf.py, with the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=4 finetune_hf.py data/AdvertiseGen_fix/ /WEIGHTS/LLM_Embedding_model/LLM/chatglm3-6b configs/lora.yaml

Expected behavior

I hope to hear from you, thank you.

zRzRzRzRzRzRzR commented 9 months ago

I have not tested multi-GPU training without DeepSpeed; I only recall that multi-GPU training with DeepSpeed works correctly.

sevenandseven commented 9 months ago

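What is the command for single-node multi-GPU training with DeepSpeed? I ask because finetune_hf.py appears to accept only the following three arguments: data/AdvertiseGen_fix/ /WEIGHTS/LLM_Embedding_model/LLM/chatglm3-6b configs/lora.yaml

There is no --deepspeed argument.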

zRzRzRzRzRzRzR commented 9 months ago

Run it like this:
torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune_hf.py data/AdvertiseGen_fix THUDM/chatglm3-6b configs/sft.yaml --deepspeed configs/deepspeed.json
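
For anyone scripting this launch, here is a minimal sketch that assembles the same torchrun command from its parts. The helper name `build_torchrun_command` and its parameters are hypothetical and used only for illustration. The flags and paths are taken from the command in the reply above. This only builds and prints the command; it does not check that finetune_hf.py accepts `--deepspeed` in your checkout.

```python
import shlex


def build_torchrun_command(data_dir, model_path, config_path,
                           nproc_per_node, deepspeed_config=None):
    """Hypothetical helper: build the torchrun argument list for finetune_hf.py."""
    cmd = [
        "torchrun",
        "--standalone",
        "--nnodes=1",
        f"--nproc_per_node={nproc_per_node}",
        "finetune_hf.py",
        data_dir,
        model_path,
        config_path,
    ]
    if deepspeed_config is not None:
        # Flag and config path exactly as given in the reply above.
        cmd += ["--deepspeed", deepspeed_config]
    return cmd


cmd = build_torchrun_command(
    "data/AdvertiseGen_fix",
    "THUDM/chatglm3-6b",
    "configs/sft.yaml",
    nproc_per_node=8,
    deepspeed_config="configs/deepspeed.json",
)
# Print a shell-ready string; pass `cmd` to subprocess.run to actually launch.
print(shlex.join(cmd))
```

To match the 4-GPU LoRA setup from the original report, pass `nproc_per_node=4` and `configs/lora.yaml` instead. Whether that combination works with the DeepSpeed config is not confirmed in this thread.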