Closed puppyjn closed 1 month ago
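The OS, GPU driver, paddlepaddle, and paddlenlp have all been reinstalled several times. The only remaining explanation I can see is a fault in the host machine or the GPU itself.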
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
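Describe the Bug
Environment:
VMware ESXi 7.0.3 virtual machine, Ubuntu 22.04 Desktop (same behavior on the Server edition)
GPU: Tesla P100 16G, passed through to the VM
NVIDIA driver 535, CUDA 12.2, cuDNN 8.9 (same behavior with driver 520 and CUDA 11.8)
paddlepaddle_gpu==2.6.1_post10 (same behavior with 2.6)
paddlenlp=3.0 (same behavior with 2.7)
python -c "import paddle; paddle.utils.run_check()" reports no problems
Symptom: running the simplest example fails with an error.
Notes:
1. No other Taskflow task works either: each one returns empty output or raises some other exception.
2. In the same GPU environment, xinference works normally (PyTorch inference for Embedding, Rerank, and LLM models).
3. On another host with the same CPU model, a 1080 and a 3090 both work normally. The problem host is at a customer site, so I cannot swap the card to verify.
4. Issue #55571 reported the same bug, but it could not be reproduced and was closed.
Error: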
Additional Supplementary Information
lscpu output:
Architecture: x86_64
  CPU op-mode(s): 32-bit, 64-bit
  Address sizes: 45 bits physical, 48 bits virtual
  Byte Order: Little Endian
CPU(s): 8
  On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
  Model name: Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz
    CPU family: 6
    Model: 85
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s): 2
    Stepping: 4
    BogoMIPS: 4999.99
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor: VMware
  Virtualization type: full
Caches (sum of all):
  L1d: 256 KiB (8 instances)
  L1i: 256 KiB (8 instances)
  L2: 8 MiB (8 instances)
  L3: 55 MiB (2 instances)
NUMA:
  NUMA node(s): 1
  NUMA node0 CPU(s): 0-7
Vulnerabilities:
  Gather data sampling: Unknown: Dependent on hypervisor status
  Itlb multihit: KVM: Mitigation: VMX unsupported
  L1tf: Mitigation; PTE Inversion
  Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown: Mitigation; PTI
  Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed: Mitigation; IBRS
  Spec rstack overflow: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds: Not affected
  Tsx async abort: Not affected
Output of python -c "import paddle; paddle.utils.run_check()":
Running verify PaddlePaddle program ...
I0628 16:10:36.952970 30728 program_interpreter.cc:212] New Executor is Running.
W0628 16:10:36.953351 30728 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 90 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0628 16:10:36.953377 30728 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.0, Driver API Version: 12.2, Runtime API Version: 12.0
W0628 16:10:36.954388 30728 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I0628 16:10:37.040109 30728 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
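The run_check() log above passes overall but still carries a gpu_resources.cc warning that the installed wheel was built for arch 70 75 80 86 90 while the device reports Compute Capability 6.0. As a rough sketch for anyone triaging a similar report, the snippet below pulls those two values out of the log text and compares them. The helper name `summarize_gpu_warnings` is hypothetical (it is not part of Paddle), and treating "covered" as exact membership in the arch list is a simplifying assumption, not a statement of how CUDA binary compatibility works in general.

```python
import re

# Two warning lines copied from the run_check() log in this report.
SAMPLE_LOG = (
    "W0628 16:10:36.953351 30728 gpu_resources.cc:96] The GPU architecture in "
    "your current machine is Pascal, which is not compatible with Paddle "
    "installation with arch: 70 75 80 86 90 , it is recommended to install the "
    "corresponding wheel package according to the installation information on "
    "the official Paddle website.\n"
    "W0628 16:10:36.953377 30728 gpu_resources.cc:119] Please NOTE: device: 0, "
    "GPU Compute Capability: 6.0, Driver API Version: 12.2, Runtime API "
    "Version: 12.0\n"
)


def summarize_gpu_warnings(log_text):
    """Hypothetical helper: compare the device arch with the wheel's arch list.

    Returns None when either value is missing from the log text.
    """
    cap = re.search(r"GPU Compute Capability:\s*(\d+)\.(\d+)", log_text)
    archs = re.search(r"installation with arch:\s*((?:\d+\s+)+)", log_text)
    if cap is None or archs is None:
        return None

    # "6.0" becomes 60, matching the notation used in the arch list.
    device_arch = int(cap.group(1)) * 10 + int(cap.group(2))
    wheel_archs = [int(a) for a in archs.group(1).split()]

    return {
        "device_arch": device_arch,
        "wheel_archs": wheel_archs,
        # Assumption: exact membership only; ignores any PTX forward compatibility.
        "covered": device_arch in wheel_archs,
    }


if __name__ == "__main__":
    print(summarize_gpu_warnings(SAMPLE_LOG))
```

For the log in this report the helper reports that arch 60 is not in the wheel's list. Whether that mismatch is the cause of the Taskflow failure is not established anywhere in this thread; it is only the one warning that run_check() prints before declaring the installation successful.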
Text classification error message
Traceback (most recent call last): File "", line 1, in
File "/root/paddlenlp/PaddleNLP/paddlenlp/taskflow/taskflow.py", line 822, in __call__
results = self.task_instance(inputs, **kwargs)
File "/root/paddlenlp/PaddleNLP/paddlenlp/taskflow/task.py", line 527, in __call__
outputs = self._run_model(inputs, **kwargs)
File "/root/paddlenlp/PaddleNLP/paddlenlp/taskflow/lexical_analysis.py", line 219, in _run_model
self.predictor.run()
ValueError: In user code: