[W AddKernelNpu.cpp:82] Warning: The oprator of add is executed, Currently High Accuracy but Low Performance OP with 64-bit has been used, Please Do Some Cast at Python Functions with 32-bit for Better Performance! (function operator())
[W VariableFallbackKernel.cpp:51] Warning: CAUTION: The operator 'aten::isin.Tensor_Tensor_out' is not currently supported on the NPU backend and will fall back to run on the CPU. This may have performance implications. (function npu_cpu_fallback)
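For triage, the two warnings above can be tallied from a captured log with a small standard-library helper. This is only a sketch: `summarize_npu_warnings` is a hypothetical name, and the patterns are inferred solely from the two log lines quoted here, so other torch-npu messages may not match.

```python
import re
from collections import Counter

# Header printed before each warning: "[W <file>:<line>] Warning: "
HEADER_RE = re.compile(r"\[W (?P<file>[\w.]+):(?P<line>\d+)\] Warning: ")
# Wording of the CPU-fallback warning, taken from the log line above (assumed stable)
FALLBACK_RE = re.compile(
    r"The operator '(?P<op>[^']+)' is not currently supported on the NPU backend"
)


def summarize_npu_warnings(log_text):
    """Count warnings per source location and per CPU-fallback operator."""
    sources = Counter()
    fallback_ops = Counter()
    # Split just before every header so several warnings on one line are handled too.
    for chunk in re.split(r"(?=\[W [\w.]+:\d+\] Warning: )", log_text):
        header = HEADER_RE.match(chunk)
        if header is None:
            continue  # text before the first warning, or an empty piece
        sources[f"{header['file']}:{header['line']}"] += 1
        fallback = FALLBACK_RE.search(chunk)
        if fallback:
            fallback_ops[fallback["op"]] += 1
    return sources, fallback_ops


# The two warnings from this post, joined on one line as they appear in the console.
SAMPLE_LOG = (
    "[W AddKernelNpu.cpp:82] Warning: The oprator of add is executed, Currently High "
    "Accuracy but Low Performance OP with 64-bit has been used, Please Do Some Cast at "
    "Python Functions with 32-bit for Better Performance! (function operator()) "
    "[W VariableFallbackKernel.cpp:51] Warning: CAUTION: The operator "
    "'aten::isin.Tensor_Tensor_out' is not currently supported on the NPU backend and "
    "will fall back to run on the CPU. This may have performance implications. "
    "(function npu_cpu_fallback)"
)

sources, fallback_ops = summarize_npu_warnings(SAMPLE_LOG)
print(dict(sources))
print(dict(fallback_ops))
```

On a long generation log, `fallback_ops` would list every operator that ran on the CPU, which is usually the first thing to check when throughput is low.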
torch-npu 2.2.0
`import torch
import torch_npu  # makes the "npu" device available to PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "npu"
# Local checkpoint of THUDM/glm49bchat
model_path = "/home/ma-user/THUDM/glm49bchat"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
query = "你好"  # "Hello"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
# Move every input tensor (input_ids, attention_mask, position_ids) to the NPU
for key, value in inputs.items():
    inputs[key] = value.to(device)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to(device).eval()
gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    # Unpack both dicts as keyword arguments
    outputs = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens so only the newly generated text is decoded
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))`