PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (『飞桨』核心框架,深度学习&机器学习高性能单机、分布式训练和跨平台部署)
http://www.paddlepaddle.org/
Apache License 2.0

Regarding bug 62639 #63033

Closed blacksheep-Aristotle closed 7 months ago

blacksheep-Aristotle commented 7 months ago

Describe the Bug

Regarding the private-format handling in https://github.com/PaddlePaddle/Paddle/issues/62639.

Additional Supplementary Information

Regarding the private-format handling in https://github.com/PaddlePaddle/Paddle/issues/62639: after applying the approach from https://github.com/PaddlePaddle/Paddle/pull/62532, the bug still exists.

wanghuancoder commented 7 months ago

Could you provide a minimal reproduction case? If you are an internal colleague, please contact me via 如流.

wanghuancoder commented 7 months ago

You can try the latest Paddle develop together with the latest PaddleCustom. I suspect this is a stride issue.

Xiadalei commented 7 months ago

> You can try the latest Paddle develop together with the latest PaddleCustom. I suspect this is a stride issue.

This should not be a stride problem. The current issue is that, after that PR, save changed from calling .numpy() directly to calling .cpu().numpy(). The difference is that .numpy() goes through the tensor_method_numpy binding method, which contains extra handling for custom devices, as shown below:

VLOG(6) << "Getting DenseTensor's numpy value";
auto dense_tensor =
    std::dynamic_pointer_cast<phi::DenseTensor>(self->tensor.impl());
// TODO(qili93): temporary for ascend npu performance to be removed along
// with npu_identity op
paddle::Tensor temp_tensor(std::make_shared<phi::DenseTensor>());
if (dense_tensor->storage_properties_initialized()) {
  temp_tensor = npu_identity_ad_func(self->tensor, -1);
  dense_tensor =
      std::dynamic_pointer_cast<phi::DenseTensor>(temp_tensor.impl());
}

Here a format conversion is performed through the npu_identity op; essentially, the NPU's private format is stripped at save time. When we use .cpu() instead, the copy ultimately goes through phi::Copy(), which directly calls the custom device runtime's memcpyd2h. Note that this does not invoke the memcpyd2h kernel (that one is used by static graphs), so this step performs no format conversion (.cpu() is only available in dynamic graph mode).

So one path converts the format and the other does not. When the model is loaded, we generally assume it contains no private format, hence the problem.
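In other words, the two save paths diverge only in whether the private format is stripped before the data is exported. A plain-Python toy model of that divergence (the tuple layout and format names are illustrative assumptions, not Paddle internals):

```python
# Toy model of the two save paths described above (not Paddle source).

def via_numpy(tensor):
    # Models Tensor.numpy(): tensor_method_numpy routes private-format
    # tensors through npu_identity, so the exported data is standard-format.
    data, fmt = tensor
    return (data, "standard")

def via_cpu_then_numpy(tensor):
    # Models Tensor.cpu().numpy(): phi::Copy() calls the runtime's
    # memcpyd2h directly, so the private format is preserved as-is.
    data, fmt = tensor
    return (data, fmt)

npu_tensor = ([1.0, 2.0], "npu_private")
print(via_numpy(npu_tensor)[1])           # standard
print(via_cpu_then_numpy(npu_tensor)[1])  # npu_private
```

A loader that assumes standard-format data will therefore misread checkpoints written through the second path.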

blacksheep-Aristotle commented 7 months ago

You can enable the private format on an NPU and run the following code to see whether it reproduces the issue.

import paddle
import numpy as np
import paddle.optimizer as opt
from paddle import nn
from paddle.vision.models import resnet50

BATCH_SIZE = 16
BATCH_NUM = 4
EPOCH_NUM = 1

IMAGE_SIZE = 224
CLASS_NUM = 1000
SEED=1234
paddle.seed(SEED)
np.random.seed(SEED)

print(paddle.device.get_available_custom_device())
def save_model(model,path):
    paddle.save(model.state_dict(), path + ".pdparams")

def load_model(model,path):
    param_state_dict = paddle.load(path + ".pdparams")
    model.set_dict(param_state_dict)

# is 'npu' the right device name here?
paddle.set_device('npu')

class RandomDataset(paddle.io.Dataset):
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __getitem__(self, idx):
        image = np.random.random([3, IMAGE_SIZE, IMAGE_SIZE]).astype('float32')
        label = np.random.randint(0, CLASS_NUM, (1, )).astype('int64')
        return image, label

    def __len__(self):
        return self.num_samples

def train(layer, loader, loss_fn, opt,prof_file='./prof'):

    for epoch_id in range(EPOCH_NUM):

        for batch_id, (images, labels) in enumerate(loader()):
            with paddle.amp.auto_cast(custom_white_list={'elementwise_add'}, level='O1'):
                out = layer(images)
                loss = paddle.nn.functional.cross_entropy(out, labels)
            loss.backward()
            opt.step()
            opt.clear_grad()

def eval(layer, loader, loss_fn, opt,prof_file='./prof'):
    loss_list=[]
    for epoch_id in range(EPOCH_NUM):

        for batch_id, (images, labels) in enumerate(loader()):
            with paddle.amp.auto_cast(custom_white_list={'elementwise_add'}, level='O1'):
                out = layer(images)
                loss = loss_fn(out, labels)
            loss_list.append(loss.numpy())

    return loss_list

dataset = RandomDataset(BATCH_NUM * BATCH_SIZE)
loader = paddle.io.DataLoader(dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    drop_last=True,
    num_workers=2)

def start_train(model,prof_file='./prof'):

    loss_fn = nn.CrossEntropyLoss()
    adam = opt.Adam(learning_rate=0.001, parameters=model.parameters())

    return train(model, loader, loss_fn, adam,prof_file)

def start_eval(model, prof_file='./prof'):

    loss_fn = nn.CrossEntropyLoss()
    adam = opt.Adam(learning_rate=0.001, parameters=model.parameters())
    model.eval()
    return eval(model, loader, loss_fn, adam, prof_file)

model = resnet50()
test_model=resnet50()

start_train(model)
save_model(model,'2.6hi_best')
load_model(test_model,'2.6hi_best')

model_dict=model.state_dict()
model_load_dict=test_model.state_dict()

for k, v in model_dict.items():
    np.testing.assert_allclose(v.numpy(), model_load_dict[k].numpy())

blacksheep-Aristotle commented 7 months ago

> You can try the latest Paddle develop together with the latest PaddleCustom. I suspect this is a stride issue.

^^^

wanghuancoder commented 7 months ago

Sure. I have an urgent task due before 4.8, so I will get to this issue a bit later. Thanks for the report!

wanghuancoder commented 7 months ago

@blacksheep-Aristotle @Xiadalei Based on @Xiadalei's description of npu_identity_ad_func, I have submitted a PR attempting to fix this issue. However, I don't have an NPU machine at hand. Could you help test it when you have time? And if problems remain, could you help debug? Many thanks!

wanghuancoder commented 7 months ago

The PR has been merged. If it does not resolve the problem, please open a new issue, in case I miss further comments here.