alibaba / MNN

MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
http://www.mnn.zone/

sgd->step(loss) returns false during training #2161

Closed Iseulf closed 9 months ago

Iseulf commented 1 year ago

Environment: WSL Ubuntu-20.04; the source I compiled is MNN 2.1.0; the model is my own converted model. The code runs without errors, but sgd->step(loss) returns false, which means no actual training takes place. Here is the source of the training function:

void train(vector<vector<vector<float>>> data, vector<int> labels, std::shared_ptr<Module> model, int echoes, int numClasses) {
    auto exe = Executor::getGlobalExecutor();
    BackendConfig config;
    exe->setGlobalExecutorConfig(MNN_FORWARD_AUTO, config, 2);
    std::shared_ptr<SGD> solver(new SGD(model));
    solver->setMomentum(0.0f);
    solver->setWeightDecay(0.0f);
    solver->setLearningRate(0.01f);

    DatasetPtr dataset = TimeDataset::create(data, labels);
    auto dataLoader = dataset.createLoader(50, true, false);
    const int iters = dataLoader->iterNumber();

    // DatasetPtr testDataset = TimeDataset::create(data, labels);
    // auto testDataLoader = testDataset.createLoader(50, true, false);
    // const int iters = testdataLoader->iterNumber();

    model->setIsTraining(true);
    for (int i = 0; i < echoes; i++) {
        // std::cout << i << std::endl;
        model->clearCache();
        exe->gc(Executor::FULL);
        exe->resetProfile();
        {
            AUTOTIME;
            dataLoader->reset();
            for (int j = 0; j < iters; j++) {
                // AUTOTIME;
                auto trainData = dataLoader->next();

                auto example = trainData[0];
                // std::cout << example.second[0]->readMap<int>()[0]<<" ";

                // Compute One-Hot
                auto newTarget = _OneHot(_Cast<int32_t>(_Squeeze(example.second[0])),
                                         _Scalar<int>(numClasses), _Scalar<float>(1.0f),
                                         _Scalar<float>(0.0f));

                auto predict = model->forward(_Convert(example.first[0], Dimensionformat::NHWC));
                auto loss = _CrossEntropy(predict, newTarget);
                // float rate   = LrScheduler::inv(0.0001, solver->currentStep(), 0.0001, 0.75);
                // float rate = 1e-2;
                // solver->setLearningRate(rate);
                if (solver->currentStep() % 10 == 0) {
                    std::cout << example.second[0]->readMap<int>()[0] << " ";
                    std::cout << " train iteration: " << solver->currentStep();
                    std::cout << " loss: " << loss->readMap<float>()[0] << std::endl;
                    // std::cout << " lr: " << rate << std::endl;
                }

                // std::cout << "Hello\n";
                std::cout << solver->step(loss) << " ";
            }
        }
        std::cout << std::endl;
    }
    exe->dumpProfile();
    // model->setIsTraining(false);
}
yyfcc17 commented 1 year ago

Could you try the latest code? Step through it with a debugger and see where it fails.
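To step into the optimizer in a debugger, MNN can be rebuilt with training support and debug symbols. A sketch, with assumptions: MNN_BUILD_TRAIN is the upstream CMake option for the training module, but the binary name and the exact break symbol path below are placeholders you would adapt to your own setup:

```shell
# Build MNN with the training module and debug symbols (run from the repo root).
cd MNN && mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DMNN_BUILD_TRAIN=ON
make -j"$(nproc)"

# Run your own training binary under gdb ("my_train" is a placeholder) and
# break inside the optimizer step; the fully qualified symbol is an assumption,
# so use gdb's tab completion on "MNN::Train::" to find the exact name.
gdb ./my_train -ex 'break MNN::Train::ParameterOptimizer::step' -ex run
```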

Iseulf commented 1 year ago

My version is 2.2.1. My current workaround is to use the C++ operators provided by MNN and implement the network model myself; training works normally that way. I will try the latest version later and report back if the problem persists.


github-actions[bot] commented 9 months ago

Marking as stale. No activity in 60 days.