ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0

Failed Eval when model training is in progress #722

Open zhangpengshan opened 4 years ago

zhangpengshan commented 4 years ago

It would be good to support running eval in parallel while training is still in progress. Training is usually a long-running job, and partway through it eval could be used to check model performance against the latest checkpointed model. At the moment, running eval during training fails with the following exception:

2020-08-13 08:38:55: ERROR EvalModelProcessor [Eval-MAR] - Exception in eval: java.io.FileNotFoundException: File models does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:444)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1548)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1591)
    at ml.shifu.shifu.util.ModelSpecLoaderUtils.findModels(ModelSpecLoaderUtils.java:541)
    at ml.shifu.shifu.util.ModelSpecLoaderUtils.locateBasicModels(ModelSpecLoaderUtils.java:359)
    at ml.shifu.shifu.util.ModelSpecLoaderUtils.loadBasicModels(ModelSpecLoaderUtils.java:217)
    at ml.shifu.shifu.core.processor.EvalModelProcessor.validateEvalColumnConfig(EvalModelProcessor.java:794)
    at ml.shifu.shifu.core.processor.EvalModelProcessor.runEval(EvalModelProcessor.java:849)
    at ml.shifu.shifu.core.processor.EvalModelProcessor.access$200(EvalModelProcessor.java:57)
    at ml.shifu.shifu.core.processor.EvalModelProcessor$2.run(EvalModelProcessor.java:695)
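
The trace shows the lookup failing because the local "models" directory does not exist yet when eval starts. Below is a rough sketch of one possible guard, not Shifu's actual code: it uses plain java.nio.file rather than the Hadoop FileSystem API seen in the trace, and the class name CheckpointModelFinder, the method names, the checkpoint directory, and the "*.nn" file pattern are all hypothetical. The idea is to return an empty result instead of throwing when the directory is missing, and to fall back to a checkpoint directory if training writes one.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Shifu.
public class CheckpointModelFinder {

    // Lists regular files in dir matching the glob, sorted by path.
    // Returns an empty list (rather than throwing) when dir is absent.
    public static List<Path> findModels(Path dir, String glob) throws IOException {
        if (!Files.isDirectory(dir)) {
            return Collections.emptyList();
        }
        List<Path> models = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    models.add(p);
                }
            }
        }
        Collections.sort(models);
        return models;
    }

    // Prefers final models; falls back to checkpointed models when the
    // final models directory is missing or empty (assumed layout).
    public static List<Path> resolveModels(Path modelsDir, Path checkpointDir, String glob)
            throws IOException {
        List<Path> finalModels = findModels(modelsDir, glob);
        if (!finalModels.isEmpty()) {
            return finalModels;
        }
        return findModels(checkpointDir, glob);
    }
}
```

With a guard like this, eval could report "no models available yet" or score the latest checkpoint instead of aborting with FileNotFoundException. Whether checkpoints are complete enough to load mid-training would still need to be checked against how training actually writes them.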