Error while loading the model

damiankucharski commented 1 year ago

Hello @YixingHuang, I am getting an error when I try to run inference with the command mentioned in https://github.com/YixingHuang/DeepMedicPlus/issues/1:

python deepMedicRun -model ./examples/configFiles/deepMedicPlus/model/modelConfig_wide1_deeper.cfg -test ./examples/configFiles/deepMedicPlus/test/testConfig.cfg -load ./examples/output/saved_models/deepMedicWide1.high_sensitivity.model.ckpt -dev cuda0

I am getting the following error:

=========== Loading parameters from specified saved model ===============
Loading parameters from:/home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/examples/output/saved_models/deepMedicWide1.high_sensitivity.model.ckpt

ERROR: DeepMedic caught exception when trying to load parameters from the given path of a previously saved model.
Two reasons are very likely:
a) Most probably you passed the wrong path. You need to provide the path to the Tensorflow checkpoint, as expected by Tensorflow.
         In the traceback further below, Tensorflow may report this error of type [NotFoundError].
         DeepMedic uses tensorflow checkpoints to save the models. For this, it stores different types of files for every saved timepoint.
         Those files will be by default in ./examples/output/saved_models, and of the form:
         filename.datetime.model.ckpt.data-0000-of-0001 
         filename.datetime.model.ckpt.index 
         filename.datetime.model.ckpt.meta (Maybe this is missing. That's ok.) 
         To load this checkpoint, you have to provide the path, OMMITING the part after the [.ckpt]. I.e., your command should look like:
         python ./deepMedicRun.py -model path/to/model/config -train path/to/train/config -load filename.datetime.model.ckpt 
b) You have created a network of different architecture than the one that is being loaded and Tensorflow fails to match their variables.
         If this is the case, Tensorflow may report it below as error of type [DataLossError]. 
         If you did not mean to change architectures, ensure that you point to the same modelConfig.cfg as used when the saved model was made.
         If you meant to change architectures, then you will have to create your own script to load the parameters from the saved checkpoint, where the script must describe which variables of the new model match the ones from the saved model.
c) The above are "most likely" reasons, but others are possible. Please read the following Tensorflow stacktrace and error report carefully, and debug accordingly...

Traceback (most recent call last):
  File "/home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/deepmedic/frontEnd/testSession.py", line 111, in run_session
    saver_net.restore(sessionTf, chkpt_fname)
  File "/pstore/data/gbm_pilot/BRAIN_METS/envs/brain_mets_dev/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1407, in restore
    raise ValueError("The passed save_path is not a valid checkpoint: " +
ValueError: The passed save_path is not a valid checkpoint: /home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/examples/output/saved_models/deepMedicWide1.high_sensitivity.model.ckpt

I have modified the testChannels_t1c.cfg and testPriorChannels_t1c.cfg files with the paths to my test cases, and I have put the test cases in examples/data_test and examples/data_test_prior.

Do you have any idea what may be causing the issue?

YixingHuang commented 1 year ago

Hi DK @damiankucharski, I have just updated the checkpoint here. Can you replace the absolute model path ("C:\MachineLearning\") with the path to your own folder and try again? Let me know whether it works.

damiankucharski commented 1 year ago

Hello @YixingHuang, thank you for your swift response. I am still getting an error.

I modified examples/output/saved_models/pretrainedModels/checkpoint so that it points to my local files:

model_checkpoint_path: "/home/kucharsd/Documents/Git/DeepMedicPlus/examples/output/saved_models/pretrainedModels/deepMedicWide1.high_sensitivity.model.ckpt"
all_model_checkpoint_paths: "/home/kucharsd/Documents/Git/DeepMedicPlus/examples/output/saved_models/pretrainedModels/deepMedicWide1.high_sensitivity.model.ckpt"
all_model_checkpoint_paths: "/home/kucharsd/Documents/Git/DeepMedicPlus/examples/output/saved_models/pretrainedModels/deepMedicWide1.high_precision.model.ckpt"

Error message:

=========== Loading parameters from specified saved model ===============
Loading parameters from:/home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/examples/output/saved_models/deepMedicWide1.high_sensitivity.model.ckpt

ERROR: DeepMedic caught exception when trying to load parameters from the given path of a previously saved model.
Two reasons are very likely:
a) Most probably you passed the wrong path. You need to provide the path to the Tensorflow checkpoint, as expected by Tensorflow.
         In the traceback further below, Tensorflow may report this error of type [NotFoundError].
         DeepMedic uses tensorflow checkpoints to save the models. For this, it stores different types of files for every saved timepoint.
         Those files will be by default in ./examples/output/saved_models, and of the form:
         filename.datetime.model.ckpt.data-0000-of-0001 
         filename.datetime.model.ckpt.index 
         filename.datetime.model.ckpt.meta (Maybe this is missing. That's ok.) 
         To load this checkpoint, you have to provide the path, OMMITING the part after the [.ckpt]. I.e., your command should look like:
         python ./deepMedicRun.py -model path/to/model/config -train path/to/train/config -load filename.datetime.model.ckpt 
b) You have created a network of different architecture than the one that is being loaded and Tensorflow fails to match their variables.
         If this is the case, Tensorflow may report it below as error of type [DataLossError]. 
         If you did not mean to change architectures, ensure that you point to the same modelConfig.cfg as used when the saved model was made.
         If you meant to change architectures, then you will have to create your own script to load the parameters from the saved checkpoint, where the script must describe which variables of the new model match the ones from the saved model.
c) The above are "most likely" reasons, but others are possible. Please read the following Tensorflow stacktrace and error report carefully, and debug accordingly...

Traceback (most recent call last):
  File "/home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/deepmedic/frontEnd/testSession.py", line 111, in run_session
    saver_net.restore(sessionTf, chkpt_fname)
  File "/pstore/data/gbm_pilot/BRAIN_METS/envs/brain_mets_dev/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1407, in restore
    raise ValueError("The passed save_path is not a valid checkpoint: " +
ValueError: The passed save_path is not a valid checkpoint: /home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/examples/output/saved_models/deepMedicWide1.high_sensitivity.model.ckpt

YixingHuang commented 1 year ago

Oh, I just noticed that "The passed save_path is not a valid checkpoint" was already in your first error message. Can you please double-check whether the model path is correct?
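One way to double-check a path like this without starting TensorFlow is to look for the files that the error text above says a saved model consists of. The snippet below is only a minimal sketch: the helper names `checkpoint_files` and `looks_like_checkpoint` are made up for illustration and are not part of DeepMedicPlus or TensorFlow, and it assumes the checkpoint was saved as a `.index` file plus one or more `.data-*` shards.

```python
from pathlib import Path


def checkpoint_files(prefix):
    """Return the names of the files on disk that start with '<prefix>.'."""
    prefix = Path(prefix)
    if not prefix.parent.is_dir():
        return []
    return sorted(
        p.name
        for p in prefix.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix.name + ".")
    )


def looks_like_checkpoint(prefix):
    """True if both '<prefix>.index' and a '<prefix>.data-*' shard exist."""
    names = checkpoint_files(prefix)
    base = Path(prefix).name
    has_index = (base + ".index") in names
    has_data = any(n.startswith(base + ".data-") for n in names)
    return has_index and has_data
```

Calling `looks_like_checkpoint(...)` with exactly the string passed to `-load` shows whether the files sit where the command expects them. An empty list from `checkpoint_files(...)` suggests the files are in a different folder.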

YixingHuang commented 1 year ago

@damiankucharski DK, in the error message your model is located in "DeepMedicPlus/DeepMedicPlus/": /home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/examples/output/saved_models/deepMedicWide1.high_sensitivity.model.ckpt

In your manually written checkpoint path, however, "DeepMedicPlus" appears only once in the complete path. Please double-check this.
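The entries in the `checkpoint` file can also be compared against the disk programmatically. The following is a rough sketch, not code from DeepMedicPlus: `read_checkpoint_state` and `missing_prefixes` are hypothetical helpers, the regular expression only handles the simple `key: "path"` lines shown in this thread, and relative entries are assumed to be relative to the folder holding the `checkpoint` file. Note also that TensorFlow's `Saver.restore` validates the path it is given, and the traceback shows it being called with the path from `-load`, so editing the `checkpoint` file may not change which path is loaded.

```python
import re
from pathlib import Path

# Matches lines such as: model_checkpoint_path: "/some/dir/net.model.ckpt"
_ENTRY = re.compile(
    r'^(model_checkpoint_path|all_model_checkpoint_paths):\s*"(.*)"\s*$'
)


def read_checkpoint_state(state_file):
    """Return (key, path) pairs listed in a `checkpoint` state file."""
    entries = []
    for line in Path(state_file).read_text().splitlines():
        match = _ENTRY.match(line.strip())
        if match:
            entries.append((match.group(1), match.group(2)))
    return entries


def missing_prefixes(state_file):
    """Return the listed prefixes that have no '<prefix>.index' file on disk."""
    state_dir = Path(state_file).parent
    missing = []
    for _, raw in read_checkpoint_state(state_file):
        prefix = Path(raw)
        if not prefix.is_absolute():
            # Assumption: relative entries are relative to the state file's folder.
            prefix = state_dir / prefix
        if not Path(str(prefix) + ".index").is_file():
            missing.append(str(prefix))
    return missing
```

An empty result from `missing_prefixes(...)` means every listed prefix has an `.index` file next to it. Any path that is returned points at a folder where no such file was found.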

damiankucharski commented 1 year ago

@YixingHuang I am still getting the error after the change.

My checkpoint file now looks like this (I also tried the version with //):

model_checkpoint_path: "/home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/examples/output/saved_models/pretrainedModels/deepMedicWide1.high_sensitivity.model.ckpt"
all_model_checkpoint_paths: "/home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/examples/output/saved_models/pretrainedModels/deepMedicWide1.high_sensitivity.model.ckpt"
all_model_checkpoint_paths: "/home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/examples/output/saved_models/pretrainedModels/deepMedicWide1.high_precision.model.ckpt"

and I think the error is the same:

=========== Loading parameters from specified saved model ===============
Loading parameters from:/home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/examples/output/saved_models/deepMedicWide1.high_sensitivity.model.ckpt

ERROR: DeepMedic caught exception when trying to load parameters from the given path of a previously saved model.
Two reasons are very likely:
a) Most probably you passed the wrong path. You need to provide the path to the Tensorflow checkpoint, as expected by Tensorflow.
         In the traceback further below, Tensorflow may report this error of type [NotFoundError].
         DeepMedic uses tensorflow checkpoints to save the models. For this, it stores different types of files for every saved timepoint.
         Those files will be by default in ./examples/output/saved_models, and of the form:
         filename.datetime.model.ckpt.data-0000-of-0001 
         filename.datetime.model.ckpt.index 
         filename.datetime.model.ckpt.meta (Maybe this is missing. That's ok.) 
         To load this checkpoint, you have to provide the path, OMMITING the part after the [.ckpt]. I.e., your command should look like:
         python ./deepMedicRun.py -model path/to/model/config -train path/to/train/config -load filename.datetime.model.ckpt 
b) You have created a network of different architecture than the one that is being loaded and Tensorflow fails to match their variables.
         If this is the case, Tensorflow may report it below as error of type [DataLossError]. 
         If you did not mean to change architectures, ensure that you point to the same modelConfig.cfg as used when the saved model was made.
         If you meant to change architectures, then you will have to create your own script to load the parameters from the saved checkpoint, where the script must describe which variables of the new model match the ones from the saved model.
c) The above are "most likely" reasons, but others are possible. Please read the following Tensorflow stacktrace and error report carefully, and debug accordingly...

Traceback (most recent call last):
  File "/home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/deepmedic/frontEnd/testSession.py", line 111, in run_session
    saver_net.restore(sessionTf, chkpt_fname)
  File "/pstore/data/gbm_pilot/BRAIN_METS/envs/brain_mets_dev/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1407, in restore
    raise ValueError("The passed save_path is not a valid checkpoint: " +
ValueError: The passed save_path is not a valid checkpoint: /home/kucharsd/Documents/Git/DeepMedicPlus/DeepMedicPlus/examples/output/saved_models/deepMedicWide1.high_sensitivity.model.ckpt

damiankucharski commented 1 year ago

@YixingHuang I think I found the issue. The command you asked me to run is missing the "/pretrainedModels" part of the path. With that added it seems to work, but I am now getting some graph execution errors. These are probably caused by library version mismatches; I will investigate and see whether fixing them resolves the issue completely.
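A missing subfolder like this can be spotted by searching the saved_models folder for checkpoint index files and printing the prefixes they belong to. This is a small sketch under the assumption that each saved model has a `<prefix>.index` file; `find_checkpoint_prefixes` is a hypothetical helper name, not something provided by DeepMedicPlus.

```python
from pathlib import Path


def find_checkpoint_prefixes(root):
    """Return every path below root that has a '<path>.index' file, suffix removed."""
    suffix = ".index"
    return sorted(
        str(p)[: -len(suffix)]
        for p in Path(root).rglob("*" + suffix)
        if p.is_file()
    )


if __name__ == "__main__":
    # Folder taken from the thread; adjust it to the local clone.
    for prefix in find_checkpoint_prefixes("./examples/output/saved_models"):
        print(prefix)
```

Each printed line is a candidate value for `-load`. Here, the expected prefix would include the pretrainedModels folder.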

YixingHuang commented 1 year ago

Thanks for letting me know. I will update the command accordingly.

YixingHuang / DeepMedicPlus

Error while loading the model #2