ModelCheckpoint fail to store

cvignac / DiGress

code for the paper "DiGress: Discrete Denoising diffusion for graph generation"

MIT License


ModelCheckpoint fail to store #70

Open se7esx opened 8 months ago

se7esx commented 8 months ago

When I run train_qm9_regressor.py, the model is not saved: no parameters are written to checkpoint_callback.dirpath. I am training offline.

Antoninnnn commented 8 months ago

One possible solution is to set train.save_model to true in regressor_model.yaml.
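A minimal sketch of why such a flag can matter, assuming the training script only registers a checkpoint callback when train.save_model is true. The helper `build_callbacks`, the `general.name` field, and the stand-in factory are hypothetical and not taken from the repository; in a real run the factory would be something like PyTorch Lightning's ModelCheckpoint.

```python
from types import SimpleNamespace


def build_callbacks(cfg, make_checkpoint):
    """Return the callback list, adding a checkpoint callback only if enabled."""
    callbacks = []
    if cfg.train.save_model:
        # Hypothetical directory layout; the real dirpath comes from the project config.
        callbacks.append(make_checkpoint(dirpath=f"checkpoints/{cfg.general.name}"))
    return callbacks


# Hypothetical config values, standing in for the parsed YAML.
cfg = SimpleNamespace(
    train=SimpleNamespace(save_model=True),
    general=SimpleNamespace(name="qm9_regressor"),
)

# Stand-in factory that just records its arguments.
callbacks = build_callbacks(cfg, make_checkpoint=lambda dirpath: {"dirpath": dirpath})
print(callbacks)
```

If the flag is false under this pattern, the callback list stays empty and nothing is ever written, whatever dirpath is configured.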

se7esx commented 8 months ago

> One possible solution is to set train.save_model to true in regressor_model.yaml.

Thanks, but I followed your setup and still can't store the checkpoint ...

Abusagit commented 7 months ago

Hi! I have faced exactly the same problem. I'm trying to solve it now and will report back if something works out. For now it seems that some options are duplicated across config files (e.g. the model config is also written in experiments/<config_name>.yaml). The same goes for the training options. The config structure seems overcomplicated to me; maybe this is why our problem occurred.
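To illustrate how a duplicated option could hide an edit, here is a small sketch that assumes a simple "last layer wins" merge. This may not match how the repository's configs are actually composed, and the layer contents and the `n_epochs` key are hypothetical.

```python
def deep_merge(base, override):
    """Return a new dict where values in `override` win over `base`, recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical contents of two config layers that both define train.save_model.
train_defaults = {"train": {"save_model": True, "n_epochs": 100}}
experiment = {"train": {"save_model": False}}

# The experiment layer is applied last, so its value shadows the default.
effective = deep_merge(train_defaults, experiment)
print(effective["train"]["save_model"])
```

Under that assumption, editing only the file that loses the merge changes nothing, so it may be worth printing the effective config at startup to see which value is actually used.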

Abusagit commented 7 months ago

UPD: I analyzed the logs, and in my case the model checkpoints (as well as other run data) were saved in the parent directory, which is determined in config.yaml. Before that, I had run main.py from the src directory, so the model outputs were saved in the root dir, and I wrongly assumed that the next outputs would appear in the root too.

Also, it turned out that I hadn't changed any save_model parameters, so it worked out of the box. @se7esx, does your run produce output that only lacks checkpoints, or does it produce nothing at all?
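A quick way to check where a run actually wrote its files is sketched below. It assumes checkpoints use the .ckpt extension and that a relative dirpath is resolved against the working directory of the run; both helper names are hypothetical.

```python
from pathlib import Path


def resolve_dirpath(dirpath, cwd):
    """Resolve a possibly relative checkpoint directory against a working directory."""
    path = Path(dirpath)
    return path if path.is_absolute() else Path(cwd) / path


def find_checkpoints(root):
    """List every *.ckpt file below `root`, sorted by path."""
    return sorted(Path(root).rglob("*.ckpt"))
```

For example, calling find_checkpoints("..") from the src directory would also search the parent directory, which is where the files ended up in my case.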

se7esx commented 7 months ago

> UPD: I analyzed the logs, and in my case the model checkpoints (as well as other run data) were saved in the parent directory, which is determined in config.yaml. Before that, I had run main.py from the src directory, so the model outputs were saved in the root dir, and I wrongly assumed that the next outputs would appear in the root too.
>
> Also, it turned out that I hadn't changed any save_model parameters, so it worked out of the box. @se7esx, does your run produce output that only lacks checkpoints, or does it produce nothing at all?

I checked the default save path. It was created in a folder called output, but the checkpoint was never saved, no matter how I set the save parameter.