chemprop / chemprop

Message Passing Neural Networks for Molecule Property Prediction
https://chemprop.csail.mit.edu

[v2 BUG]: model files do not work after v1_to_v2 conversion #1054

Open dskarlov opened 1 month ago

dskarlov commented 1 month ago

I have a model trained with chemprop 1.7.1 that uses two SMILES columns. When I switched to chemprop 2.0.5 and converted the model files, the converted file was surprisingly about half the original size (2.3 MB -> 1.2 MB). When I then attempted to run predictions, the following error was thrown. Is it possible that chemprop.utils.v1_to_v2 does not work correctly (i.e., does not convert the NNs for both SMILES columns)?

poetry run chemprop predict --test-path data/test_natural.csv --smiles-columns smiles solvent --model-path hyperopt/best_params/fold_0/model_0/model_v2.pt  --preds-path data/preds_natural_fold0_{params["id"]}.csv

Traceback (most recent call last):
  File "/Users/dkarlov/Library/Caches/pypoetry/virtualenvs/si-data-fDi9I0YB-py3.11/bin/chemprop", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/dkarlov/Library/Caches/pypoetry/virtualenvs/si-data-fDi9I0YB-py3.11/lib/python3.11/site-packages/chemprop/cli/main.py", line 85, in main
    func(args)
  File "/Users/dkarlov/Library/Caches/pypoetry/virtualenvs/si-data-fDi9I0YB-py3.11/lib/python3.11/site-packages/chemprop/cli/predict.py", line 41, in func
    main(args)
  File "/Users/dkarlov/Library/Caches/pypoetry/virtualenvs/si-data-fDi9I0YB-py3.11/lib/python3.11/site-packages/chemprop/cli/predict.py", line 350, in main
    make_prediction_for_models(args, model_paths, multicomponent, output_path=args.output)
  File "/Users/dkarlov/Library/Caches/pypoetry/virtualenvs/si-data-fDi9I0YB-py3.11/lib/python3.11/site-packages/chemprop/cli/predict.py", line 176, in make_prediction_for_models
    model = load_model(model_paths[0], multicomponent)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dkarlov/Library/Caches/pypoetry/virtualenvs/si-data-fDi9I0YB-py3.11/lib/python3.11/site-packages/chemprop/models/utils.py", line 22, in load_model
    model = MulticomponentMPNN.load_from_file(path, map_location=torch.device("cpu"))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dkarlov/Library/Caches/pypoetry/virtualenvs/si-data-fDi9I0YB-py3.11/lib/python3.11/site-packages/chemprop/models/multi.py", line 84, in load_from_file
    for block_hparams in hparam_kwargs["blocks"]
                         ~~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'blocks'

JacksonBurns commented 1 month ago

@KnathanM have we seen this before?

shihchengli commented 1 month ago

It looks like the current convert CLI doesn't work for multicomponent models.

JacksonBurns commented 1 month ago

Thanks @shihchengli! @dskarlov, that converter doesn't work for MulticomponentMPNNs. The development team is focused on v2.1 development at the moment, so we don't have the resources to implement this. You will need to re-train your model in v2.

KnathanM commented 1 month ago

Now that v2.1 is out, I had time to look into this. @dskarlov could you test this updated v1 to v2 conversion script on your v1 model file to see if it works? https://github.com/KnathanM/chemprop/blob/90ff622c58db55a0783f58c3d3584a007b1d9757/chemprop/utils/v1_to_v2.py#L323

It is on a branch of my fork.

dskarlov commented 1 month ago

Hi Nathan, I've tested your code and it converts the NN weights without errors. However, there is a slightly different problem: I cannot use the resulting model and get the following error: RuntimeError: linear(): input and weight.T shapes cannot be multiplied (3748x86 and 147x300). It seems the number of input features has changed! When I trained directly in v2, the input layer had the correct dimensionality, 86x300.

The command I used for training in the first version is:

python chemprop_train  \
                  --data_path train_all.csv \
                  --smiles_columns smiles solvent \
                  --dataset_type regression \
                  --target_columns peakwavs_max \
                  --loss_function mse \
                  --separate_test_path test_natural.csv \
                  --split_type cv-no-test \
                  --num_folds 5 \
                  --seed 123 \
                  --pytorch_seed 42 \
                  --metric mae \
                  --extra_metrics rmse \
                  --cache_cutoff inf \
                  --save_dir {best_params} \
                  --batch_size {params["batch_size"]} \
                  --hidden_size {params["hidden_size"]} \
                  --activation {params["activation"]} \
                  --aggregation {params["aggregation"]} \
                  --depth {params["depth"]} \
                  --dropout {params["dropout"]} \
                  --ffn_num_layers {params["ffn_num_layers"]} \
                  --ffn_hidden_size {params["ffn_hidden_size"]} \
                  --warmup_epochs {params["warmup_epochs"]} \
                  --init_lr {params["init_lr"]} \
                  --max_lr {params["max_lr"]} \
                  --final_lr {params["final_lr"]} \
                  --adding_h \
                  --number_of_molecules 2 \
                  --gpu 0 \
                  --epochs 100 \
                  --ensemble_size 1

And for version 2:

chemprop train  \
                  --data-path data_all.csv \
                  --smiles-columns smiles solvent \
                  --task-type regression \
                  --target-columns peakwavs_max \
                  --loss-function mse \
                  --split cv_no_val \
                  --splits-column split \
                  --num-folds 5 \
                  --data-seed 123 \
                  --pytorch-seed 42 \
                  --metric mae rmse \
                  --save-dir {best_params} \
                  --batch-size {params["batch_size"]} \
                  --message-hidden-dim {params["hidden_size"]} \
                  --activation {params["activation"]} \
                  --aggregation {params["aggregation"]} \
                  --depth {params["depth"]} \
                  --dropout {params["dropout"]} \
                  --ffn-num-layers {params["ffn_num_layers"]} \
                  --ffn-hidden-dim {params["ffn_hidden_size"]} \
                  --warmup-epochs {params["warmup_epochs"]} \
                  --init-lr {params["init_lr"]} \
                  --max-lr {params["max_lr"]} \
                  --final-lr {params["final_lr"]} \
                  --add-h \
                  --accelerator gpu \
                  --devices auto \
                  --epochs 100 \
                  --ensemble-size 1 

The params dictionary is the same in both cases. So what do you think?

KnathanM commented 1 month ago

Yes, that is a good point. The default atom featurizer was changed going from v1 to v2. You will probably need --multi-hot-atom-featurizer-mode v1. If this works for you, I'll add a warning message to the conversion script telling users that they probably need to add this flag and then open a PR to bring this into main Chemprop.

As background: the v1 atom featurizer reserved a bit for every atomic number from 1 to 100 by default. This is far more than is usually needed and led to larger models. In v2 we reduced the supported elements to the first four rows of the periodic table plus iodine.
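
As a quick sanity check, the two featurizer lengths can be compared directly. This is a minimal sketch using the v2 featurizers API shown later in this thread; the exact printed lengths depend on your chemprop version, but each atom-feature length plus the 14 bond features should match the 147 and 86 input dimensions in the error above:

    from chemprop import featurizers

    # v1-style atom featurizer: reserves a bit for every atomic number 1-100
    atom_v1 = featurizers.MultiHotAtomFeaturizer.v1()
    # v2 default: first four rows of the periodic table plus iodine
    atom_v2 = featurizers.MultiHotAtomFeaturizer.v2()

    # atom feature length + bond feature length = message passing input dim
    print(len(atom_v1), len(atom_v2))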

dskarlov commented 1 month ago

Thanks Nathan, it seems to work formally and producing outputs but, unfortunately, no correlation for the test set (experiment and predictions) is observed in contrast to v1. Although the predicted values make sense - the scaler works. There is probably something else there but I do not know where else to dig. Could anything else have changed like order of vectors concatentation after aggregation, etc.??

KnathanM commented 1 month ago

I ran some tests today and found that v1 and v2 gave me the same predictions, so I'm not sure why you see a difference. Could you repeat my experiment to see if it works for you?

  1. Save this file as mol+mol.csv.
  2. On a fresh install of v1.7.1, run this in a notebook:
    
    from chemprop.args import TrainArgs
    from chemprop.models.model import MoleculeModel
    from chemprop.utils import save_checkpoint
    import pandas as pd
    from chemprop import data

    args = TrainArgs().parse_args(
        ["--data_path", "...", "--dataset_type", "regression", "--number_of_molecules", "2"]
    )
    args.task_names = ["hack to get num_tasks=1"]
    model = MoleculeModel(args)
    save_checkpoint("test_v1_mol+mol.pt", model, args=args)

    df = pd.read_csv("mol+mol.csv")
    smiles = [[a, b] for a, b in zip(df["smiles"], df["solvent"])]
    datapoints = [data.MoleculeDatapoint(smiles=smile) for smile in smiles]
    dataset = data.MoleculeDataset(datapoints)
    dataloader = data.MoleculeDataLoader(dataset, batch_size=4)
    for batch in dataloader:
        bmg = batch.batch_graph()
        break

    model(bmg)

  3. Load chemprop v2.1 and check out [my branch](https://github.com/KnathanM/chemprop/blob/better_v1_v2_conversion/chemprop/utils/v1_to_v2.py).
  4. Run this in a notebook:

    from chemprop import data, models, nn, featurizers
    import torch
    import pandas as pd
    from chemprop.utils.v1_to_v2 import convert_model_file_v1_to_v2

    convert_model_file_v1_to_v2("test_v1_mol+mol.pt", "test_v2_mol+mol.pt")
    model = models.MulticomponentMPNN.load_from_file("test_v2_mol+mol.pt")

    df = pd.read_csv("mol+mol.csv")
    featurizer = featurizers.SimpleMoleculeMolGraphFeaturizer(
        atom_featurizer=featurizers.MultiHotAtomFeaturizer.v1(),
        bond_featurizer=featurizers.MultiHotBondFeaturizer(),
    )
    datapoints1 = [data.MoleculeDatapoint.from_smi(smile) for smile in df.smiles]
    dataset1 = data.MoleculeDataset(datapoints1, featurizer=featurizer)
    datapoints2 = [data.MoleculeDatapoint.from_smi(smile) for smile in df.solvent]
    dataset2 = data.MoleculeDataset(datapoints2, featurizer=featurizer)
    dataset = data.MulticomponentDataset([dataset1, dataset2])
    dataloader = data.build_dataloader(dataset, batch_size=4, shuffle=False)
    for batch in dataloader:
        bmg, *_ = batch
        break

    model(bmg)

This is a minimal example and doesn't include target scaling. But it shows that a basic multicomponent model gives the same results in v1 and v2. It is also possible that the error is in the CLI, but that would take another test. See if this works for you first.

dskarlov commented 4 weeks ago

Yes, both scripts produced the same numbers, which is good!

Prediction from the command line is done as follows. In 1.7.1:

python chemprop_predict \
                          --test_path test_natural.csv \
                          --number_of_molecules 2 \
                          --smiles_columns smiles solvent \
                          --checkpoint_path {best_params}/fold_{fold}/model_0/model.pt \
                          --preds_path preds_natural_fold{fold}_{params["id"]}.csv

And in 2.1.0:

chemprop predict \
                          --test-path data/test_natural.csv \
                          --smiles-columns smiles solvent \
                          --model-path hyperopt/best_params/fold_{fold}/model_0/model_v2.pt \
                          --multi-hot-atom-featurizer-mode v1 \
                          --accelerator cpu \
                          --devices auto \
                          --preds-path data/preds_natural_fold{fold}_{params["id"]}.csv

The model.pt files were converted to model_v2.pt using your branch and the notebook code provided above.

This is the JSON file with the parameter set I used for training:

    {"activation": "ReLU", "aggregation": "mean", "batch-size": 70, "message-bias": "",
     "depth": 4, "dropout": 0.07928777915219476, "ffn-hidden-dim": 100, "ffn-num-layers": 2,
     "final-lr": 0.0008745233959951397, "message-hidden-dim": 300, "init-lr": 0.0010170618974826723,
     "max-lr": 0.008213119797581435, "warmup-epochs": 5, "id": "f8f937d8-e7f3-4073-a91c-e50fa78313d7"}

Interestingly, chemprop v2 produced three ffn layers in the readout phase (when trained from scratch), compared to two in v1.7.1.

dskarlov commented 4 weeks ago

Nathan, I found the reason! Adding --add-h to the command-line arguments in prediction mode solves the issue. Thanks a lot! I think we can close this thread.
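
That also explains the earlier lack of correlation: --add-h changes the molecular graph itself, so a model trained on graphs with explicit hydrogens sees different inputs if predictions are run without it. A quick RDKit illustration, purely to show that the graphs differ (rdkit is already a chemprop dependency):

    from rdkit import Chem

    mol = Chem.MolFromSmiles("CCO")  # ethanol: 3 heavy atoms, Hs implicit
    mol_h = Chem.AddHs(mol)          # explicit hydrogens become graph nodes

    print(mol.GetNumAtoms(), mol_h.GetNumAtoms())  # 3 vs 9 atoms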

KnathanM commented 4 weeks ago

> Interestingly, chemprop v2 produced three ffn layers in the readout phase (when trained from scratch), compared to two in v1.7.1.

The default for v2 is a single hidden ffn layer, meaning:

RegressionFFN(
  (ffn): MLP(
    (0): Sequential(
      (0): Linear(in_features=300, out_features=300, bias=True)
    )
    (1): Sequential(
      (0): ReLU()
      (1): Dropout(p=0.0, inplace=False)
      (2): Linear(in_features=300, out_features=1, bias=True)
    )
  )
  (criterion): MSE(task_weights=[[1.0]])
  (output_transform): Identity()
)

I think v1 used two hidden layers but could be wrong.

I am surprised you needed to include --add-h in the CLI arguments when your v1 prediction command didn't include --adding_h. My understanding is that the default for this didn't change.

I think we can close this thread. Glad we could help get this working for you. I'll leave this open until I add a warning to the v1_to_v2 conversion script telling users that they need to use the v1 multi-hot atom featurizer. Thanks.

shihchengli commented 4 weeks ago

RE: the number of ffn hidden layers - the default value is 1 in v2 but 2 in v1. Note that the two flags also count differently: v1's --ffn_num_layers counted all linear layers including the output layer (so its default of 2 built one hidden layer), while v2's --ffn-num-layers counts hidden layers only. That is also why passing 2 in v2 yields one more layer than it did in v1.
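
A minimal sketch of the difference, assuming v2's nn.RegressionFFN constructor and its n_layers argument (names may vary slightly across versions):

    from chemprop import nn

    # v2 default: n_layers=1 -> one hidden layer, two Linear layers total
    # (this matches the RegressionFFN printout above)
    ffn_default = nn.RegressionFFN(input_dim=300, hidden_dim=300, n_layers=1)

    # n_layers=2 -> three Linear layers total, which is what
    # --ffn-num-layers 2 builds in v2 and would explain the "three ffn
    # layers" observed when retraining from scratch with the old params
    ffn_deeper = nn.RegressionFFN(input_dim=300, hidden_dim=300, n_layers=2)

    print(ffn_default)
    print(ffn_deeper)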