dskarlov opened 1 month ago

Given a model trained using chemprop 1.7.1 which uses two SMILES columns: when I switched to chemprop 2.0.5, I converted the model files and surprisingly the converted file required roughly half of the original memory (2.3 MB -> 1.2 MB). When I then attempted to run predictions, the following error was thrown. Is there any chance chemprop.utils.v1_to_v2 does not work correctly, i.e. does not convert both NNs for the two SMILES columns?
@KnathanM have we seen this before?
It looks like the current convert CLI doesn't work for multicomponent models.
Thanks @shihchengli! @dskarlov that converter doesn't work for MulticomponentMPNNs. The development team is focused on v2.1 at the moment, so we don't have the resources to implement this. You will need to re-train your model in v2.
Now that v2.1 is out, I had time to look into this. @dskarlov could you test this updated v1 to v2 conversion script on your v1 model file to see if it works? https://github.com/KnathanM/chemprop/blob/90ff622c58db55a0783f58c3d3584a007b1d9757/chemprop/utils/v1_to_v2.py#L323
It is on a branch of my fork.
Hi Nathan, I've tested your code, and it converts the NN weights without errors. However, there is a slightly different problem: I cannot use the converted model, and I get the following error:

```
RuntimeError: linear(): input and weight.T shapes cannot be multiplied (3748x86 and 147x300)
```

It seems the number of input features has changed! When I train in v2 from scratch, the input layer has the correct dimensionality, 86x300.
The command I used for training in the first version is:
```bash
python chemprop_train \
    --data_path train_all.csv \
    --smiles_columns smiles solvent \
    --dataset_type regression \
    --target_columns peakwavs_max \
    --loss_function mse \
    --separate_test_path test_natural.csv \
    --split_type cv-no-test \
    --num_folds 5 \
    --seed 123 \
    --pytorch_seed 42 \
    --metric mae \
    --extra_metrics rmse \
    --cache_cutoff inf \
    --save_dir {best_params} \
    --batch_size {params["batch_size"]} \
    --hidden_size {params["hidden_size"]} \
    --activation {params["activation"]} \
    --aggregation {params["aggregation"]} \
    --depth {params["depth"]} \
    --dropout {params["dropout"]} \
    --ffn_num_layers {params["ffn_num_layers"]} \
    --ffn_hidden_size {params["ffn_hidden_size"]} \
    --warmup_epochs {params["warmup_epochs"]} \
    --init_lr {params["init_lr"]} \
    --max_lr {params["max_lr"]} \
    --final_lr {params["final_lr"]} \
    --adding_h \
    --number_of_molecules 2 \
    --gpu 0 \
    --epochs 100 \
    --ensemble_size 1
```
And for version 2:
```bash
chemprop train \
    --data-path data_all.csv \
    --smiles-columns smiles solvent \
    --task-type regression \
    --target-columns peakwavs_max \
    --loss-function mse \
    --split cv_no_val \
    --splits-column split \
    --num-folds 5 \
    --data-seed 123 \
    --pytorch-seed 42 \
    --metric mae rmse \
    --save-dir {best_params} \
    --batch-size {params["batch_size"]} \
    --message-hidden-dim {params["hidden_size"]} \
    --activation {params["activation"]} \
    --aggregation {params["aggregation"]} \
    --depth {params["depth"]} \
    --dropout {params["dropout"]} \
    --ffn-num-layers {params["ffn_num_layers"]} \
    --ffn-hidden-dim {params["ffn_hidden_size"]} \
    --warmup-epochs {params["warmup_epochs"]} \
    --init-lr {params["init_lr"]} \
    --max-lr {params["max_lr"]} \
    --final-lr {params["final_lr"]} \
    --add-h \
    --accelerator gpu \
    --devices auto \
    --epochs 100 \
    --ensemble-size 1
```
The params dictionary is the same in both cases. So what do you think?
Yes, that is a good point. The default atom featurizer changed going from v1 to v2. You will probably need --multi-hot-atom-featurizer-mode v1. If this works for you, I'll add a warning message to the conversion script telling users that they probably need to add this flag, and then open a PR to bring this into main Chemprop.
As background: the v1 atom featurizer reserved a bit for every element with atomic number 1-100 by default. This is far more than is usually needed and led to larger models. In v2 we reduced the default to the first four rows of the periodic table plus iodine.
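If you want to sanity-check the widths yourself, this snippet compares the two modes (the expected lengths below are my numbers, assuming 14 bond features in both versions: 133 + 14 = 147 in v1 vs 72 + 14 = 86 in v2, which matches the shapes in the RuntimeError above):

```python
from chemprop import featurizers

# Compare atom-feature lengths of the two featurizer modes shipped with v2.
v1_atoms = featurizers.MultiHotAtomFeaturizer.v1()
v2_atoms = featurizers.MultiHotAtomFeaturizer.v2()
print(len(v1_atoms), len(v2_atoms))  # expected: 133 and 72
```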
Thanks Nathan, it now runs and produces outputs, but unfortunately no correlation between experiment and predictions is observed for the test set, in contrast to v1. The predicted values are at least in a sensible range, so the scaler works. There is probably something else going on, but I don't know where else to dig. Could anything else have changed, like the order of vector concatenation after aggregation, etc.?
I ran some tests today and found that v1 and v2 gave me the same predictions, so I'm not sure why you see a difference. Could you repeat my experiment to see if it works for you?

1. Download the attached mol+mol.csv.
2. In a chemprop v1 environment, run this in a notebook:
```python
from chemprop.args import TrainArgs
from chemprop.models.model import MoleculeModel
from chemprop.utils import save_checkpoint
import pandas as pd
from chemprop import data

# Build an untrained two-component v1 model and save a checkpoint.
args = TrainArgs().parse_args(
    ["--data_path", "...", "--dataset_type", "regression", "--number_of_molecules", "2"]
)
args.task_names = ["hack to get num_tasks=1"]
model = MoleculeModel(args)
save_checkpoint("test_v1_mol+mol.pt", model, args=args)

# Featurize both SMILES columns and run one batch through the model.
df = pd.read_csv("mol+mol.csv")
smiles = [[a, b] for a, b in zip(df["smiles"], df["solvent"])]
datapoints = [data.MoleculeDatapoint(smiles=smile) for smile in smiles]
dataset = data.MoleculeDataset(datapoints)
dataloader = data.MoleculeDataLoader(dataset, batch_size=4)
for batch in dataloader:
    bmg = batch.batch_graph()
    break

model(bmg)
```
3. Load chemprop v2.1 and check out [my branch](https://github.com/KnathanM/chemprop/blob/better_v1_v2_conversion/chemprop/utils/v1_to_v2.py)
4. Run this in a notebook:
```python
from chemprop import data, models, nn, featurizers
import torch
import pandas as pd
from chemprop.utils.v1_to_v2 import convert_model_file_v1_to_v2

# Convert the v1 checkpoint and load it as a v2 multicomponent model.
convert_model_file_v1_to_v2("test_v1_mol+mol.pt", "test_v2_mol+mol.pt")
model = models.MulticomponentMPNN.load_from_file("test_v2_mol+mol.pt")

# Featurize both components with the v1-style atom featurizer.
df = pd.read_csv("mol+mol.csv")
featurizer = featurizers.SimpleMoleculeMolGraphFeaturizer(
    atom_featurizer=featurizers.MultiHotAtomFeaturizer.v1(),
    bond_featurizer=featurizers.MultiHotBondFeaturizer(),
)
datapoints1 = [data.MoleculeDatapoint.from_smi(smile) for smile in df.smiles]
dataset1 = data.MoleculeDataset(datapoints1, featurizer=featurizer)
datapoints2 = [data.MoleculeDatapoint.from_smi(smile) for smile in df.solvent]
dataset2 = data.MoleculeDataset(datapoints2, featurizer=featurizer)
dataset = data.MulticomponentDataset([dataset1, dataset2])
dataloader = data.build_dataloader(dataset, batch_size=4, shuffle=False)
for batch in dataloader:
    bmg, *_ = batch
    break

model(bmg)
```
This is a minimal example and doesn't include target scaling, but it shows that a basic multicomponent model is the same between v1 and v2. It is also possible that the error is in the CLI, but that would take another test. See if this works for you first.
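If you did want target scaling, a rough sketch under the v2 training-example API (normalize_targets on the dataset plus nn.UnscaleTransform; this assumes your datapoints were created with y values, unlike the minimal example above) would look like:

```python
from chemprop import nn

# Fit a StandardScaler on the training targets and attach the inverse
# transform to the readout FFN so predictions come back in original units.
scaler = dataset.normalize_targets()
output_transform = nn.UnscaleTransform.from_standard_scaler(scaler)
# input_dim=600 here because two components each contribute a 300-dim fingerprint.
ffn = nn.RegressionFFN(input_dim=600, output_transform=output_transform)
```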
Yes, both scripts produced the same number, which is good!

The prediction from the command line is done as follows. In 1.7.1:
```bash
python chemprop_predict \
    --test_path test_natural.csv \
    --number_of_molecules 2 \
    --smiles_columns smiles solvent \
    --checkpoint_path {best_params}/fold_{fold}/model_0/model.pt \
    --preds_path preds_natural_fold{fold}_{params["id"]}.csv
```
And in 2.1.0:
```bash
chemprop predict \
    --test-path data/test_natural.csv \
    --smiles-columns smiles solvent \
    --model-path hyperopt/best_params/fold_{fold}/model_0/model_v2.pt \
    --multi-hot-atom-featurizer-mode v1 \
    --accelerator cpu \
    --devices auto \
    --preds-path data/preds_natural_fold{fold}_{params["id"]}.csv
```
The model.pt files are converted to model_v2.pt using your branch and the provided Jupyter notebook.
The JSON file with the parameter set I used for training:

```json
{
  "activation": "ReLU",
  "aggregation": "mean",
  "batch-size": 70,
  "message-bias": "",
  "depth": 4,
  "dropout": 0.07928777915219476,
  "ffn-hidden-dim": 100,
  "ffn-num-layers": 2,
  "final-lr": 0.0008745233959951397,
  "message-hidden-dim": 300,
  "init-lr": 0.0010170618974826723,
  "max-lr": 0.008213119797581435,
  "warmup-epochs": 5,
  "id": "f8f937d8-e7f3-4073-a91c-e50fa78313d7"
}
```
Interestingly, chemprop v2 produced three ffn layers in the readout phase (when trained from scratch) compared to two in v1.7.1.
Nathan, I found the reason! Adding --add-h to the command-line arguments in prediction mode solves the issue. Thanks a lot! I think we can close this thread.
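For reference, a minimal sketch of why the flag matters (assuming v2's MoleculeDatapoint.from_smi accepts an add_h flag, as in the v2 data API): explicit hydrogens enlarge the molecular graph, so a model trained with --adding_h/--add-h must see the same kind of graphs at prediction time.

```python
from chemprop import data

dp = data.MoleculeDatapoint.from_smi("CCO")
dp_h = data.MoleculeDatapoint.from_smi("CCO", add_h=True)
# Ethanol: 3 heavy atoms without explicit Hs, 9 atoms with them.
print(dp.mol.GetNumAtoms(), dp_h.mol.GetNumAtoms())
```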
> Interestingly, chemprop v2 produced three ffn layers in the readout phase (when trained from scratch) compared to two in v1.7.1.

The default for v2 is a single hidden ffn layer, meaning:
```
RegressionFFN(
  (ffn): MLP(
    (0): Sequential(
      (0): Linear(in_features=300, out_features=300, bias=True)
    )
    (1): Sequential(
      (0): ReLU()
      (1): Dropout(p=0.0, inplace=False)
      (2): Linear(in_features=300, out_features=1, bias=True)
    )
  )
  (criterion): MSE(task_weights=[[1.0]])
  (output_transform): Identity()
)
```
I think v1 used two hidden layers but could be wrong.
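A quick way to check (assuming n_layers in nn.RegressionFFN counts hidden layers, so n_layers=2 should reproduce the v1 default of two hidden FFN layers):

```python
from chemprop import nn

# Printing the module shows the extra 300->300 hidden layer in the repr.
print(nn.RegressionFFN(n_layers=2))
```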
I am surprised you needed to include --add-h in the CLI arguments, since your v1 prediction command didn't include --adding_h. My understanding is that the default didn't change for this.
> I think we can close this thread.

Glad we could help get this working for you. I'll leave this open until I add a warning to the v1_to_v2 conversion script that users need to use the v1 multi-hot atom featurizer. Thanks.
RE: the number of ffn hidden layers - the default value is 1 in v2 but 2 in v1.
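For the record, a sketch of the warning I have in mind for convert_model_file_v1_to_v2 (final wording may differ):

```python
import warnings

# Hypothetical wording; emitted at the end of the v1-to-v2 conversion.
warnings.warn(
    "This model was trained with the v1 atom featurizer. At inference time, "
    "pass --multi-hot-atom-featurizer-mode v1 on the CLI (or use "
    "featurizers.MultiHotAtomFeaturizer.v1() in Python), and repeat any "
    "--adding_h flag from v1 as --add-h."
)
```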