Parallelize training in multiple GPUs.

soulios commented 1 month ago

Following the notebook, I tried to build a DMPNN model for tox21. For some reason, training takes much longer than with chemprop's PyTorch implementation, so I tried to parallelize it across multiple GPUs and got the error below. How can I overcome it, and how can I speed up training? Currently, 30 epochs take under 4 minutes with the chemprop implementation versus 13 minutes 20 seconds here. (The only architectural difference was that chemprop used a learning-rate scheduler.)


strategy = tf.distribute.MultiWorkerMirroredStrategy()
#strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    tox21 = chemistry.datasets.get("tox21")

    x_train = tox21["train"]["x"]
    y_train = tox21["train"]["y"]
    m_train = tox21["train"]["y_mask"]

    x_val = tox21["validation"]["x"]
    y_val = tox21["validation"]["y"]
    m_val = tox21["validation"]["y_mask"]

    x_test = tox21["test"]["x"]
    y_test = tox21["test"]["y"]
    m_test = tox21["test"]["y_mask"]

    atom_encoder = chemistry.Featurizer([
        chemistry.features.Symbol(),
        chemistry.features.Hybridization(),
        chemistry.features.TotalValence(),
        chemistry.features.Hetero(),
        chemistry.features.HydrogenDonor(),
        chemistry.features.HydrogenAcceptor(),
    ])

    bond_encoder = chemistry.Featurizer([
        chemistry.features.BondType(),
        chemistry.features.Rotatable(),
    ])

    mol_encoder = chemistry.MolecularGraphEncoder(atom_encoder, bond_encoder, positional_encoding_dim=None)

    train_graph = mol_encoder(x_train)
    val_graph = mol_encoder(x_val)
    test_graph = mol_encoder(x_test)

    train_data = (train_graph, y_train, m_train)
    val_data = (val_graph, y_val, m_val)
    test_data = (test_graph, y_test, m_test)

    train_ds = (
        tf.data.Dataset.from_tensor_slices(train_data)
        .shuffle(1024)
        .batch(32)
        .prefetch(-1)
    )

    val_ds = (
        tf.data.Dataset.from_tensor_slices(val_data)
        .batch(32)
        .prefetch(-1)
    )

    test_ds = (
        tf.data.Dataset.from_tensor_slices(test_data)
        .batch(32)
        .prefetch(-1)
    )

    inputs = layers.GNNInput(type_spec=train_graph.spec)

    x = DMPNN(units=300, steps=3, normalization='batch_norm', residual=True)(inputs)
    x = layers.SetGatherReadout()(x)
    z = keras.layers.Dense(units=300, activation="relu")(x)
    z = keras.layers.Dense(units=300, activation="relu")(z)
    outputs = keras.layers.Dense(units=12, activation="sigmoid")(z)

    # Create and compile the model
    qsar_model = keras.Model(inputs=inputs, outputs=outputs)
    optimizer = keras.optimizers.Adam(
        learning_rate=0.0001
    )

    loss = losses.MaskedBinaryCrossentropy()

    metrics = [
        keras.metrics.AUC(multi_label=True, name="auc"),
    ]

    qsar_model.compile(
        optimizer=optimizer, 
        loss=loss, 
        weighted_metrics=metrics
    )

    callbacks = [
        keras.callbacks.ReduceLROnPlateau(
            monitor="val_auc", patience=10, mode="max"
        ),
        keras.callbacks.EarlyStopping(
            monitor="val_auc", patience=20, mode="max",
            restore_best_weights=True
        ),
    ]

    callbacks += [
        keras.callbacks.TensorBoard(
            log_dir="./logs", histogram_freq=1)
    ]

    qsar_model.fit(
        train_ds,
        callbacks=callbacks,
        validation_data=val_ds,
        epochs=30, 
    )

    bce_loss, auc_score = qsar_model.evaluate(test_ds)

Epoch 1/30
Traceback (most recent call last):
  File "/gpfs1/schlecker/home/soulios/reproducing-graphs/molgraph/molgraph/train.py", line 128, in <module>
    qsar_model.fit(
  File "/gpfs1/schlecker/home/soulios/miniforge3/envs/keras/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/tmp/autograph_generated_filef80d5yys.py", line 15, in tf__train_function
    retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
TypeError: in user code:

File "/gpfs1/schlecker/home/soulios/miniforge3/envs/keras/lib/python3.10/site-packages/keras/src/engine/training.py", line 1401, in train_function  *
    return step_function(self, iterator)
File "/gpfs1/schlecker/home/soulios/miniforge3/envs/keras/lib/python3.10/site-packages/keras/src/engine/training.py", line 1383, in step_function  **
    data = next(iterator)

TypeError: true_fn and false_fn arguments to tf.cond must have the same number, type, and overall structure of return values.

true_fn output: (GraphTensor(
  sizes=<tf.Tensor: shape=(None,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(None, 131), dtype=float32>,
  edge_src=<tf.Tensor: shape=(None,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(None,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(None, 5), dtype=float32>), <tf.Tensor 'cond/cond/Identity_5:0' shape=(None, 12) dtype=float32>, <tf.Tensor 'cond/cond/Identity_6:0' shape=(None, 12) dtype=float32>)
false_fn output: (<tf.Tensor 'cond/cond/Identity:0' shape=(0, 0, 131) dtype=float32>, <tf.Tensor 'cond/cond/Identity_1:0' shape=(0, 12) dtype=float32>, <tf.Tensor 'cond/cond/Identity_2:0' shape=(0, 12) dtype=float32>)

Error details:
The two structures don't have the same nested structure.

First structure: type=tuple str=(GraphTensor(
  sizes=<tf.Tensor: shape=(None,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(None, 131), dtype=float32>,
  edge_src=<tf.Tensor: shape=(None,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(None,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(None, 5), dtype=float32>), <tf.Tensor 'cond/cond/Identity_5:0' shape=(None, 12) dtype=float32>, <tf.Tensor 'cond/cond/Identity_6:0' shape=(None, 12) dtype=float32>)

Second structure: type=tuple str=(<tf.Tensor 'cond/cond/Identity:0' shape=(0, 0, 131) dtype=float32>, <tf.Tensor 'cond/cond/Identity_1:0' shape=(0, 12) dtype=float32>, <tf.Tensor 'cond/cond/Identity_2:0' shape=(0, 12) dtype=float32>)

More specifically: Substructure "type=GraphTensor str=GraphTensor(
  sizes=<tf.Tensor: shape=(None,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(None, 131), dtype=float32>,
  edge_src=<tf.Tensor: shape=(None,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(None,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(None, 5), dtype=float32>)" is a sequence, while substructure "type=SymbolicTensor str=Tensor("cond/cond/Identity:0", shape=(0, 0, 131), dtype=float32, device=/job:localhost/replica:0/task:0/device:GPU:0)" is not
Entire first structure:
(., ., .)
Entire second structure:
(., ., .)

akensert commented 1 month ago

Thanks for the feedback @soulios, I'll look into it :)

Is it only DMPNN that is significantly slower? What about e.g., GINConv, GATv2Conv, or perhaps MPNN?

And is it specifically training that is slow? Or is it the generation of the input?

A few quick things to try to speed up training: comment out the TensorBoard callback, replace SetGatherReadout with the (vanilla) Readout, and replace DMPNN with e.g. MPNN or GIN.
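To help answer whether the input pipeline or the training step dominates, one option is to time a full pass over the batches on its own, with no model involved, and compare that with the per-epoch time that `fit` reports. The following is only a rough sketch: `time_one_pass` is a hypothetical helper, not part of molgraph or Keras, and it assumes the dataset can be iterated directly in Python (as `tf.data.Dataset` objects can in eager mode).

```python
import time
from typing import Iterable, Tuple


def time_one_pass(batches: Iterable) -> Tuple[int, float]:
    """Iterate over `batches` once without training.

    Returns the number of batches seen and the elapsed wall-clock seconds.
    """
    start = time.perf_counter()
    count = 0
    for _ in batches:
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed


# Stand-in for a real pipeline such as `train_ds`: 100 batches of 32 items.
fake_batches = (list(range(32)) for _ in range(100))
n_batches, seconds = time_one_pass(fake_batches)
print(f"{n_batches} batches in {seconds:.4f} s")
```

If one pass over the real dataset takes only a small fraction of an epoch, the slowdown is more likely in the model's forward and backward pass than in input generation.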

soulios commented 1 month ago

I compared it with chemprop, which only has DMPNN, so I can't say anything about the other models. I use DMPNN because it is more or less the SOTA on the tasks I'm interested in (tox21 etc.). I was referring only to training (not the molecular encoding). Thanks, I will try these.

soulios commented 1 month ago

Also, somewhat related to speed: I tried saving and loading with TFRecords, and when loading, my GPU memory maxes out. Have you encountered such an issue? A few notebooks/examples on this, as well as on pretraining, would be helpful.
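On the GPU memory point, one thing worth ruling out: by default TensorFlow reserves nearly all memory on every visible GPU once it initializes them, so a "maxed out" reading in a monitoring tool does not necessarily mean the TFRecord pipeline itself is using that memory. A minimal sketch, assuming this default allocation is what is being observed (it may not be), is to request on-demand allocation through the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable. The helper name below is hypothetical.

```python
import os


def request_gpu_memory_growth() -> str:
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it all.

    This must run before TensorFlow initializes the GPUs; calling it before
    `import tensorflow` is the simplest way to be sure.
    """
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return os.environ["TF_FORCE_GPU_ALLOW_GROWTH"]


request_gpu_memory_growth()
# import tensorflow as tf  # import TensorFlow only after the call above
```

If memory use still climbs until an out-of-memory error is raised, the cause lies elsewhere, for example in how the records are parsed or batched.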

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

akensert / molgraph

Parallelize training in multiple GPUs. #28