choderalab / espaloma

Extensible Surrogate Potential of Ab initio Learned and Optimized by Message-passing Algorithm 🍹 https://arxiv.org/abs/2010.01196
https://docs.espaloma.org/en/latest/
MIT License

Trying to train on gen2 dataset #110

Open GhermanProteic opened 2 years ago

GhermanProteic commented 2 years ago

Hi,

I am trying to overfit espaloma on a small batch from the gen2 dataset. I noticed that the reference energy u_ref is on a large negative scale:

ds = esp.data.dataset.GraphDataset.load("gen2")
ds = ds[:10]
ds.shuffle(seed=2666)
ds_tr, ds_vl, ds_te = ds.split([5, 3, 2])

ds_tr_loader = ds_tr.view(batch_size=1, shuffle=True)
ds_vl_loader = ds_vl.view(batch_size=1, shuffle=True)

g_tr = next(iter(ds_tr.view(batch_size=1)))
torch.mean(g_tr.nodes["g"].data['u_ref'], dim=1)
tensor([-1988.4373])

(As a side note, when I try to increase the batch size, I get the following error.)

Expect all graphs to have the same schema on nodes["g"].data, but graph 1 got
    {'u_openff-1.2.0': Scheme(shape=(29,), dtype=torch.float32), 'u_gaff-2.11': Scheme(shape=(29,), dtype=torch.float32), 'u_qm': Scheme(shape=(29,), dtype=torch.float32), 'u_ref': Scheme(shape=(29,), dtype=torch.float32), 'u_gaff-1.81': Scheme(shape=(29,), dtype=torch.float32)}
which is different from
    {'u_openff-1.2.0': Scheme(shape=(77,), dtype=torch.float32), 'u_gaff-2.11': Scheme(shape=(77,), dtype=torch.float32), 'u_qm': Scheme(shape=(77,), dtype=torch.float32), 'u_ref': Scheme(shape=(77,), dtype=torch.float32), 'u_gaff-1.81': Scheme(shape=(77,), dtype=torch.float32)}.
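The two shapes in that error (29 vs. 77) presumably correspond to different numbers of conformations per molecule, so graphs with unequal conformation counts cannot be batched together. A quick check, reusing the calls from above:

# The last dimension of u_ref is the number of conformations per molecule;
# graphs can only be batched together when these shapes match, which is
# presumably where the (29,) vs. (77,) schema mismatch comes from.
n_confs = [g.nodes["g"].data["u_ref"].shape[-1] for g in ds_tr.view(batch_size=1)]
print(n_confs)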

The model that I am using is initialized in the following way:

representation = esp.nn.Sequential(
    layer=esp.nn.layers.dgl_legacy.gn("SAGEConv"), # use SAGEConv implementation in DGL
    config=[128, "relu", 128, "relu", 128, "relu"], # 3 layers, 128 units, ReLU activation
)

readout = esp.nn.readout.janossy.JanossyPooling(
    in_features=128, config=[128, "relu", 128, "relu", 128, "relu"],
    out_features={              # define modular MM parameters Espaloma will assign
        1: {"e": 1, "s": 1}, # atom hardness and electronegativity
        2: {"log_coefficients": 2}, # bond linear combination, enforce positive
        3: {"log_coefficients": 2}, # angle linear combination, enforce positive
        4: {"k": 6}, # torsion barrier heights (can be positive or negative)
    },
)

espaloma_model = torch.nn.Sequential(
                 representation, readout, 
                 esp.nn.readout.janossy.ExpCoefficients(),
                 esp.mm.geometry.GeometryInGraph(),
                 esp.mm.energy.EnergyInGraph(),
                 #esp.mm.energy.EnergyInGraph(suffix="_ref"),
                 esp.nn.readout.charge_equilibrium.ChargeEquilibrium(),
)

Now I am trying to overfit, and I am training with the following loop:

normalize = esp.data.normalize.ESOL100LogNormalNormalize()
for idx_epoch in tqdm(range(2000)):
    intr_loss = 0
    k = 0
    for g in ds_tr_loader:
        optimizer.zero_grad()
        if torch.cuda.is_available():
            g = g.to("cuda:0")
        g = espaloma_model(g)
        g = normalize.unnorm(g)

        loss = loss_fn(g)
        loss.backward()
        optimizer.step()
        intr_loss += loss.item()

I am using the following loss function:

loss_fn = esp.metrics.GraphMetric(
        base_metric=torch.nn.MSELoss(),
        between=['u', "u_ref"],
        level="g",)

After training, the training-loss curve looks like the following (epochs on the x-axis):

[training loss vs. epoch plot]

The loss gets stuck at ~1.4M when you would expect it to be close to 0 (since I am training on only 5 examples). The energy for individual examples converges to some small positive value:

g_tr = espaloma_model(g_tr)
g_tr.nodes["g"].data["u"]
1.2619

If I do the same on the pepconf dataset (peptides), I get similar results: the output of espaloma is on a different scale.

My question is, what am I doing wrong? Is it the model architecture? The normalizer? Or something else? I would appreciate any help.

Thanks!

tueboesen commented 2 years ago

I have essentially the same problem.

I followed the training method suggested here: https://espaloma.wangyq.net/experiments/qm_fitting.html. But the loss is basically in the millions from what I can see, and it looks like some kind of normalization is missing from the procedure.

yuanqing-wang commented 2 years ago

Sorry about this. It is caused by an outdated API that I needed to remove. I'll update the example in a day or two.

yuanqing-wang commented 2 years ago

The updated Colab notebook should work! http://data.wangyq.net/esp_notesbooks/qm_fitting.ipynb

Essentially the energies need to be centered before calculating the error.
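Concretely, a minimal sketch of what centering means here (subtracting each molecule's mean energy over its conformations before computing the error, so the large per-molecule offset in u_ref drops out):

import torch

def centered_mse(u, u_ref):
    # Subtract each molecule's mean over its conformations so that only
    # relative (conformational) energies are compared; the arbitrary
    # per-molecule offset in u_ref no longer dominates the loss.
    u = u - u.mean(dim=-1, keepdim=True)
    u_ref = u_ref - u_ref.mean(dim=-1, keepdim=True)
    return torch.nn.functional.mse_loss(u, u_ref)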

qcxia20 commented 2 years ago

@yuanqing-wang Hi, I want to ask a few questions based on the discussion above in this issue.

I also followed the code in the Colab notebook (training on the gen2 dataset u_ref, but only for 200 epochs). After training, I used each of the 200 saved models to predict the energy of a molecule in the gen2 dataset and plotted the results below. As you can see, the predicted energy keeps oscillating (I don't know whether this is just because I only trained for 200 epochs, but I would expect the model to converge). I also compared the predicted energies with the energies stored in the gen2 dataset (see below) and found that none of them are comparable in absolute value to the predictions.

During training, the total energy is fitted to .nodes["g"].data["u_ref"] for each molecule, which in my view should be something related to the QM calculation, whereas the energy given by the model is assembled from the n2 + n3 + n4_proper terms. But the loss function is defined through a normalization step, esp.metrics.center(), which means the model can only learn relative energies from the fitted data. Since the parameters in traditional force fields are fitted with the aim of reproducing experimental energies, I am confused.

In summary, here are 2 main questions:

  1. What is the meaning/source of the energies saved in the gen2 dataset, i.e. "u_ref", "u_qm", "u_gaff-1.81", and "u_openff-1.2.0"? Is there any relationship between "u_ref" and "u_qm"? Why was the training fitted on "u_ref" instead of "u_qm"?
  2. What is the physical meaning of the predicted energy "u"? Is this energy compatible with any existing force field?
[figures: predicted molecule energy over training epochs; comparison with the energies stored in the dataset]
>>> dataset_name = "gen2"
>>> ds = esp.data.dataset.GraphDataset.load(dataset_name)

>>> espaloma_model.load_state_dict(torch.load("100.th", map_location=torch.device("cpu")))
>>> espaloma_model(ds[5].heterograph)
>>> print(ds[5].heterograph.nodes["g"].data["u"][:,0:3])
# tensor([[7.8801, 7.8812, 7.8807]], grad_fn=<SliceBackward0>)

>>> espaloma_model.load_state_dict(torch.load("199.th", map_location=torch.device("cpu")))
>>> espaloma_model(ds[5].heterograph)
>>> print(ds[5].heterograph.nodes["g"].data["u"][:,0:3])
# tensor([[8.2513, 8.2540, 8.2544]], grad_fn=<SliceBackward0>)

>>> print(ds[5].heterograph.nodes["g"].data["u_ref"][:,0:3])
# tensor([[-752.6477, -752.6463, -752.6459]])

>>> print(ds[5].heterograph.nodes["g"].data['u_qm'][:,0:3])
# tensor([[-752.6166, -752.6210, -752.6230]])

>>> print(ds[5].heterograph.nodes["g"].data["u_gaff-1.81"][:,0:3])
# tensor([[0.0672, 0.0610, 0.0583]])

>>> print(ds[5].heterograph.nodes["g"].data['u_openff-1.2.0'][:,0:3])
# tensor([[0.1153, 0.1101, 0.1089]])
jchodera commented 2 years ago

Reopening this issue to make sure we address the most recent comment!

  1. @yuanqing-wang : Can you be sure to document the units for the saved gen2 data?
  2. @yuanqing-wang : It would be good to document the units here too, as well as which part of the potential function is represented here---is it just the valence and Coulomb components?