MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/
BSD 3-Clause "New" or "Revised" License
148 stars 48 forks source link

Restraint indices out of bounds in Trivalent Nitrogen Set 3 #562

Closed btjanaka closed 4 years ago

btjanaka commented 4 years ago

Describe the bug

I am trying to access the atoms involved in the optimizations in the "OpenFF Trivalent Nitrogen Set 3" dataset. I am able to access the indices, and I am able to access the molecule, but I find that the indices are out of bounds; i.e. the molecule seems to be lacking atoms. For instance, one of the molecules might have restraint indices [0, 13, 2, 1] but the molecule would only have 9 atoms, whereas it would need at least 14 here (due to zero-based indexing).

I have tried converting the SMILES for the molecule to Openey's oechem.OEMol and accessing the atoms in the resulting molecule, but the indices do not seem to match up with those in the qcelemental molecule. I think the atom at the first restraint index should be the trivalent nitrogen, but this is not always the case when using the OEMol.

To Reproduce

The following code prints out the molecules that I have been having issues with.

import numpy as np
import qcportal as ptl
from openeye import oechem, oedepict

# Loading Data
client = ptl.FractalClient()
ds = client.get_collection('GridOptimizationDataset',
                           "OpenFF Trivalent Nitrogen Set 3")
molecule_ids = [ds.get_entry(index).initial_molecule for index in ds.df.index]
molecules = client.query_molecules(molecule_ids)
smiles_strings = [
    ds.get_entry(index).attributes["canonical_explicit_hydrogen_smiles"]
    for index in ds.df.index
]
grid_opt_record_ids = [
    ds.get_entry(index).object_map['default'] for index in ds.df.index
]
grid_opt_records = client.query_procedures(grid_opt_record_ids)

# Looking at the restraint indices in each record, and seeing whether they are
# available in the molecule
for record, molecule, smiles in zip(grid_opt_records, molecules,
                                    smiles_strings):
    restraint_indices = np.array(record.keywords.scans[0].__dict__['indices'])
    num_atoms = len(molecule.symbols)

    # If any restraint index is out of bounds
    if np.any(restraint_indices >= num_atoms):
        print("--------------------------------------")
        print("GridOptRecord id :", record.id)
        print("SMILES           :", smiles)
        print("Restraint indices:", restraint_indices)
        print("Number of atoms  :", num_atoms)

        mol = oechem.OEMol()
        oechem.OESmilesToMol(mol, smiles)
        idx_to_symbol = dict(
            (atom.GetIdx(), oechem.OEGetAtomicSymbol(atom.GetAtomicNum()))
            for atom in mol.GetAtoms())
        print("Restraint atoms in OEMol:",
              [idx_to_symbol[idx] for idx in restraint_indices])

Expected behavior

The restraint indices should correspond to the correct atoms in the molecules from the dataset, and the output for the code above should be empty. Instead, I am getting the following output:

--------------------------------------
GridOptRecord id : 17247212
SMILES           : [H]C(=O)C([H])([H])N([H])C([H])([H])C(=O)[H]
Restraint indices: [ 4 13  2  3]
Number of atoms  : 11
Restraint atoms in OEMol: ['H', 'H', 'O', 'C']
--------------------------------------
GridOptRecord id : 17247224
SMILES           : [H]c1c(nc(c(n1)[H])N(C([H])([H])[H])C([H])([H])[H])[H]
Restraint indices: [8 3 4 5]
Number of atoms  : 7
Restraint atoms in OEMol: ['N', 'N', 'C', 'C']
--------------------------------------
GridOptRecord id : 17247254
SMILES           : [H]c1c(c(c(c(c1[H])[H])N([H])C([H])([H])[H])[H])[H]
Restraint indices: [ 7 16  5  6]
Number of atoms  : 9
Restraint atoms in OEMol: ['H', 'H', 'C', 'C']
--------------------------------------
GridOptRecord id : 17247255
SMILES           : [H]c1c(nc(c(c1N([H])C([H])([H])[H])[H])[H])[H]
Restraint indices: [ 7 15  4  5]
Number of atoms  : 15
Restraint atoms in OEMol: ['N', 'H', 'C', 'C']
--------------------------------------
GridOptRecord id : 17247256
SMILES           : [H]c1c(nnc(c1N([H])C([H])([H])[H])[H])[H]
Restraint indices: [ 7 14  3  4]
Number of atoms  : 12
Restraint atoms in OEMol: ['N', 'H', 'N', 'N']
--------------------------------------
GridOptRecord id : 17247263
SMILES           : [H]c1c(c(c(c(c1[H])[H])N(S(=O)(=O)C([H])([H])[H])S(=O)(=O)C([H])([H])[H])[H])[H]
Restraint indices: [ 8  5 13 14]
Number of atoms  : 12
Restraint atoms in OEMol: ['H', 'C', 'C', 'H']
--------------------------------------
GridOptRecord id : 17247278
SMILES           : [H]n1c(nnn1)N([H])[H]
Restraint indices: [5 7 8 0]
Number of atoms  : 6
Restraint atoms in OEMol: ['N', 'H', 'H', 'H']
--------------------------------------
GridOptRecord id : 17247279
SMILES           : [H]n1c(nnn1)N(c2nnnn2[H])c3nnnn3[H]
Restraint indices: [15  0  1  2]
Number of atoms  : 5
Restraint atoms in OEMol: ['N', 'H', 'N', 'C']
--------------------------------------
GridOptRecord id : 17247280
SMILES           : [H]n1c(nnn1)N([H])c2nnnn2[H]
Restraint indices: [10 13  0  1]
Number of atoms  : 9
Restraint atoms in OEMol: ['N', 'H', 'H', 'N']
--------------------------------------
GridOptRecord id : 17247282
SMILES           : [H]c1c(c(nc(n1)[H])[H])N(c2c(nc(nc2[H])[H])[H])c3c(nc(nc3[H])[H])[H]
Restraint indices: [18  9 10 11]
Number of atoms  : 14
Restraint atoms in OEMol: ['H', 'N', 'C', 'C']
--------------------------------------
GridOptRecord id : 17247283
SMILES           : [H]c1c(c(nc(n1)[H])[H])N([H])c2c(nc(nc2[H])[H])[H]
Restraint indices: [12 19  6  7]
Number of atoms  : 12
Restraint atoms in OEMol: ['C', 'H', 'N', 'H']
--------------------------------------
GridOptRecord id : 17247286
SMILES           : [H]c1c(n(c(c1N([H])c2c(c(n(c2[H])[H])[H])[H])[H])[H])[H]
Restraint indices: [10 19  6  7]
Number of atoms  : 12
Restraint atoms in OEMol: ['C', 'H', 'N', 'H']

Additional context

Python 3.7.3, Ubuntu 18.04

Library versions

mattwelborn commented 4 years ago

Thank you for your detailed report. This might be related to #435. @dgasmith is the expert on this, but he's traveling today.

I would suggest updating to the latest qcportal (0.13.0), and seeing if that resolves the issue.

dgasmith commented 4 years ago

Taking your first example it appears to be correct:

>>> proc = client.query_procedures(id=17247212)[0]
>>> proc.keywords.scans[0].indices
[4, 13, 2, 3]
>>> client.query_molecules(id=proc.initial_molecule)[0].geometry.shape
(14, 3)

Please note that query order isn't guaranteed which is where items are going wrong in your code I believe. I highly recommend using queries like ds.query("default") which handle the ordering gracefully and correctly.

btjanaka commented 4 years ago

Thanks for the quick response from both of you! I am wondering, what is ds.query("default") supposed to do? When I use it, all I get back is the string default. I have read the docstring for the method, and this seems to make sense, but I am not sure what I should be doing with this output.

dgasmith commented 4 years ago

We should probably change that, but ds.query('default') populates the underlying ds.df DataFrame so that get_entry 1) becomes very fast and 2) correctly maps the entries to a record.

We should change the return to return df[spec.name] here.

@btjanaka Could you make a PR if you think that would make it more straightforward?

btjanaka commented 4 years ago

@dgasmith will do, thanks for the help!

mattwelborn commented 4 years ago

Closed by #565.

btjanaka commented 4 years ago

As a followup, everything worked out when I tried again with ds.query. Here is the updated code:

import numpy as np
import qcportal as ptl
from openeye import oechem, oedepict
from tqdm.notebook import tqdm

# Loading Data
client = ptl.FractalClient()
ds = client.get_collection('GridOptimizationDataset',
                           "OpenFF Trivalent Nitrogen Set 3")
ds.query("default")

# Looking at the restraint indices in each record, and seeing whether they are
# available in the molecule
for index in tqdm(ds.df.index):
    ds_entry = ds.get_entry(index)
    record = client.query_procedures(ds_entry.object_map['default'])[0]
    molecule = client.query_molecules(ds_entry.initial_molecule)[0]
    smiles = ds_entry.attributes["canonical_explicit_hydrogen_smiles"]
    restraint_indices = np.array(record.keywords.scans[0].__dict__['indices'])
    num_atoms = len(molecule.symbols)

    # If any restraint index is out of bounds
    if np.any(restraint_indices >= num_atoms):
        print("--------------------------------------")
        print("GridOptRecord id :", record.id)
        print("SMILES           :", smiles)
        print("Restraint indices:", restraint_indices)
        print("Number of atoms  :", num_atoms)

        mol = oechem.OEMol()
        oechem.OESmilesToMol(mol, smiles)
        idx_to_symbol = dict(
            (atom.GetIdx(), oechem.OEGetAtomicSymbol(atom.GetAtomicNum()))
            for atom in mol.GetAtoms())
        print("Restraint atoms in OEMol:",
              [idx_to_symbol[idx] for idx in restraint_indices])
dgasmith commented 4 years ago

Awesome, that looks much more in line with what I hope our API can deliver.

Although, I highly recommend something like ds.get_record(entry_name, 'default'). Currently you will query the records twice with your upper level ds.query call.