connorcoley / rexgen_direct

Template-free prediction of organic reaction outcomes
GNU General Public License v3.0

value overwrite #20

Open LanceKnight opened 3 years ago

LanceKnight commented 3 years ago

Hi, I don't quite understand the following code excerpt from get_bin_feature(r, max_natoms) in ioutils_direct.py. From the import of mol_graph.py, bond_fdim = 6, so f[1:1+bond_fdim] is f[1:7], which is then partially overwritten by the assignments to f[-4] and f[-3], so I'm not sure what this is doing. Also, f[-4] and f[-3] seem to check whether two atoms are in the same molecule. I'm not sure why we need both f[-4] and f[-3] for this check. Isn't one of them enough?

```python
for i in range(max_natoms):
    for j in range(max_natoms):
        f = np.zeros((binary_fdim,))  # binary_fdim = 4 + bond_fdim, which equals 10 in this case
        if i >= n_atoms or j >= n_atoms or i == j:
            features.append(f)
            continue
        if (i, j) in bond_map:
            bond = bond_map[(i, j)]
            f[1:1 + bond_fdim] = bond_features(bond)
        else:
            f[0] = 1.0
        f[-4] = 1.0 if comp[i] != comp[j] else 0.0
        f[-3] = 1.0 if comp[i] == comp[j] else 0.0
        f[-2] = 1.0 if n_comp == 1 else 0.0
        f[-1] = 1.0 if n_comp > 1 else 0.0
        features.append(f)
```
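To see the index overlap concretely, here is a minimal sketch (not the project's code) assuming bond_fdim = 6 and binary_fdim = 10 as described above:

```python
import numpy as np

bond_fdim = 6
binary_fdim = 4 + bond_fdim  # = 10

f = np.zeros((binary_fdim,))
# The bond features occupy indices 1..6 (the slice 1:7).
f[1:1 + bond_fdim] = 1.0     # pretend every bond feature fired

# f[-4] is index 10 - 4 = 6, which lies inside the slice 1:7,
# so the last bond feature is overwritten by the component indicator.
f[-4] = 0.0
```

After this, `f[6]` holds the component indicator rather than the sixth bond feature, which is the overwrite being asked about.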
connorcoley commented 3 years ago

There is a small indexing bug that I've left in so the implementation matches what was used in the paper. See https://github.com/connorcoley/rexgen_direct/issues/17#issuecomment-738506026

f[-4] and f[-3] carry the same information; it's just a one hot encoding of a boolean. I suspect it makes almost no practical difference in terms of model training and performance. Using one should be sufficient!
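As a small illustration that the two flags are complementary by construction (a hypothetical sketch with a made-up `comp` array, not the project's code):

```python
import numpy as np

comp = [0, 0, 1]  # hypothetical component id per atom
i, j = 0, 2       # atoms in different components

# Two-slot one-hot encoding, as in the code above
pair = np.array([1.0 if comp[i] != comp[j] else 0.0,
                 1.0 if comp[i] == comp[j] else 0.0])

# A single bit carries the same information
same_mol = 1.0 if comp[i] == comp[j] else 0.0
```

The two slots always sum to 1 for valid atom pairs, so either one alone determines the other.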

LanceKnight commented 3 years ago

I see. Would you mind pointing out which part of the paper it is meant to match? I checked the main text and the supplement's description of the attention mechanism, but I didn't find anything related to get_bin_feature(r, max_natoms) other than a formula that gives the attention score. (The docstring of get_bin_feature(r, max_natoms) says it is used in the attention mechanism.)

Speaking of the paper, it seems there are duplicate 24's among the atom labels in Fig. 1B, and I guess one of them should be 34. Also, in Fig. 1C, shouldn't the dimension of the new bond order be 5 according to the code?

Sorry, I'm a new student and I've never published in a journal, so I'm not sure whether I should point these out, since they're very trivial things anyway. It is a great paper by all means.

connorcoley commented 3 years ago

The precise details of the implementation (i.e., the exact indexing of the binary features) weren't described in the text. By wanting it to match, I meant rather that I didn't want the quantitative results to change at all (e.g., even by 0.01%), since then the code would no longer match what I ran for the experiments.

You're right that in the schematic in Figure 1B, one of the atom labels should be 34. Nobody, including myself, had noticed that! In the dataset, if my memory serves me correctly, no new aromatic bonds are created. The code is implemented slightly more generally, however, which gives it an extra dimension there.
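That extra dimension can be sketched as follows, assuming the new bond order takes one of five values: broken (0), single, double, triple, or aromatic (1.5). The value set here is an illustrative assumption, not taken from the paper:

```python
import numpy as np

# Hypothetical set of new bond orders; aromatic (1.5) is the extra
# dimension that makes the vector length 5 instead of 4.
BOND_ORDERS = [0.0, 1.0, 2.0, 3.0, 1.5]

def one_hot_bond_order(order):
    """One-hot encode a bond order over the 5 possible values."""
    v = np.zeros(len(BOND_ORDERS))
    v[BOND_ORDERS.index(order)] = 1.0
    return v
```

If the dataset never creates new aromatic bonds, that fifth slot simply stays zero in practice.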

There's no harm in pointing these things out! The indexing of f[-4] and f[-3] is certainly a good catch, and it shows that you're digging deep into the implementation details and doing a great job following the logic in the code. Figure schematics/cartoons like Figure 1 are less important than the technical correctness of the methods and results, but no need to apologize. I wish I had caught these things myself :)

LanceKnight commented 3 years ago

Thank you professor for the answer and encouraging words :)

Another thing I noticed: in Fig. 2, the legend says the "cross in a circle" means outer product, but in the code, the operation looks more like an element-wise multiplication?
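For context, the two operations differ in shape as well as in content. A quick NumPy illustration (not the project's TensorFlow code):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

outer = np.outer(a, b)   # shape (3, 3): every pairwise product a_i * b_j
elementwise = a * b      # shape (3,): position-wise products only
```

The element-wise product corresponds to the diagonal of the outer product, so the two are easy to confuse in a schematic but are different operations.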