ardigen / MAT

The official implementation of the Molecule Attention Transformer.

Question re: pretrained weights #11

Closed: sooheon closed this issue 4 years ago

sooheon commented 4 years ago

I would like to confirm that the pretrained weights linked in the README come only from the "masked input node" prediction pretraining, and are not the weights of a final, task-trained MAT. I assume this is the case because the loading code skips the generator weights (which would differ for each task).
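For reference, the loading pattern I'm referring to looks roughly like this (a minimal sketch along the lines of the README snippet; the function name and path argument are placeholders):

```python
import torch

def load_pretrained_encoder(model: torch.nn.Module, path: str) -> None:
    """Copy pretrained weights into `model`, skipping the task-specific generator head."""
    pretrained_state_dict = torch.load(path, map_location="cpu")
    model_state_dict = model.state_dict()
    for name, param in pretrained_state_dict.items():
        if "generator" in name:
            # The generator head is task-specific, so its weights are not loaded.
            continue
        if isinstance(param, torch.nn.Parameter):
            param = param.data
        model_state_dict[name].copy_(param)
```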

When you transfer to a specific downstream task, do you do any freezing and gradual unfreezing of the encoder weights, or do you just train the whole network right away?

sooheon commented 4 years ago

I see this is the case: the pretrained generator weights have 28 outputs, one for each atomic feature.

Are there any plans to open-source the node-masking pretraining code in the future? I'm curious how (and whether) you adapted it to work with the transformer, as opposed to GNNs.

Mazzza commented 4 years ago

Hello,

We use MAT weights obtained from the "masked input node" prediction pretraining. During fine-tuning we do not freeze any weights of the network. All layers of MAT are trained.
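In practice this just means handing every parameter to the optimizer, along these lines (an illustrative sketch only, not the exact training script; the optimizer choice and learning rate are placeholders):

```python
import torch

def make_finetune_optimizer(model: torch.nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    # Nothing is frozen: embeddings, encoder layers, and the generator head
    # all receive gradients and are updated during fine-tuning.
    for param in model.parameters():
        param.requires_grad = True
    return torch.optim.Adam(model.parameters(), lr=lr)
```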

We are currently working on various methods of graph pre-training, and we plan to release the code when we finish.

Mazzza commented 4 years ago

Closing for now. Please reopen if you have any other questions.

sooheon commented 4 years ago

I've been thinking more about the pretraining methods. Node masking is analogous to the BERT-style Cloze task and is straightforward (see the sketch below). I'm having more difficulty understanding how edge masking would work.
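For concreteness, by node masking I mean something along these lines (just a sketch of my understanding, not code from this repo; the zero "mask token" and the ratio are placeholders):

```python
import torch

def mask_nodes(node_features: torch.Tensor, mask_ratio: float = 0.15):
    """BERT-style Cloze masking over atoms.

    node_features: (n_atoms, 28) matrix of atom features
    (28 matching the generator output size mentioned above).
    Returns the masked input, the boolean mask, and the original targets.
    """
    n_atoms = node_features.size(0)
    mask = torch.rand(n_atoms) < mask_ratio   # pick which atoms to hide
    masked = node_features.clone()
    masked[mask] = 0.0                        # stand-in mask token: zero out the row
    return masked, mask, node_features
```

The pretraining loss would then be a reconstruction loss restricted to the masked rows, e.g. `torch.nn.functional.mse_loss(pred[mask], target[mask])`.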

After reading the graph pretraining paper, I'm thinking something like:

Actually, it seems that if you fully mask the distance matrix, the task essentially becomes predicting the molecular conformation.
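Concretely, what I have in mind for the distance matrix is something like this (again only a sketch of the idea, nothing from the repo):

```python
import torch

def mask_distances(dist: torch.Tensor, mask_ratio: float = 0.15):
    """Hide a random, symmetric subset of pairwise distances for the model to predict.

    dist: (n_atoms, n_atoms) symmetric distance matrix. With mask_ratio=1.0 this
    degenerates into predicting the full conformation from the graph alone.
    """
    n = dist.size(0)
    upper = torch.triu(torch.rand(n, n) < mask_ratio, diagonal=1)
    mask = upper | upper.t()      # mirror so the matrix stays symmetric
    masked = dist.clone()
    masked[mask] = 0.0            # masked entries zeroed out
    return masked, mask, dist
```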

How are you guys approaching this?