data cooking question - GitHub issues

Lowpassfilter commented 5 years ago

Hi Hanjun, I have a question about how to prepare the input data for your project. After running /harvard_cep/mol_lib.py, I got a file named "logwrite", but its data format is different from the one described in /graph_classification/README.md.

As I understand it, in "logwrite" each drug is converted into the following format:

N M
N lines of numbers

What is the meaning of this format?

Thank you for your time.

Hanjun-Dai commented 5 years ago

Hi there,

I think 'logwrite' is just some debug information. You can safely ignore this file. mol_lib.py converts the raw text data into binary features for fast loading.

For harvard_cep, all you need is just a list of [SMILES property_value]. See the data folder in the dropbox for more information:

https://www.dropbox.com/sh/eylta6a24fc9xo4/AAANyIgKnq49HB0Ud989JGEZa?dl=0
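To make the [SMILES property_value] layout concrete, here is a minimal sketch of reading such a list in Python. It assumes one record per line with the SMILES string and the value separated by whitespace; that separator, the function name `read_smiles_table`, and the sample values are illustrative assumptions, not taken from the repository or the Dropbox data.

```python
from typing import Iterable, List, Tuple


def read_smiles_table(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse lines of the assumed form '<SMILES> <property_value>'."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace: SMILES on the left, value on the right.
        smiles, value = line.rsplit(None, 1)
        records.append((smiles, float(value)))
    return records


# Made-up sample records, for illustration only.
sample = ["c1ccccc1 1.5", "", "CCO 0.25"]
records = read_smiles_table(sample)
print(records)
```

In practice the lines would come from an open file handle rather than a hard-coded list.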

Lowpassfilter commented 5 years ago

Dear Hanjun,

Thank you for your reply and your time. I see: the .bin file is the cooked data, not logwrite. After running your code, I have some further questions:

1. What is the standard for the SMILES format? For example, there are canonical SMILES and isomeric SMILES. Which one do you use, or are both OK?
2. Is there any specification for the format of the .bin file? I have checked your code, but it seems to call a LoadMolGraph function from the dll directly.
3. When running /graph_classification/run.sh, I found that the loss on the test set decreases at first but then increases to above 3, while the accuracy on the test set increases monotonically. How can accuracy and loss increase at the same time?
4. (This one may involve a conflict of interest; my apologies if it relates to your recent research.) When doing graph embedding, what if there is more than one edge between two vertices? Your paper seems to assume that there is either one edge or no edge between two vertices.

Hanjun-Dai commented 5 years ago

1. The code relies on RDKit, so any SMILES that can be parsed by RDKit should be fine.
2. You don't have to understand the binary format. But if it really matters, you can check the C++ part: https://github.com/Hanjun-Dai/pytorch_structure2vec/blob/master/harvard_cep/src/mol_lib.cpp#L143
3. Maybe you want to tune the hyper-params a bit? In general, though, the loss is cross entropy, which is not the same objective as accuracy (although the two are definitely correlated).
4. There are multiple ways to handle that. To list some: a) treat the multiplicity as a feature of the edge; b) do the message passing multiple times; c) just ignore the duplication.
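The point about cross entropy versus accuracy can be seen in a small numeric sketch. The probabilities below are hypothetical, not output from run.sh. They only show that a model can classify more samples correctly while a single confidently wrong prediction drives the mean cross-entropy up.

```python
import math


def cross_entropy(true_class_probs):
    """Mean negative log-likelihood of the true class."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)


def accuracy(true_class_probs):
    """Binary case: a sample is correct when its true-class probability exceeds 0.5."""
    return sum(p > 0.5 for p in true_class_probs) / len(true_class_probs)


# Hypothetical probability assigned to the true class for four test samples.
early = [0.6, 0.6, 0.4, 0.4]       # hesitant model: 2 of 4 correct
late = [0.99, 0.99, 0.99, 0.001]   # confident model: 3 of 4 correct, 1 confidently wrong

for name, probs in (("early", early), ("late", late)):
    print(name, "accuracy:", accuracy(probs), "loss:", round(cross_entropy(probs), 3))
```

Accuracy only counts which side of the decision threshold each prediction falls on, whereas cross entropy also penalizes how confident a wrong prediction is, so the two can move in opposite directions as a model becomes overconfident.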

Hanjun-Dai / pytorch_structure2vec

data cooking question #15