Lowpassfilter closed this issue 5 years ago
Hi there,
I think 'logwrite' is just some debug information, so you can safely ignore that file. mol_lib.py converts the raw text data into binary features for fast loading.
For harvard_cep, all you need is a list of [SMILES property_value] pairs. See the data folder in the dropbox for more information:
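As a rough illustration of such a list, here is a minimal sketch that parses records of the form "SMILES property_value". The one-pair-per-line, whitespace-separated layout, the helper name `parse_smiles_records`, and the sample values are all assumptions made for illustration; check the files in the data folder for the layout the project actually expects.

```python
def parse_smiles_records(lines):
    """Parse lines of the ASSUMED form '<SMILES> <property_value>'."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace: SMILES first, value second.
        smiles, value = line.rsplit(None, 1)
        records.append((smiles, float(value)))
    return records


# Hypothetical sample data, not taken from the harvard_cep dataset.
sample = [
    "CCO 1.25",
    "c1ccccc1 0.40",
]
records = parse_smiles_records(sample)
print(records)
```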
https://www.dropbox.com/sh/eylta6a24fc9xo4/AAANyIgKnq49HB0Ud989JGEZa?dl=0
Dear Hanjun,
Thank you for your reply and your time. I see: the .bin file is the processed data, not logwrite. After running your code, I have some further questions:
The code relies on RDKit, so any SMILES string that RDKit can parse should be fine.
You don't have to understand the binary format. But if it really matters, you can check the C++ part: https://github.com/Hanjun-Dai/pytorch_structure2vec/blob/master/harvard_cep/src/mol_lib.cpp#L143
Maybe you want to tune the hyperparameters a bit? In general, though, the loss is cross entropy, which is not the same objective as accuracy (although the two are definitely correlated).
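To make the difference between the two objectives concrete, here is a small sketch in plain Python. The binary setup, the helper names `cross_entropy` and `accuracy`, and the toy probabilities are assumptions chosen for illustration, not code from the repository. Both prediction sets classify every example correctly, yet the less confident one has a higher cross-entropy loss.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class (binary labels)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p_true = p if y == 1 else 1.0 - p  # probability assigned to the true class
        total += -math.log(p_true)
    return total / len(labels)


def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.9, 0.1]  # predicted P(class 1), far from the threshold
hesitant = [0.6, 0.4, 0.6, 0.4]   # same decisions, closer to the threshold

for name, probs in [("confident", confident), ("hesitant", hesitant)]:
    print(name, accuracy(probs, labels), cross_entropy(probs, labels))
```

So a lower loss does not have to show up as higher accuracy, and the reverse also holds.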
There could be multiple ways to do that. To list some: a) treat it as the feature of the edge; b) do the message passing multiple times; c) just ignore the duplication.
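A minimal sketch of the three options, assuming the duplication refers to the same directed edge appearing more than once in the edge list. The toy graph, the scalar node values, and the reading of (b) as "sum the message once per copy of the edge" are illustrative assumptions, not the project's implementation.

```python
from collections import Counter

# Hypothetical directed edge list (src, dst) with repeated edges.
edges = [(0, 1), (0, 1), (1, 2), (2, 0), (1, 2), (1, 2)]
node_values = [1.0, 2.0, 3.0]  # one scalar "embedding" per node

# (a) Keep each edge once and store its multiplicity as an edge feature.
edge_multiplicity = Counter(edges)

# (b) Keep the duplicates: with sum aggregation, a repeated edge
#     passes its message once per copy.
aggregated = [0.0] * len(node_values)
for src, dst in edges:
    aggregated[dst] += node_values[src]

# (c) Ignore the duplication and pass each message once.
unique_edges = sorted(set(edges))
aggregated_unique = [0.0] * len(node_values)
for src, dst in unique_edges:
    aggregated_unique[dst] += node_values[src]

print(dict(edge_multiplicity))
print(aggregated)
print(aggregated_unique)
```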
Hi Hanjun, I have a question about how to prepare the input data for your project. After running /harvard_cep/mol_lib.py, I got a file named "logwrite". But the data format in the logwrite file is different from the one described in /graph_classification/README.md.
As I understand it, in "logwrite" each drug is converted into the following format:
N M
N lines of numbers
What is the meaning of this format?
Thank you for your time.