TUCAN-nest / TUCAN

A molecular identifier and descriptor for all domains of chemistry.
https://tucan-nest.github.io
GNU General Public License v3.0

Decide on use of molfile v2000 vs. v3000 as input/output file format #15

Closed schatzsc closed 3 years ago

schatzsc commented 3 years ago

Currently, we seem to be using a mix of the molfile v2000 and v3000 definitions because the molfiles were written by hand.

For further use and an expanded test set, we need to settle on one of the two formats.
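To find out which of the existing hand-written files follow which definition, a quick check of the version tag on the counts line (the fourth line of a molfile) could help. The following is only a sketch: the function name `detect_molfile_version` is made up for illustration and is not part of the current code, and it assumes the three header lines are present, which may not hold for every hand-written file.

```python
def detect_molfile_version(molfile_text):
    """Return 'V2000', 'V3000', or None if no version tag is found.

    Assumption: the file has the usual three header lines, so the
    counts line (which carries the version tag) is line 4.
    """
    lines = molfile_text.splitlines()
    if len(lines) < 4:
        return None
    counts_line = lines[3].upper()
    for version in ("V3000", "V2000"):
        if version in counts_line:
            return version
    return None
```

Files for which this returns `None` would be candidates for a manual look, since they are either missing header lines or lack a version tag.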

Since the v2000 molfile is restricted to a maximum of 255 (or 999) atoms and uses FORTRAN-style fixed-width columns for input, it might be inconvenient and restrictive. Therefore, I'd prefer to switch to the v3000 molfile exclusively, as it also allows better definition of the properties beyond atom type and "connectivity matrix".
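To illustrate the difference, here is a rough sketch of reading the atom and bond counts in both formats. In v2000 the counts sit in two three-character columns, which is where the 999 limit comes from; in v3000 they are whitespace-separated tokens on the `M  V30 COUNTS` line. The function names are hypothetical and not taken from our code.

```python
def parse_v2000_counts(counts_line):
    """Read atom and bond counts from fixed-width columns 1-3 and 4-6."""
    n_atoms = int(counts_line[0:3])
    n_bonds = int(counts_line[3:6])
    return n_atoms, n_bonds


def parse_v3000_counts(counts_line):
    """Read atom and bond counts from a free-format 'M  V30 COUNTS' line."""
    tokens = counts_line.split()
    if tokens[:3] != ["M", "V30", "COUNTS"]:
        raise ValueError("not a V3000 counts line: %r" % counts_line)
    return int(tokens[3]), int(tokens[4])
```

Note that a v2000 counts line such as `120115...` has no whitespace between the two numbers, so splitting on whitespace would fail there, whereas the v3000 line has no fixed column width and can hold counts above 999.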

OpenBabel seems to be able to convert both:

But we should consult with Gerd on this issue first.

schatzsc commented 3 years ago

Did you stay in the NFDI4Chem TA2 meeting on 08.09.2021 long enough for my discussion with Nicole Jung on the molfile format used within Chemotion? She wrote to me again afterwards: "Regarding Marvin: I had already mentioned it, but to be safe, here once more briefly: until the license question is resolved, we cannot save Molfile V3000."

schatzsc commented 3 years ago

Agreed on 21.09. to use molfile v3000 for all further work

Need to adjust

def create_molecule_array(molfile_lines)

to reflect the changes (or, better, read directly into an adjacency matrix, node feature matrix, and edge feature matrix). I will provide a v3000 molfile as an example. Or should we create a new test file folder for v2000 vs. v3000?

Given the molfile structure, it would be easiest to create the node feature matrix first, since the atoms are the first block of the molfile, and then generate the adjacency matrix and edge feature matrix more or less together: the adjacency matrix just sets 0 or 1 depending on whether an edge exists, while the edge feature matrix further specifies its properties. If we don't include further properties for the moment, the two will be more or less identical, but it is still better to implement both now.
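A minimal sketch of this reading order, under several assumptions: the function names are hypothetical (this is not the existing `create_molecule_array`), the only node feature kept is the element symbol, the only edge feature kept is the bond type, plain nested lists stand in for matrices, and v3000 continuation lines (ending in `-`) are not handled.

```python
def parse_v3000_blocks(molfile_lines):
    """Collect (atom_id, symbol) and (atom1, atom2, bond_type) tuples."""
    atoms, bonds = [], []
    block = None
    for line in molfile_lines:
        tokens = line.split()
        if tokens[:2] != ["M", "V30"] or len(tokens) < 3:
            continue  # header lines, old-style counts line, 'M  END'
        tokens = tokens[2:]
        if tokens[0] == "BEGIN" and len(tokens) > 1:
            block = tokens[1]
        elif tokens[0] == "END":
            block = None
        elif block == "ATOM":
            # atom line: index type x y z aamap [key=value ...]
            atoms.append((int(tokens[0]), tokens[1]))
        elif block == "BOND":
            # bond line: index type atom1 atom2 [key=value ...]
            bonds.append((int(tokens[2]), int(tokens[3]), int(tokens[1])))
    return atoms, bonds


def build_matrices(atoms, bonds):
    """Build node feature, adjacency, and edge feature matrices."""
    row_of = {atom_id: row for row, (atom_id, _) in enumerate(atoms)}
    n = len(atoms)
    node_features = [[symbol] for _, symbol in atoms]
    adjacency = [[0] * n for _ in range(n)]
    edge_features = [[0] * n for _ in range(n)]
    for atom1, atom2, bond_type in bonds:
        i, j = row_of[atom1], row_of[atom2]
        adjacency[i][j] = adjacency[j][i] = 1
        edge_features[i][j] = edge_features[j][i] = bond_type
    return node_features, adjacency, edge_features


# Hand-written formaldehyde (H2C=O) as a small v3000 example.
EXAMPLE_V3000 = """formaldehyde
  hand-written

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 4 3 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0.0 0.0 0.0 0
M  V30 2 O 1.2 0.0 0.0 0
M  V30 3 H -0.6 0.9 0.0 0
M  V30 4 H -0.6 -0.9 0.0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 2
M  V30 2 1 1 3
M  V30 3 1 1 4
M  V30 END BOND
M  V30 END CTAB
M  END
"""

atoms, bonds = parse_v3000_blocks(EXAMPLE_V3000.splitlines())
node_features, adjacency, edge_features = build_matrices(atoms, bonds)
```

With only the bond type as edge feature, the edge feature matrix differs from the adjacency matrix just at the C=O entry, which matches the expectation that the two are nearly identical until more properties are added.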