TUCAN-nest / TUCAN

A molecular identifier and descriptor for all domains of chemistry.
https://tucan-nest.github.io
GNU General Public License v3.0
22 stars 5 forks source link

Web-based demo version #67

Open JanCBrammer opened 1 year ago

JanCBrammer commented 1 year ago

Having a website (e.g., SPA) for demo purposes would increase accessibility a lot. Currently, the accessibility of TUCAN is too limited, since the package needs to be installed in a local Python environment.

flange-ipb commented 1 year ago

My investigations so far: TUCAN cannot be used via the PyScript/Pyodide ecosystem at the moment. Pyodide requires architecture-independent ("pure") wheels, which are not present for the igraph dependency. TUCAN itself can be built as pure wheel. The ecosystem is still heavily under development, so I expect future improvements.

@rapodaca has written several blog articles and GH-repos about porting the InChI C-library to JavaScript and WebAssembly:

Thus, I'll try to convert TUCAN to C as a next step (maybe with Cython) ...

rapodaca commented 1 year ago

A possibly less suitable option would be to run TUCAN on a server and expose the interface through a browser.

See also Running InChI Anywhere with WebAssembly for an overview of efforts around InChI and Wasm.

schatzsc commented 1 year ago

Frank - what is still missing for the web demo? Checked the last version with Ketcher as editor and it looked pretty much complete to me ...

flange-ipb commented 1 year ago

The feedback from our project team was quite positive, so I just moved the demo website to https://tucan-nest.github.io by renaming the repository.

This covers the following use-cases:

Other use-cases that will be added as soon as TUCAN supports it:

schatzsc commented 1 year ago

So can we promote this in the public now or you would like to wait for some more time for feedback?

flange-ipb commented 1 year ago

Yes, it can be promoted.

flange-ipb commented 1 year ago

A few words on the technical background: The application uses Pyodide as Python runtime environment in the browser. The key improvement of Pyodide a few weeks ago was that it can load Python wheels built specifically for the WebAssembly (WASM) platform and not just pure wheels.

igraph wheel: igraph is the only dependency of TUCAN that includes C code and needs to be ported to the WASM platform. This was done with great support from igraph's developers. At the moment, the build of igraph-0.9.11-cp310-cp310-emscripten_3_1_14_wasm32.whl isn't automated and there are no package repositories that accept WASM wheels, so I deposited the file in the repo.

TUCAN wheel: As mentioned earlier, the TUCAN library can be built as pure wheel. tucan-0.1.0-py2.py3-none-any.whl is also deposited in the the repo.

Glue code: Most of the logic (initialization, DOM-manipulation, events) is implemented with JavaScript and Python is only invoked when converting Molfile to TUCAN. The molfile_to_tucan function is returned as function pointer in the last line of tucanweb.py and becomes accessible as proxy object molfile_to_tucan in JavaScript. This pattern should be extensible by returning a tuple of function pointers.

Chemical structure editor: We decided to switch from OpenChemLib JS to Ketcher2 for better usability. Using Ketcher poses a few challenges:

Implicit hydrogens: Like most structure editors Ketcher does not offer any functionality to add implicit hydrogens in the exported Molfiles. At the moment this is done using one of OpenChemLib's functionalities.

schatzsc commented 1 year ago

BTW, just the repository or also the demo website?

flange-ipb commented 1 year ago

BTW, just the repository or also the demo website?

Do you mean what should be promoted? I think just the URL https://tucan-nest.github.io. People interested in the website's source code will be clever enough to find the repository because GitHub pages always follows the same pattern.

schatzsc commented 1 year ago

Ok, it was not clear to me that the running demo is also included there. Then I will start to promote the link that you included above on social media later today.

JanCBrammer commented 1 year ago

@flange-ipb, can we close this issue?

flange-ipb commented 1 year ago

Not yet. The features related to TUCAN-to-Molfile conversion are still missing and we may need some discussions on the web design. graph_to_molfile is the last piece from the TUCAN library required for this and I'll PR it in a few minutes.

schatzsc commented 1 year ago

Just found another problem - in some cases the coordinate generation results in very small (< 10e-7) values which makes little sense. Therefore, I would like to suggest to set coordinate values to zero if initial values is below a certain threshold, best < 0.0001 since most experimental and theoretical methods cannot provide more accurate values.

tc
flange-ipb commented 1 year ago

Thanks for pointing this out. Indeed, I incorrectly use the format .6g to write floating-point numbers in the Molfile writer, which switches to between fixed-point and scientific notation depending on the magnitude. Using .6f gives:

M  V30 1 H 0.866027 -0.500000 0.000000 0
M  V30 2 H -0.000000 1.000000 0.000000 0
M  V30 3 H -0.866027 -0.500000 0.000000 0
M  V30 4 O -0.000000 -0.000001 0.000000 0

It adds trailing zeros (you can also see this in Molfiles exported by ChemDraw). The "negative zero" is a bit flawed.

schatzsc commented 1 year ago

I think that does not matter that much.

rapodaca commented 1 year ago

The spec is silent on the issue of number format for coordinates. It doesn't follow the "Fortran format" convention for decimal values in the header. Still, no example in any documentation I've founded uses exponential notation. So the V3K library I'm working on Trey, will report an error if given exponential notation. Other tools may do the same.

schatzsc commented 1 year ago

I'm not that familiar with the code that Frank Lange (flange-ipb) has contributed but the calc_coordinates are derived in cases where no coordinates are initially available, as in TUCAN-to-molfile conversion. This is due to the fact that the TUCAN string by default does not include any coordinates, only reflects the topology, not exact 2D/3D arrangement of nodes/atoms (although one might define coordinates as "custom node attributes" but that's not part of the main implementation).

Might come from ketcher editor?

Still, even if it is just for more pleasant reading by humans, I'd prefer not to have the exponential notation. Normal high-quality low-temperature X-ray diffraction gives coordinates to three relevant digits behind the comma. DFT will provide more digits but since they cannot be calibrated against experimental data of that accuracy are more or less useless.

The .6f notation that Frank has now switched to is already total overkill for experimentalists. I hope the leading zero does not cause any problems. Python will interpret -0 as int and -0.0 as float with abs(-0.0) converting to 0.0

coords

flange-ipb commented 1 year ago

Might come from ketcher editor?

No, in case calc_coordinates is used, all coordinates are overwritten by those generated by the Kamada-Kawai layout algorithm.

It's interesting to see how others implemented coordinate serialization:

schatzsc commented 1 year ago

Thank you very much for checking how rdkit and CDK do this.

Could you please also link to the part of the TUCAN code where the Kamada-Kawai coordinates are collected and then passed on to molfile generation?

In visualization.py there is def _draw_networkx_graph_3d(m, highlight, labels, fig, col): which uses coords = nx.kamada_kawai_layout(m, dim=3) to set the coordinates but I don't find the molfile writer!?!

flange-ipb commented 1 year ago

That's not yet merged. See my pull request.

schatzsc commented 1 year ago

That's too many changes - I better leave it to @JanCBrammer to approve and merge ;-)

But now I got it:

coords = nx.kamada_kawai_layout(graph, dim=2)

Really just directly coming from the NetworkX Kamada-Kawai routine.

In case somebody is interested in the original paper, it is here:

https://dx.doi.org/10.1016/0020-0190(89)90102-6

T. Kamada, S. Kawai, An algorithm for drawing general undirected graphs, Information Processing Letters, 1989, 31, 7-15

schatzsc commented 1 year ago

In the web demo, when using Convert TUCAN to structure or Convert TUCAN to molfile the following input generates an error This is not a canonical TUCAN string. which is a correct statement:

C6H14N2/(22-20)(21-19)(20-19)(19-18)(18-17)(17-16)(16-15)(15-20)(1-22)(2-22)(3-21)(4-21)(5-20)(6-21)(7-18)(8-18)(9-17)(10-17)(11-16)(12-16)(13-15)(14-15)

Still, it would be possible to convert this to a graph and then generate a canonical TUCAN from it for further processing. Is it intended behaviour here to terminate with an error message or shouldn't there at least also be a function Convert TUCAN to canonical TUCAN? The only requirement for that is that the node labels are not out of range.

flange-ipb commented 1 year ago

Good point! I already calculate the canonical TUCAN for comparison with the TUCAN in the input field, so it could be shown in the yellow warning box:

Screenshot

schatzsc commented 1 year ago

Since this functionality is already around - do you think it would be useful to have a separate "generic TUCAN to canonical TUCAN" option, where you can type in any TUCAN and then get the canonical one out? Such a string-to-string conversion is currently missing ...