Becksteinlab / MDPOW

Calculation of water/solvent partition coefficients with Gromacs.
https://mdpow.readthedocs.io
GNU General Public License v3.0
25 stars 10 forks source link

Dihedral Plots: RDKit Mol Object #243

Closed cadeduckworth closed 1 year ago

cadeduckworth commented 1 year ago
codecov[bot] commented 1 year ago

Codecov Report

Merging #243 (4af6de6) into develop (3b93aad) will increase coverage by 0.38%. The diff coverage is 99.02%.

@@             Coverage Diff             @@
##           develop     #243      +/-   ##
===========================================
+ Coverage    80.73%   81.12%   +0.38%     
===========================================
  Files           15       15              
  Lines         1905     1960      +55     
  Branches       294      296       +2     
===========================================
+ Hits          1538     1590      +52     
- Misses         276      278       +2     
- Partials        91       92       +1     
Impacted Files Coverage Ξ”
mdpow/workflows/dihedrals.py 95.93% <99.01%> (+1.05%) :arrow_up:
mdpow/workflows/base.py 77.55% <100.00%> (-4.09%) :arrow_down:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

cadeduckworth commented 1 year ago

@orbeckst I think I fixed the bond highlighting issue and image size issue. I incorporated the new algorithm/methods with the 'low res' plotting functionality so that there is a functioning base to improve upon.

Currently going back and looking at our previous conversations and the new links you sent to improve upon the existing methods..

cadeduckworth commented 1 year ago

image

orbeckst commented 1 year ago

That looks better. Very good. Did you check other examples, too?

The other comments were along the lines

cadeduckworth commented 1 year ago

I checked the plots for that molecule that I have been testing on, but no others yet.

I changed a lot of things that will be helpful moving forward, especially for testing. The information now used for Mol objects almost all come directly from atom_indices, which uses the RDKit conversion method that is tested and verified from PR#217.

For each dihedral group in a molecule the atom_indices and bond_indices are saved in a dictionary, where the key is in the format of 'C1-C2-C3-C4' to correspond to MDAnalysis selection style, so that the results DataFrame can be the source that calls the associated plotting graphics info. That way, the DF determines which atom and bond indices are highlighted, rather than an iterative method.

EDIT (reminder): @orbeckst - relevant to discussion from our meeting on 6/29/23.. I did not remember because it was several months ago, but there was a method to the madness all along!

cadeduckworth commented 1 year ago

@orbeckst

cadeduckworth commented 1 year ago

atom_indices added to label Mol object on plots

To-Do:

cadeduckworth commented 1 year ago

image

cadeduckworth commented 1 year ago

cleaner plot titles

image

cadeduckworth commented 1 year ago

This is where I have landed with the combined RDKIT/matplotlib to SVG composition. (SVG converted to PDF attached below)

With the number of atoms in the SAMPL9 molecules, and the small text size of the atom indices, the SVG plots will look much better for those purposes. It should also maintain a consistent box size for all of the molecules, so the orientation or overall size of the molecule should not cause the image to zoom in or out like the MolToImage RDKit PDF output method.

Regarding the PDF output method, I figured out that, because the default size of the MolToImage conversion is 300x300, 'square' shaped molecules end up appearing bigger under increased extent values, which means the 'scale' cannot really afford to be 'zoomed in' for horizontally oriented molecules that assume more of a 'linear' conformation. There exists a noticeable difference between the aspect options, auto and equal, but changing those without changing the frame size and other parameters causes distortion. All of this of course can be changed, but would be difficult to detect how to do so for each case. For the current methods that output PDF, the best combination I have come across is the most recently pushed commit (see also, plot directly beneath it), and likely suits most sizes/shapes (needs further testing to be sure).

Regarding the SVG output method, the svgutil code, for me, has been tricky for size, scale, and placement. There is a way to get the sizes after the original Mol or MPL objects are saved to SVG files which could make things easier, but it seems like minor changes in different steps of processing can cause big downstream changes for placement/proportions/scales. There is also a method that converts units between px, cm, pt, etc., but I have not had a chance to get into it yet.

It might be best to use what I have developed for SVGs as an option that can be specified, for now, and use the pdf method as default, unless conda has something similar to inkscape that we can include as a dependency in order to facilitate viewing, conversion, and notebook embedding?

test.pdf

orbeckst commented 1 year ago

The PDF looks sharp and can be zoomed in βœ…

The next thing is to make sure that we don't need to scale it to out it into a paper. To make it publication-ready we would want it to be "at final size", i.e., typical two-column wide figure (about 190 mm, just randomly used https://writing.stackexchange.com/questions/21658/what-is-the-image-size-in-scientific-paper-if-indicated-as-a-single-1-5-or-2-c ) and font size should not be smaller than 9 pt. I think the matplotlib default is 10 so if you make figures at final size, everything should look good.

orbeckst commented 1 year ago

(see also https://github.com/Becksteinlab/making-prettier-plots/blob/master/Making%20better%20plots.ipynb )

cadeduckworth commented 1 year ago

@orbeckst

Here are some plotting results after implementing existing method with svgutils and new method with cairosvg into dihedrals.py module.

svgutils is being used to create the SVG files/graphics and layer them. cairosvg is being used to convert to pdf and scale/save in the desired size. I added a kwarg to adjust the final exported pdf plot width, with the default set to 190mm. I also added conversion to pixels for compatibility with cairosvg.svg2pdf.

Example with a fairly cluttered molecule from SAMPL9 set (step=1000), before I added a little white space to the right border to allow for longer solvent names: S07_C21-C22-C25-F27_violins.pdf

SM25 & SM26 (testing resources) after increasing whitespace on right border: SM25_C2-N3-S4-C5_violins.pdf SM26_O1-S2-C3-C4_violins.pdf

I have not been able to figure out how to suppress matplotlib output, which does not contain the Mol image.

Moving forward I am working on fixing the tests to accommodate what was changed during development. After I achieve full coverage again I will request a review and start making changes to the documentation to reflect changes and new functionality.

EDIT: see parentheses

cadeduckworth commented 1 year ago

@orbeckst

The codecov report of 98.80% is misleading; some of my current tests are weak at best, and some are possibly not testing anything at all. Once the new SVG methods are finalized, more specific tests need to be written, existing tests need to be revisited, and fixture scopes need to be optimized. Logger/error info needs to be added or updated. Once PR229 is merged the test_workflows_base.py module needs to test iterative usage of new plotting functionality for multiple projects. Docs will need a thorough rewrite. Tutorials and examples will need to be revisited.

The 'plotting only one solvent issue' needs to be revisited (see most recent comments, SM25 & SM26 pdfs). The current method almost works, but results in an unnecessary solvent label in the legend. I have already already come up with a way to probably fix this but have not tested it yet.

The plots look nice, they seem scalable (a few examples showed signs of adequate scaling, see also most recent comment for S07, SM255, SM26), and I am pleased with the combination of svgutils and cairosvg. I am not sure if the problem exists outside of Jupyter or IDEs, but I tried my hand at suppressing the matplotlib output with no luck.. and the output is a plot which does not contain the Mol object image, so it is annoying. I will look more into this.

EDIT: As a forewarning the changed files here are littered with comments and bad practices, so I will try to clean those up before you get a chance to review. This one required changes on more levels, so PR229 and PR242 are more complete and take precedence.

cadeduckworth commented 1 year ago

Reminder for myself:

https://github.com/Becksteinlab/making-prettier-plots/blob/master/Making%20better%20plots.ipynb

cadeduckworth commented 1 year ago

@orbeckst

For the plots, SVG to PDF no longer uses temporary files, and the one/two solvent plotting issue has been resolved. If everything looks good I will start cleaning things up and finishing the docs and tests.

EDIT: due to some of the new additions, it might be beneficial to split up some of the new or existing functions into several smaller functions

cadeduckworth commented 1 year ago

@orbeckst All requested changes addressed (except for some docs) and new tests developed. Please do a quick review and if everything looks good, then I will do a final cleanup and complete all new documentation.

EDIT: trouble with 3.7 again, see most recent CI

cadeduckworth commented 1 year ago

@orbeckst There is some documentation that is lacking or needs attention, but I am hesitant to commit to too much of a docs overhaul since much is anticipated to change in #244.

Is it possible that we can decide what should be minimally documented to avoid too many rewrites?

EDIT: there is much cleaning up and many kwarg consistency/dependency issues I need to fix, so let's reassess the docs situation once I have that sorted out.. sorry for any confusion.

cadeduckworth commented 1 year ago

@orbeckst I did a first pass on updating the docs and kwargs, but I will come back to it with fresh eyes later. I had to put a bandaid on the figdir situation (commit - 60636b4) so the base module did not break, but as it is written, figdir really should be required as a positional argument for this PR. The current fix works pending issue #244.

Docs problems worth mentioning are intersphinx references for MDA select_atoms and the svgutils module.. haven't figured those out yet after including in mdpow/doc/sphinx/source/conf.py.. however the svgutils docs do not look like they support it, or I am looking in the wrong place. Also, I have not yet mentioned cairosvg so I need to test that one out in the docs too.

I am still having issues with tests for Python 3.7, so I will come back to this after the weekend.

I would like to finish up this PR as soon as possible and take care of #244 so we can move forward with SAMPL9. If you get a chance to review this over the weekend that would be great, otherwise we can catch up on Monday. Thanks, and have a great weekend!

cadeduckworth commented 1 year ago

reminder for myself, write test for the following:

if molname is None:
    molname = resname
test_resname = 'foo'
run_analysis(resname=test_resname)
assert molname == test_resname

EDIT: not necessary, expected behavior is explicitly defined and cannot fail

orbeckst commented 1 year ago

mdpow/doc/sphinx/source/conf.py.. however the svgutils docs do not look like they support it, or I am looking in the wrong place.

See https://www.sphinx-doc.org/en/master/usage/extensions/intersphinx.html for the intersphinx dict.

orbeckst commented 1 year ago

Ping me if you have specific questions or if the PR is ready for review.

cadeduckworth commented 1 year ago

reminder: check if intersphinx mapping works for cairosvg, svgutils, and pypdf

cadeduckworth commented 1 year ago

@orbeckst It looks like some of the failed tests are related to RDKit. There may have been an update, so I will look into it.

cadeduckworth commented 1 year ago

@orbeckst I think this might be the cause of four tests that fail starting with Python>=3.9: RDKit issue

The environment used for the COW analyses includes RDKit 2022.09.1, and the issue was addressed in a build of RDKit version 2023.03.1. The builds appear to be specific to Python versions, and it looks like some cases of index ordering (SMARTS) is the only relevant change causing those tests to fail.

Python=3.7 CI uses the 2022 version, and Python=3.8 uses a build from the 2023 version, but neither fail, so I am unsure how to solve this problem for all possible version combinations. At first glance, based on the atoms/bonds highlighted in our plots for COW analysis, it seems like there are no errors with the 2022 version (just different indices than some 2023 builds I guess).

EDIT: Python=3.10 with RDKit=2022.09.1, the version combo in the environment I am using, seems to be working for our purposes.

orbeckst commented 1 year ago

I looked at the test failures but I don't know what is the correct output. Is it possible that our current reference values are incorrect?

We could just require rdkit >=2023 (which is available from 3.8 up) (and then axe 3.7).

cadeduckworth commented 1 year ago

I looked at the test failures but I don't know what is the correct output. Is it possible that our current reference values are incorrect?

We could just require rdkit >=2023 (which is available from 3.8 up) (and then axe 3.7).

@orbeckst The issue with this solution is that the 2023 build for 3.8 passes (the builds are version specific), with the same results as the 2022 build for 3.7. The change that causes test failure begins with RDKit 2023, Python >= 3.9.

cadeduckworth commented 1 year ago

Another failed test unique to Python=3.7:

______________ TestAutomatedDihedralAnalysis.test_DataFrame_input ______________

self = <mdpow.tests.test_automated_dihedral_analysis.TestAutomatedDihedralAnalysis object at 0x7f854ad789d0>
SM25_tmp_dir = PosixPath('/tmp/pytest-of-runner/pytest-0/test_DataFrame_input0/SM25')
dihedral_data = (             selection solvent interaction lambda     time    dihedral
0         O1-S4-C6-C10   water     Coulomb   0...1000  50000.0  211.129909
15415  S4-O1-C5-C9   water         VDW   1000  50000.0  207.238198

[15416 rows x 6 columns])

    def test_DataFrame_input(self, SM25_tmp_dir, dihedral_data):
        df, _ = dihedral_data
        dihedrals.automated_dihedral_analysis(dirname=SM25_tmp_dir, figdir=SM25_tmp_dir,
                                              resname=resname, molname=molname,
                                              solvents=('water',), dataframe=df)
>       assert (SM25_tmp_dir / 'SM25' / 'SM25_C10-C5-S4-O11_violins.pdf').exists(), 'PDF file not generated'
E       AssertionError: PDF file not generated
E       assert False
E        +  where False = <bound method Path.exists of PosixPath('/tmp/pytest-of-runner/pytest-0/test_DataFrame_input0/SM25/SM25/SM25_C10-C5-S4-O11_violins.pdf')>()
E        +    where <bound method Path.exists of PosixPath('/tmp/pytest-of-runner/pytest-0/test_DataFrame_input0/SM25/SM25/SM25_C10-C5-S4-O11_violins.pdf')> = ((PosixPath('/tmp/pytest-of-runner/pytest-0/test_DataFrame_input0/SM25') / 'SM25') / 'SM25_C10-C5-S4-O11_violins.pdf').exists

mdpow/tests/test_automated_dihedral_analysis.py:310: AssertionError
orbeckst commented 1 year ago

I am going to remove 3.7 #248 in a new PR, that's one thing less to worry about.

orbeckst commented 1 year ago

@cadeduckworth , the MDAnalysis RDKit converter docs https://docs.mdanalysis.org/stable/documentation_pages/converters/RDKit.html#MDAnalysis.converters.RDKit.RDKitConverter state

Changed in version 2.2.0: Improved the accuracy of the converter. Atoms in the resulting molecule now follow the same order as in the AtomGroup.

That's relevant because the Notes stated

The _MDAnalysis_index property of the resulting molecule corresponds to the index of the specific AtomGroup that was converted, which may not always match the index property.

I think this says that for MDA >= 2.2, we can assume that MDA atom indices and RDKIT mol indices agree.

orbeckst commented 1 year ago

@cadeduckworth I merged #249 β€” please go ahead with your PR! πŸš€

cadeduckworth commented 1 year ago

reminder: check if intersphinx mapping works for cairosvg, svgutils, and pypdf

To-Do

cadeduckworth commented 1 year ago

@orbeckst

The codecov report of 98.80% is misleading; some of my current tests are weak at best, and some are possibly not testing anything at all. Once the new SVG methods are finalized, more specific tests need to be written, existing tests need to be revisited, and fixture scopes need to be optimized. Logger/error info needs to be added or updated. Once PR229 is merged the test_workflows_base.py module needs to test iterative usage of new plotting functionality for multiple projects. Docs will need a thorough rewrite. Tutorials and examples will need to be revisited.

The 'plotting only one solvent issue' needs to be revisited (see most recent comments, SM25 & SM26 pdfs). The current method almost works, but results in an unnecessary solvent label in the legend. I have already already come up with a way to probably fix this but have not tested it yet.

The plots look nice, they seem scalable (a few examples showed signs of adequate scaling, see also most recent comment for S07, SM255, SM26), and I am pleased with the combination of svgutils and cairosvg. I am not sure if the problem exists outside of Jupyter or IDEs, but I tried my hand at suppressing the matplotlib output with no luck.. and the output is a plot which does not contain the Mol object image, so it is annoying. I will look more into this.

EDIT: As a forewarning the changed files here are littered with comments and bad practices, so I will try to clean those up before you get a chance to review. This one required changes on more levels, so PR229 and PR242 are more complete and take precedence.

To-Do

cadeduckworth commented 1 year ago

To-Do

New software requirements:

cadeduckworth commented 1 year ago

@orbeckst Please see lines 143 and 144 (dihedrals.build_universe). If MDPOW supports single solvent Pow calculations (does it?), should we not always assume that results will be available for water by default?

EDIT: Maybe it is best to default to using the topology and trajectory from the first solvent specified?

orbeckst commented 1 year ago

It supports single solvent solvation free calculations, yes. You're right, the water directory does not necessarily exist.

However, just raise an issue that we improve build_universe() to be smarter about finding the topology. Right now it's working for everything that we need.

Let's not use the ITP file because that might not be present, e.g., when someone else ran just the FEP on a cluster.

orbeckst commented 1 year ago

Actually, your idea to use the first solvent seems pretty good and easy enough to implement. You could even make solvent='water' a kwarg and then the default is the current behavior.

cadeduckworth commented 1 year ago

It supports single solvent solvation free calculations, yes. You're right, the water directory does not necessarily exist.

However, just raise an issue that we improve build_universe() to be smarter about finding the topology. Right now it's working for everything that we need.

Let's not use the ITP file because that might not be present, e.g., when someone else ran just the FEP on a cluster.

ITP file?

orbeckst commented 1 year ago

ITP file?

I was thinking along what I used for the dihedral sampler: use the initial files in the top level directories, including the topology file, but on second thought it makes no sense ebcause that files does not contain the solvent and is not useful for using with the trajectory. The TPR is really what we want.

Sorry to have confused you.

cadeduckworth commented 1 year ago

in response to

Docs say hue_order takes a list of strings, so we will do the list conversion as part of the kwarg.

EDIT: I think hue_order=["water", "octanol"] was the original kwarg from the dihedral violins code you provided me with to start developing the figure functions a long while ago.

cadeduckworth commented 1 year ago

@orbeckst Ready for review, all requested changes resolved or responded to. I raised several new issues and commented or updated some existing issues to document what needs to be done soon, but is not immediately necessary for merging this PR. Minor reviewing left to do on my part tomorrow as a sanity check, but I hope to get this merged and resubmit the SAMPL9 jobs Monday afternoon.

orbeckst commented 1 year ago

A momentous occasion: PR #243 has been merged. πŸš€ 🌟

Congratulations!