ReliaSolve / cctbx_project

Computational Crystallography Toolbox
https://cctbx.github.io

Found files that are slow to optimize in Reduce2 #188

Open russell-taylor opened 2 years ago

russell-taylor commented 2 years ago

This file reports an irreducible clique of size 4 that has to be reduced using brute force. Reduce running on this file is very fast (a fraction of a second) because it does not have any cliques, only singletons (even with -rad0.5 or -rad1.0). Reduce2 takes many minutes (1909 seconds).

Fe_1brf_rubredoxin_0.95A.pdb (1brf.pdb)

https://app.zenhub.com/files/298640964/713768fa-a6c8-447a-9f2d-f15137ccc4cc/download

The residues are in chain a: Cys 5, Cys 8, Cys 38, Cys 41

Made separate issue for this: https://github.com/ReliaSolve/cctbx_project/issues/256

russell-taylor commented 1 year ago

(done) An incredibly slow file (more than 2 days) is 1zz0.pdb. Original Reduce takes several seconds to process it. It has 30 cliques, with the largest of size 6 (at least 3 of them), and more total singletons than total Movers within cliques.

russell-taylor commented 1 year ago

4fen has a number of 6-way and 7-way cliques caused by NCO and it ends up taking 15 minutes to optimize these along with the other single-hydrogen rotators. We cannot compare with original Reduce because it does not add NH3s to the NCOs.

russell-taylor commented 1 year ago

(done) Robert's note on the rotate-around-axis call being slow, relevant to speeding up 1xso:

Running mmtbx.reduce2 with the Visual Studio debugger attached to the process, it seems the code spends a lot of time in the function _rotateAroundAxis(atom, axis, degrees) in Movers.py.

It looks like the functions it is using could be replaced with functions such as scitbx.matrix.rt_for_rotation_around_axis_through(point, angle).

I haven't tried it myself, but their internals have been boosted in C++ so they should be orders of magnitude faster than doing it explicitly in Python code.

Switching to this for just the offset rotation, wrapped in the following way, made the code slower (from 21 to 24 seconds total) while getting the same answer:

  # Inside _rotateAroundAxis(): rotate the offset vector about the axis
  # through the origin in the direction axis[1], then add the rotated offset
  # back to the point on the axis nearest the atom.
  m = scitbx.matrix.col((0.0, 0.0, 0.0))
  newOffset = m.rt_for_rotation_around_axis_through(
    point=scitbx.matrix.col(axis[1]), angle=degrees, deg=True) * rvec3(offset)
  return nearPoint + lvec3(newOffset)

The whole function is taking a long time, but the individual call to rotate_around_origin was not a large portion of it. Will need to figure out how to do more of the work in that one function and do fewer conversions.

The rt_for_rotation_around_axis_through() function is defined in Python rather than native code; it does some local math and then builds a rotation matrix.

russell-taylor commented 1 year ago

When we do the whole thing after position determination using:

# Build the rotation matrix from the axis direction, then apply the full
# rotation about the line through axis[0] directly to the atom position.
ctr = scitbx.matrix.col(axis[0])
r = scitbx.matrix.col(axis[1]).axis_and_angle_as_r3_rotation_matrix(angle=degrees, deg=True)
ret = scitbx.matrix.rt((r, ctr - r*ctr)) * scitbx.matrix.col(pos)

it takes 18 seconds rather than 21 and gets the same answer.

This function remains the bulk of the time for 1xso, with most of the time now taken by matrix multiplications and around 1/3 by the axis_and_angle function, which calls other Python functions. The matrix multiplications also look like Python functions, so it is not clear why he thought these would be native code. The scitbx/math/__init__.py also seems to be Python rather than C++, so it is not clear why the comment said it was native...

russell-taylor commented 1 year ago

The scitbx.matrix documentation says that the scitbx.math module provides faster C++ alternatives to some algorithms here, so we should check there next. Otherwise, maybe we need to drop this into C++? First check other optimizations and see if they are using time elsewhere.

The scitbx.matrix and scitbx.math libraries seem to be Python all the way down. Asked Nigel and Oleg whether there is a native-code replacement for this and if not where I should add it.

Nigel responded with: Check out cctbx_project/scitbx/math/rotate_around_axis.h. It looks like it is implementing generic rotation along with usage of sin/cos precalculated tables to make it even faster. I think this will be a good place for your function as well if for some reason existing ones do not work.

There is actually a rotate_points_around_axis() function there that loops over a flex array of points, which should be much faster for the cases where we're rotating a lot of points, but neither it nor the rotate_point_around_axis() from that header seems to be called from Python code, and I can't just use it as scitbx.math.rotate_point_around_axis().

Calling scitbx.matrix.rotate_point_around_axis() is about as fast as the function above. There is a comment in the file that defines it saying that the C++ code is slower because of the tuple-vec3 conversion overhead. When we call this function we end up doing lots of tuple conversion, which slows us down because it takes and returns tuples and we actually want vectors. It also looks like this function is not mapped in scitbx_math_ext.
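For reference, roughly what that call looks like from Python. This is a sketch; the keyword-argument names and the deg flag are from memory rather than checked against the scitbx source, but it shows why tuples flow in and out of the call:

  from scitbx import matrix

  # Rotate a point 90 degrees around the z axis through the origin. The axis
  # points and the rotated point are all plain 3-tuples, which is where the
  # tuple-conversion overhead described above comes from.
  new_xyz = matrix.rotate_point_around_axis(
    axis_point_1=(0.0, 0.0, 0.0),
    axis_point_2=(0.0, 0.0, 1.0),
    point=(1.0, 0.0, 0.0),
    angle=90.0,
    deg=True)
  print(new_xyz)  # a plain 3-tuple result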

(done) We're also subtracting and normalizing the second axis point, adding it back in, and undoing it, and then the function we call does it all over again, so we can speed things up by changing the parameters to the function (but this will make us do extra work in the 3-point docking code and the test code, which actually has a directional axis).

Wrote a C++ function that adjusts the parameters and calls the templated C++ function in scitbx/math.
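As a small Python illustration of the redundancy described above (the names here are made up; the real fix is the C++ wrapper just mentioned), the axis can be converted to a point-plus-unit-direction form once, up front, and reused for every atom that gets rotated:

  import math

  def point_and_unit_direction(p0, p1):
    # Convert a two-point axis into (point, unit direction) a single time so
    # that per-atom rotation calls do not re-subtract and re-normalize it.
    d = [b - a for a, b in zip(p0, p1)]
    length = math.sqrt(sum(x * x for x in d))
    return p0, [x / length for x in d]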

Improvements through 8/25/2023 brought this down to 15 seconds of wall-clock time when run from the command line.

Improvements through 8/31/2023 (a custom rotator C++ function wrapped just for this rotator) brought this down to 13 seconds of wall-clock time from the command line.

Improvements through 9/4/2023 (the custom rotator C++ function wrapped just for this rotator) kept this at 13 seconds of wall-clock time from the command line.

russell-taylor commented 1 year ago

(done) 4fen

Modified _BruteForceOptimizer._optimizeCliqueCoarse() and _CliqueOptimizer._optimizeCliqueCoarse() to not set a Mover state if it is already in that state. This did not seem to improve the fractions. Modified the FastOptimizer to track and report the number of cached vs. calculated atom scores.

This dropped the time from around 5 minutes working time to around 3 minutes for 4fen. It had a calculated/total ratio of about 30%.
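For context, a minimal sketch of the kind of per-atom score caching being counted here, keyed on the atom and the positions of the Movers that can move it (illustrative only; the real FastOptimizer differs in detail):

  class CachingAtomScorer(object):
    # Cache per-atom scores keyed on the atom's i_seq plus a hashable
    # description of the relevant Mover positions, counting hits and misses.
    def __init__(self, scoreAtomFn):
      self._scoreAtomFn = scoreAtomFn   # the expensive, non-cached scoring call
      self._cache = {}
      self.numCalculated = 0
      self.numCached = 0

    def scoreAtom(self, atom, stateKey):
      # stateKey must change whenever any Mover that can move this atom
      # changes state; otherwise a stale score would be returned.
      key = (atom.i_seq, stateKey)
      if key in self._cache:
        self.numCached += 1
      else:
        self.numCalculated += 1
        self._cache[key] = self._scoreAtomFn(atom)
      return self._cache[key]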

(done) Why do the timings still not show large counts for the optimization, even though it took up the vast majority of the time? And why does the calculation take more like 15 wall-clock minutes? Did things stop early? Running again got answers closer to the initial run.

(done) Is it getting the same answer? Yes.

Running the original way: wall clock time was 13 minutes and 44 seconds. User+sys time was 120 seconds. The resulting optimization has the hydrogens in the same locations, at least for the NCOs. The new approach takes 12 minutes and 21 seconds, so it is a bit faster and we'll leave it in.

(done) Less than 2% of the time was spent in _PlaceMovers for this run. 40% of the entire time is spent in the _SingletonOptimizer._scoreAtom() function, which is the vast bulk of the time spent in FastOptimizer._scoreAtom(), so the caching is not taking a lot of time. 20% of the time is spent in each of BruteForceOptimizer._optimizeCliqueCoarse() and _CliqueOptimizer._optimizeCliqueCoarse(), accounting for another 40% of the total time. A total of 93% of the time was spent in _optimizeCliqueCoarse().

After moving the InteractionGraph calculations into C++ and using i_seq dictionary lookups, and after switching the PositionReturn type to C++: 95.5% of the atom scores were cached, 79% of the time was spent in the non-cached _scoreAtom call, and 6.5% in the cached one. The next-largest functions were _scorePosition (2.5%) and _setMoverState (2%). Total wall-clock time dropped to 7 minutes, 49 seconds. The bulk of that time is spent optimizing the size-7 and size-6 cliques, which contain irreducible size-5 and size-6 cliques.

(done) Why are the size-6 cliques irreducible? These are the NCOs. The NH3 hydrogens for opposite-side Movers should not overlap, but we may have the two far nitrogens overlapping when we have a large enough probe radius. The problem was a mistaken attempt to reduce the maximum-clique-size test. After fixing this, the wall-clock time is 7 minutes, 22 seconds and the fraction of cached scores was 66%; 37% of the time was in the non-cached _scoreAtom and 1% in the cached one. 27% was in the optimized _optimizeCliqueCoarse and 27% in the brute-force _optimizeCliqueCoarse (most of its extra work being in the function body).

When we make the score_dots() C++ function return a valid result with all 0's, the size-4 cut-graph calls dominate the run time, followed by the size-3 cut-graph calls. Command-line wall-clock time is 4 minutes, 58 seconds, and profiling shows the recursive and brute-force _optimizeCliqueCoarse sharing the bulk of the run time, with 43% in the function body of the optimized version and 43% in the body of the brute-force version. This is because we're basically terminating the recursion at clique size 1 almost all of the time.

(done) When we implement the brute-force and vertex-cut clique optimization in C++, we get a speedup of about 2.5X, with a total run time of 3 minutes, 6 seconds. (This code is still using some Python data structures internally, but the timing shows this is fairly insignificant.) 34% of the atom interactions are calculated, so caching is not helping us as much for this model, with 83% of the total run time spent in _scoreAtom. Only 1% of the time is spent exclusively in _optimizeCliqueCoarse (less than the inclusive time for adding hydrogens, reinterpreting the model, or building the interaction graph, and much less than rotatable_hd_selection). Around 2% is spent in the caching _scoreAtom and 83% in the main one.
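For reference, a Python sketch of the brute-force coarse clique search described above (the production version is the C++ code just mentioned, and these callback names are placeholders, not the real API):

  import itertools

  def brute_force_coarse(movers, num_coarse_positions, set_coarse_state, score_clique):
    # Try every combination of coarse positions for the Movers in one clique
    # and keep the best-scoring combination. The cost grows as the product of
    # the per-Mover position counts, which is why large cliques dominate.
    best_score = None
    best_states = None
    for states in itertools.product(
        *(range(num_coarse_positions(m)) for m in movers)):
      for m, s in zip(movers, states):
        set_coarse_state(m, s)          # move this Mover's atoms to position s
      score = score_clique(movers)      # sum of scores over the affected atoms
      if best_score is None or score > best_score:
        best_score, best_states = score, states
    return best_states, best_score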

(done) Removing the bonded heavy-atom neighbor from rotatable hydrogens (it had been added in reduce2 and is not in reduce) got the speed down to 56 seconds, but we realized that these must be put back in. Improvements through 8/25/2023 made this take 3 minutes, 3 seconds of wall-clock time when run from the command line.

Improvements made through 8/30/2023 made this take 2 minutes, 11 seconds of wall-clock time when run from the command line, with almost all of the time spent optimizing cliques.

Improvements made through 9/4/2023 made this take 2 minutes, 8 seconds of wall-clock time when run from the command line, with almost all of the time spent optimizing cliques.

Improvements made through 9/6/2023 made this take 2 minutes, 4 seconds of wall-clock time when run from the command line, with almost all of the time spent optimizing cliques.

Improvements made through 9/8/2023 made this take 48 seconds (31 seconds optimizing cliques).

Improvements made through 9/11/2023 made this take 30 seconds (22 seconds optimizing cliques, mostly in the Python code).

russell-taylor commented 1 year ago

(done) Model 1rrr takes 54 seconds wall-clock, so we're using it to try to optimize for medium-sized models. It took 70 seconds when run in the profiler.

There are twenty single-hydrogen rotators, all singletons.

From Dorothee: I added checks for two select() calls. With this, 1RRR runs 5s faster for me. Unfortunately, it won't speed up things for 3j4p, as this example requires the two select() calls. It may be possible to consolidate two of the select calls, but it needs testing. Will keep you posted.

This dropped 1rrr runtime to 39 seconds on the command line.

Improvements through 8/25/2023 made this take 35 seconds of wall-clock time when run from the command line.

Improvements made through 8/30/2023 also take 35 seconds of wall-clock time from the command line; the profiler output from these runs is not reproduced here.

Speeding up the determination of single-hydrogen rotators sped up the optimization and reduced the overall time to 18 seconds. We'll be removing _ReinterpretModel, at which point it should become even faster. Indeed, on 9/26/2023 we get 13.45 seconds without reinterpretation (11 seconds to add hydrogens, 2 to optimize: 0.13 to place on each model).

Model 3j4p also takes the bulk of its time in optimization, most of it in coarse clique optimization (mostly in _scorePosition, which is mostly in _scoreAtom). When we modify the mmtbx_probe_ext DotScorer::score_dots() function to return a valid result of all zeroes, the command-line time drops to 30 seconds from 104, and profiling shows 2% of the time in _scoreAtom (down from 50+%). There is still 8% in _scorePosition, so we may have some room there, but the vast bulk is in the actual C++ calculations.

russell-taylor commented 1 year ago

@todo: Removing the unrecognized ligands from 7k00 to see how long Reduce2 takes to run on a large model. Reduce takes around 20 seconds to optimize 7k00, but it does not have cliques that are as large.

7k00 takes a long time in hydrogen addition (presumably because it is so large). Slightly more than half of the time is in mmtbx.model.model.process, mostly in setting up restraints. 17% is in mmtbx.model.model.select (and another 9% calling this inside another function). Cryo-EM structures are usually bigger.

When the unrecognized ligands are taken out of 7k00, it takes more than 12 hours, with the vast majority of the time spent in clique optimization. Original Reduce had one clique of size 5, nine of size 4, and most of the rest of size 2, but Reduce2 had one of size 11, one of size 10, and twenty-five larger than 4. Reduce had 1302 Movers in cliques and 5068 singletons (6370 total Movers). Reduce2 had 1853 Movers in cliques and 4442 singletons (6295 total Movers). Reduce2 was unable to place a large number of hydrogens due to insufficient restraints, which could explain why there are fewer Movers.

Replaced the internal code of DotScorer::score_dots() with a simple return of 0 scores and a valid response, to remove all of the time spent in the C++ scoring code so that we can count just the Python time, including wrapping/unwrapping as part of the function calls. The clique solvers are taking several seconds for the larger-than-size-4 cliques, and even some of the size-3 and size-4 ones are a bit slow; one of the 7-cliques took many minutes. Running on 1xso verified that the optimization results are consistent with scoring not being done. This implies that for large cliques, the atom shuffling and Python bookkeeping are a significant portion of the solver time above and beyond the C++ code. Clique optimization for 1xso with the C++ scoring jumpered out took 0.094 wall-clock seconds; putting the code back in took 1.148 seconds, so the time for small cliques is dominated by the C++ code. For 4fen (several 6-7-Mover cliques), optimization time with jumpered-out C++ was 262.563 seconds and with all calculations was 428.386, indicating that the Python code is taking ~60% of the time for medium-sized cliques.

(done) Consider pulling the clique-optimizing code into C++. Pulling the PositionReturn class into C++ seemed to speed this up tremendously.

(done) Constructing the interaction graph takes minutes. This made it worth checking whether just the atom shuffling is slowing us down a lot (consistent with clique optimization being slow even without the C++ calculations being done). Moving _PairsOverlap into C++ and using seq_id rather than the atom as the dictionary lookup key made this much faster, on the order of the re-interpretation time and faster than the hydrogen-placement time.

With optimizations made through 8/11/2023, the total time dropped to around 81 minutes (down from 12+ hours). The fraction of atom scores that were calculated is 0.02 (98% cached). The functions taking the most time were not properly reported by the profiler; the vast majority is in clique optimization.

With optimizations made through 8/21/2023, the total time dropped to 34 minutes. The time is evenly split between selection of rotatable hydrogens, time to place Movers, and time to optimize cliques. The fraction of calculated atom interactions was 0.21.

With optimizations made through 8/25/2023 (including putting back the tests for heavy neighbors of rotating hydrogens), it completes in under 42 minutes. The fraction of calculated atom interactions was 0.21. Times in seconds: add hydrogens 194, interpret 132, optimize 2212.

With optimizations made through 8/30/2023 (including pulling objects, methods, and caches into C++ and caching the interacting dots per atom and position), it completes in under 34 minutes. Add hydrogens 184, interpret 132, optimize 1731. Another run's optimization: time to select rotatable hydrogens 580, place Movers 506, interaction graph 128, singletons 36, cliques 615, fine 13, fixup 4, total optimize 1916. With changes made through 8/31/2023, optimization was select rotatable hydrogens 408, place Movers 440, interaction graph 124, singletons 36, cliques 596, total 1655. (Speeding up the atom rotation did not seem to speed up Place Movers vs. select rotatable hydrogens.) Run on 9/4 after new opts: Add Hydrogen 196, Interpret 140, Optimize 1632 (select rotatable 403, place Movers 441, interaction graph 120, singletons 35, cliques 584, fine 12, fixup 5).

With optimizations made through 9/6/2023 (including faster single-hydrogen rotator finding), it completes in under 21 minutes: Add hydrogen 188, interpret 136, optimize 890 (select rotatable 0.2, place Movers 63, interaction graph 100, singletons 37, cliques 613, fine 12, fixup 5). The bulk of the time is in C++ Mover optimization. Fraction calculated is 0.29, so we're probably seeing a bottleneck inside score_dots(). Making that function return immediately dropped singleton time to 12 seconds and clique time to 20 seconds. Making the inner-loop check_dot() function return immediately dropped singleton time to 21 seconds and clique time to 124, so most of the time is in the inner function but a significant fraction is in the outer.
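Schematically, the structure being timed here is a tight two-level loop; the sketch below is only the shape of it in Python, not the actual mmtbx_probe C++ implementation, and it shows why the inner check dominates while the outer loop still carries a significant fraction:

  def score_dots(surface_dots, neighbors, check_dot):
    # Outer level: iterate over the precomputed dots on one atom's surface.
    total = 0.0
    for dot in surface_dots:
      # Inner level: classify this dot against nearby atoms (contact, clash,
      # hydrogen bond) and return its contribution to the score.
      total += check_dot(dot, neighbors)
    return total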

With optimizations made through 9/11/2023 (including putting more data into ExtraAtomInfo and using vector storage in ExtraAtomInfoMap), it completes in under 12 minutes: add hydrogen 191, interpret 135, optimize 352 (place Movers 61, interaction graph 97, singletons 13, cliques 140, fine 4, fixup 5).

We have the same results for 7k00 on 9/20/2023 after the Interaction Graph speedups. Wall-clock time under 11 minutes: add hydrogen 190, interpret 125, optimize 262 (place Movers 61, interaction graph 31, singletons 12, cliques 114, fine 3, fixup 5).

On 9/21/2023 after making the AtomMoverLists C++ class, wall-clock time was around 10 minutes: add hydrogen 200, interpret 135, optimize 242 (place Movers 62, interaction graph 2.8, singletons 13, cliques 119, fine 3, fixup 6).

On 9/26/2023 after removing reinterpretation, wall-clock time of 7 minutes, 30 seconds: add hydrogen 197, optimize 249 (place Movers 65, interaction graph 2.9, singletons 12, cliques 120, fine 3, fixup 5).

On 10/2/2023 after making much sparser single-hydrogen orientation checks, wall-clock time is 5 minutes, 24 seconds: add hydrogen 197, optimize 99 (place Movers 42, interaction graph 0.3, singletons 4, cliques 5.5, fine 4, fixup 5).

russell-taylor commented 1 year ago

(done) Both 7k00 and 1zz0 have much larger clique sizes for Reduce2 than Reduce, even when there are fewer total Movers.

russell-taylor commented 1 year ago

(nope) Try only creating the fine positions when they are asked for in the MoverRotator, rather than pre-computing all of them.

russell-taylor commented 1 year ago

(nope) Dropping the coarse resolution for NH3 rotators back to the original (30 vs. 15) will speed up cliques, but will be less precise -- there were differences between Reduce2 and Reduce at Reduce's original stride. Single-hydrogen rotators have a value of 10 in both Reduce and Reduce2.

Switching this for 1zz0 changed the time from 110 seconds total (76.4 opt) to 110 (77.2 opt), so we're not seeing an impact on speed for this molecule.

russell-taylor commented 12 months ago

(done) Pass references rather than copies into InteractionGraph.h routine.

russell-taylor commented 12 months ago

(done) Can we find a faster way to locate rotatable hydrogens for our single-hydrogen rotators?

russell-taylor commented 12 months ago

(done) Single-hydrogen rotators go all the way around rather than only pointing at potential acceptors and one other location; the original Reduce did not include the full coarse set.

1yk4 is a good test model for single-hydrogen optimization. It took under 43 seconds of wall-clock time when I ran on this file. Almost all of that time was optimization, and almost all of that was clique optimization. There is a single 3-way clique of single-hydrogen rotators in each of 3 alternate conformations:

  Set of 3 Movers:  Totals: initial score 7.68, final score 17.35
   SingleHydrogenRotator at chain A CYS 6 HG Initial score: 0.64 final score: 4.45 pose Angle 135.0 deg . .
   SingleHydrogenRotator at chain A CYS 39 HG Initial score: 3.35 final score: 7.43 pose Angle 131.0 deg . .
   SingleHydrogenRotator at chain A CYS 42 HG Initial score: 3.69 final score: 5.47 pose Angle -37.0 deg . .

(done) Consider looking for the coarse orientation that has the largest gap to the nearest atom at that orientation (an estimate of the least-clashing angle) and that is not near an acceptor. Fine rotations will happen around this point, out to half of the coarse step size in each direction.

(nope) Consider increasing the coarse spacing to 60 degrees (up from 10) on the assumption that it will probably find a good non-colliding direction that way while still always finding a close acceptor when needed because of the explicit search. The fine search will cover a range of +/-30 degrees once the coarse position is chosen (including if an acceptor angle is chosen).

(done) Rerun new_reduce_regression script to ensure we're getting good results.

Modified the code to pick the orientation that has the best contact (closest to just touching) with any nearby atom, even an Acceptor, rather than picking the orientation that is furthest from contact. This improves the scores compared to the furthest-contact approach, and made some of them better than Reduce scores, but some are still worse.
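The selection rule amounts to a one-liner; in the sketch below, gap_at(angle) is a hypothetical helper returning the signed gap to the nearest non-bonded atom at a given coarse angle:

  def best_contact_angle(coarse_angles, gap_at):
    # Prefer the orientation whose nearest neighbor is closest to just
    # touching (gap nearest zero) rather than the one farthest from contact.
    return min(coarse_angles, key=lambda a: abs(gap_at(a)))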

(done) Run a comparison using 1xso and perhaps other models against original Reduce to check Mover-by-Mover behavior and find out which are worse. (done) How can this be, even though the Mover-by-Mover comparison shows better summed scores for reduce2 compared to the output saved by running original reduce on my laptop (some are worse by up to 1 part in 6, some better by this much or more)? The new_reduce_regression script is checking for 4-long bonded-neighbor chains in all cases (as expected). Rerunning to compare against the output PDB generated by that script to see if the same thing happens shows that it does. It looks like the improvements made by the Mover hydrogens are offset by the better C and N contacts from the original model, which must be from non-rotatable hydrogens? However, the hydrogen bond lengths are the same for the old and new models and there are only very slight shifts in angles between them for the non-rotatable hydrogens. The C and N contacts and overlaps are pretty much the same for both new approaches (total check and single check), with the O contacts being better for both new approaches (and even better for total check). The differences are about 1 part in 100 of the total scores, so they are very close to each other in all cases.

Spot check of single rotator OG1 THR A 98 shows better contacts with reduce2. (There are several amide flips in different orientations and some NH3 groups at different orientations.)

(done) Consider adding more tests at the original coarse orientations, but only when the new orientation is at least N degrees away from a previous choice (including best touch and any potential acceptors and already-added ones). This would only add more tests to cover additional areas and may not take too much extra time.

russell-taylor commented 12 months ago

(done) Calls to score_dots(), with their own calls to check_dot(), are taking the bulk of the time in 7k00, with check_dot() taking the larger fraction. Looking for a model system that runs faster to try to optimize this, so using 1zz0 (which takes 96 seconds total, with initially 1.5 seconds for singletons and 21 seconds for cliques).

With the above changes, 4fen goes from 2 minutes 4 seconds wall-clock time (mostly optimizing cliques) to 48 seconds (31 seconds optimizing cliques, of which 2 is Python) while getting (as expected) the same answer.

With the above changes, 7k00_trim goes from under 21 minutes wall-clock time (613 optimizing cliques) to 14 minutes 17 seconds (257 optimizing cliques).

russell-taylor commented 12 months ago

Much of the remaining time in coarse optimization is in check_dot() when it calls getMappingFor() on the atoms to get their extra atom info.

(done) Consider whether the problem is that we're copying the structures out rather than passing references.

(done) Consider storing this mapping internally in a vector indexed by i_seq rather than in a map keyed by the atom's raw pointer, to speed up the lookup. See if we still need to keep the pointers to the atom data in that case; I think not.

The coarse-optimization code takes 21-22 seconds on 4fen when using a vector rather than a map.
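An illustrative Python version of the map-to-vector change (the real ExtraAtomInfoMap is C++; the class and helper names here are made up for illustration):

  class ExtraAtomInfoByISeq(object):
    def __init__(self, atoms, extra_info_for):
      # Flat list indexed by i_seq, so each lookup is a constant-time array
      # access with no hashing of atom pointers.
      self._info = [None] * (max(a.i_seq for a in atoms) + 1)
      for a in atoms:
        self._info[a.i_seq] = extra_info_for(a)

    def getMappingFor(self, atom):
      return self._info[atom.i_seq]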

russell-taylor commented 11 months ago

(done) Further speeding up of the interaction graph, using 1zz0 as a fast optimization case.

We're calling CoarsePositions() and FinePositions() on each Mover, so we could do this in Python and call C++ code that receives const references to af::shared<PositionReturn> vectors. Both methods return a graph, and one also returns a dictionary of sets of interacting Movers looked up by atom.

Do we want to pull the calculation of the ranges into C++?

After pulling the contents of the AABB construction into C++, we get the same answers and it takes 1.0 seconds to compute the interaction graph for 1zz0. This is the expected reduction if we took almost all of the time out of running the function.
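The heart of the AABB approach is a cheap box-overlap test; a hedged Python sketch follows (the real code builds each Mover's box over all of its possible atom positions, padded by the interaction distance):

  def boxes_overlap(box_a, box_b):
    # Each box is ((xmin, ymin, zmin), (xmax, ymax, zmax)). Two Movers can
    # only interact if their padded bounding boxes overlap on every axis.
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    return all(a_lo[i] <= b_hi[i] and b_lo[i] <= a_hi[i] for i in range(3))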

Pulling the lookup codes from InteractionGraphAllPairs() into C++ didn't change the runtime, indicating that the bulk of that function's work is already being done in C++.

Making the AtomMoverLists into a C++ data structure got the time to compute the interaction graph down to 0.1 seconds.

russell-taylor commented 10 months ago

@todo: Very high memory use and long run time. Tom gave us a file that he was trying to run refinement on; it has 142,140 Movers. With Reduce, this file takes a few minutes and uses around 4GB of RAM. With the hydrogen-placement part of Reduce2 it uses more than 180GB of peak RAM; in total it runs for 20.5 hours (times in seconds: 10,978 to add hydrogens, 61,898 to optimize: 44,963 place Movers, 5,663 dump atom info to string, 1,248 optimize singletons, 1,413 optimize cliques, 2,348 fixup), at least partly due to thrashing on a 32GB machine. The input file is at https://drive.google.com/file/d/1jRCUXHfC25kMbRM7MrK1gkNArGGKW7Cg/view?usp=sharing and is 250MB in size.

Running on the single instance of 3kz4 shows 69 seconds to add hydrogens and 48 to optimize (33 place Movers, 3 for excluded atoms, 3 in Helpers.writeAtomInfoToString [dropped to 2.3 by reorganizing], 1 opt singletons, 1.4 opt cliques, 0.7 fine, 1.4 fixup); we’ll use it to try and find the hot paths.

russell-taylor commented 9 months ago

Speeding up _PlaceMovers:

russell-taylor commented 9 months ago

(done) Allow the command line to disable the generation of the atom-dump string, which is taking a large fraction of the total run time for this file.

Removing this drops the run time by 94.5 minutes.

russell-taylor commented 9 months ago

(done) Re-running Tom's file after the above changes resulted in 30,758 seconds to add hydrogens and 9,908 seconds to optimize (4298 place Movers, 1141 singletons, 1300 cliques, 2446 fixup). The profiler's hot-function list is not reproduced here.

The call tree has most of the time in reduce_hydrogen.run.

(nope) Replacing the Rotate*DegreesAroundAxisDir() calls with the fast sin/cos routines makes 3kz4 Mover placement take 17.8 seconds; when using plain sin/cos it takes 16.7, so we're not winning by swapping out this call. Inlining the setup function and computing parameters at compile time made the speed 16.3, so not worth the trouble. Switching to an approach that precomputes everything, including the references, and uses static double arrays gets it to 15.9 seconds, so still probably not worth the trouble; in fact, the variation seems to be in the noise. Calling the rotate_point_around_axis() function twice to make sure it is what is taking the time: nope, it looks like it took 17.559 seconds, so it is not the issue? Running this again inside the Python profiler to check showed 6.88% in the hot path with NH3 rotators when called once and 8.23% when called twice; wall-clock time goes from 16.9 to 18.9 seconds (about 1/8th more). On the other hand, returning just the atom coordinates without transforming takes 16.7 seconds (about the same), so it looks like the rotation function itself is not taking the time.
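For reference, a Python sketch of the precomputed sin/cos table idea being tested here (the actual experiment above was in the C++ placement code; the one-degree step size is an assumption):

  import math

  STEP_DEGREES = 1.0  # assumed table granularity
  _COS = [math.cos(math.radians(i * STEP_DEGREES)) for i in range(int(round(360 / STEP_DEGREES)))]
  _SIN = [math.sin(math.radians(i * STEP_DEGREES)) for i in range(int(round(360 / STEP_DEGREES)))]

  def fast_cos_sin(degrees):
    # Round to the nearest table entry; adequate when rotations only ever
    # occur at fixed angular steps, which is what the table approach assumes.
    i = int(round(degrees / STEP_DEGREES)) % len(_COS)
    return _COS[i], _SIN[i]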

(done) Consider writing _posesFor() in C++. Replacing the body of _posesFor() with a call to a C++ FindPosesFor() function gives the same answer on 1xso. For 3kz4 it takes 6.576 seconds to place Movers the new way and 16.227 the original way; the resulting PDB file is the same in both cases. Running inside the performance profiler shows that with the original approach, 9% of the total time during optimization is spent in _posesFor; with the C++ function it was 0.87%.

russell-taylor commented 3 months ago

Looking into speeding up mmtbx.conformation_dependent_library.mcl.update: