ReliaSolve / cctbx_project

Computational Crystallography Toolbox
https://cctbx.github.io

Found files that are slow to optimize in Reduce2 #188

Open russell-taylor opened 2 years ago

russell-taylor commented 2 years ago

This file reports an irreducible clique of size 4 that has to be reduced using brute force. Reduce running on this file is very fast (a fraction of a second) because it does not have any cliques, only singletons (even with -rad0.5 or -rad1.0). Reduce2 takes many minutes (1909 seconds).

Fe_1brf_rubredoxin_0.95A.pdb (1brf.pdb)

https://app.zenhub.com/files/298640964/713768fa-a6c8-447a-9f2d-f15137ccc4cc/download

The residues are in chain a: Cys 5, Cys 8, Cys 38, Cys 41

Made separate issue for this: https://github.com/ReliaSolve/cctbx_project/issues/256

russell-taylor commented 1 year ago

(done) An incredibly slow file (more than 2 days) is 1zz0.pdb. Original Reduce takes several seconds to process it. It has 30 cliques, with the largest of size 6 (at least 3 of them), and more total singletons than total Movers within cliques.

russell-taylor commented 1 year ago

4fen has a number of 6-way and 7-way cliques caused by NCO and it ends up taking 15 minutes to optimize these along with the other single-hydrogen rotators. We cannot compare with original Reduce because it does not add NH3s to the NCOs.

russell-taylor commented 1 year ago

(done) Robert's note on the rotate-around-axis call being slow, relevant to speeding up 1xso:

Running mmtbx.reduce2 with the Visual Studio debugger attached to the process, it seems the code spends a lot of time in the function _rotateAroundAxis(atom, axis, degrees) in Movers.py.

It looks like the functions it is using could be replaced with functions such as scitbx.matrix.rt_for_rotation_around_axis_through(point, angle).

I haven't tried it myself, but their internals have been boosted in C++ so they should be orders of magnitude faster than doing it explicitly in Python code.

Switching to this for just the offset rotation, wrapped in the following way, made the code slower (from 21 to 24 seconds total) while getting the same answer:

  # Inside _rotateAroundAxis(): rotate the offset vector about the axis
  # through the origin in the direction axis[1], then add the rotated offset
  # back to the point on the axis nearest the atom.
  m = scitbx.matrix.col((0.0, 0.0, 0.0))
  newOffset = m.rt_for_rotation_around_axis_through(
    point=scitbx.matrix.col(axis[1]), angle=degrees, deg=True) * rvec3(offset)
  return nearPoint + lvec3(newOffset)

The whole function is taking a long time, but the individual call to rotate_around_origin was not a large portion of it. Will need to figure out how to do more of the work in that one function and do fewer conversions.

The rt_for_rotation_around_axis_through() function is defined in Python rather than native code; it does some local math and then builds a rotation matrix.

russell-taylor commented 1 year ago

When we do the whole thing after position determination using:

# Build the rotation matrix from the axis direction, then apply the full
# rotation about the line through axis[0] directly to the atom position.
ctr = scitbx.matrix.col(axis[0])
r = scitbx.matrix.col(axis[1]).axis_and_angle_as_r3_rotation_matrix(angle=degrees, deg=True)
ret = scitbx.matrix.rt((r, ctr - r*ctr)) * scitbx.matrix.col(pos)

it takes 18 seconds rather than 21 and gets the same answer.

This function remains the bulk of the time for 1xso, with most of the time now taken by matrix multiplications and around 1/3 by the axis_and_angle function, which calls other Python functions. The matrix multiplications also look like Python functions, so it is not clear why he thought these would be native code. The scitbx/math/__init__.py also seems to be Python rather than C++, so it is not clear why the comment said it was native...

russell-taylor commented 1 year ago

The scitbx.matrix documentation says that the scitbx.math module provides faster C++ alternatives to some algorithms here, so we should check there next. Otherwise, maybe we need to drop this into C++? First check other optimizations and see if they are using time elsewhere.

The scitbx.matrix and scitbx.math libraries seem to be Python all the way down. Asked Nigel and Oleg whether there is a native-code replacement for this and if not where I should add it.

Nigel responded with: Check out cctbx_project/scitbx/math/rotate_around_axis.h. It looks like it is implementing generic rotation along with usage of sin/cos precalculated tables to make it even faster. I think this will be a good place for your function as well if for some reason existing ones do not work.

There is actually a rotate_points_around_axis() function there that loops over a flex array of points, which should be much faster for the cases where we're rotating a lot of points, but neither it nor the rotate_point_around_axis() from that header seems to be called from Python code, and I can't just use it as scitbx.math.rotate_point_around_axis().

Calling scitbx.matrix.rotate_point_around_axis() is about as fast as the function above. There is a comment in the file that defines it saying that the C++ code is slower because of the tuple-vec3 conversion overhead. When we call this function we end up doing lots of tuple conversion, which slows us down because it takes and returns tuples and we actually want vectors. It also looks like this function is not mapped in scitbx_math_ext.
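For reference, roughly what that call looks like from Python. This is a sketch; the keyword-argument names and the deg flag are from memory rather than checked against the scitbx source, but it shows why tuples flow in and out of the call:

  from scitbx import matrix

  # Rotate a point 90 degrees around the z axis through the origin. The axis
  # points and the rotated point are all plain 3-tuples, which is where the
  # tuple-conversion overhead described above comes from.
  new_xyz = matrix.rotate_point_around_axis(
    axis_point_1=(0.0, 0.0, 0.0),
    axis_point_2=(0.0, 0.0, 1.0),
    point=(1.0, 0.0, 0.0),
    angle=90.0,
    deg=True)
  print(new_xyz)  # a plain 3-tuple result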

(done) We're also subtracting and normalizing the second axis point, adding it back in, and undoing it, and then the function we call does it all over again, so we can speed things up by changing the parameters to the function (but this will make us do extra work in the 3-point docking code and the test code, which actually has a directional axis).

Wrote a C++ function that adjusts the parameters and calls the templated C++ function in scitbx/math.
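As a small Python illustration of the redundancy described above (the names here are made up; the real fix is the C++ wrapper just mentioned), the axis can be converted to a point-plus-unit-direction form once, up front, and reused for every atom that gets rotated:

  import math

  def point_and_unit_direction(p0, p1):
    # Convert a two-point axis into (point, unit direction) a single time so
    # that per-atom rotation calls do not re-subtract and re-normalize it.
    d = [b - a for a, b in zip(p0, p1)]
    length = math.sqrt(sum(x * x for x in d))
    return p0, [x / length for x in d]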

Improvements through 8/25/2023 brought this down to 15 seconds of wall-clock time when run from the command line.

Improvements through 8/31/2023 (a custom rotator C++ function wrapped just for this rotator) brought this down to 13 seconds of wall-clock time from the command line.

Improvements through 9/4/2023 (the custom rotator C++ function wrapped just for this rotator) kept this at 13 seconds of wall-clock time from the command line.

russell-taylor commented 1 year ago

(done) 4fen

Modified _BruteForceOptimizer._optimizeCliqueCoarse() and _CliqueOptimizer._optimizeCliqueCoarse() to not set a Mover state if it is already in that state. This did not seem to improve the fractions. Modified the FastOptimizer to track and report the number of cached vs. calculated atom scores.

This dropped the time from around 5 minutes working time to around 3 minutes for 4fen. It had a calculated/total ratio of about 30%.
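For context, a minimal sketch of the kind of per-atom score caching being counted here, keyed on the atom and the positions of the Movers that can move it (illustrative only; the real FastOptimizer differs in detail):

  class CachingAtomScorer(object):
    # Cache per-atom scores keyed on the atom's i_seq plus a hashable
    # description of the relevant Mover positions, counting hits and misses.
    def __init__(self, scoreAtomFn):
      self._scoreAtomFn = scoreAtomFn   # the expensive, non-cached scoring call
      self._cache = {}
      self.numCalculated = 0
      self.numCached = 0

    def scoreAtom(self, atom, stateKey):
      # stateKey must change whenever any Mover that can move this atom
      # changes state; otherwise a stale score would be returned.
      key = (atom.i_seq, stateKey)
      if key in self._cache:
        self.numCached += 1
      else:
        self.numCalculated += 1
        self._cache[key] = self._scoreAtomFn(atom)
      return self._cache[key]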

(done) Why do the timings still not show large counts for the optimization, even though it took up the vast majority of the time? And why does the calculation take more like 15 wall-clock minutes? Did things stop early? Running again got answers closer to the initial run.

(done) Is it getting the same answer? Yes.

Running the original way: wall clock time was 13 minutes and 44 seconds. User+sys time was 120 seconds. The resulting optimization has the hydrogens in the same locations, at least for the NCOs. The new approach takes 12 minutes and 21 seconds, so it is a bit faster and we'll leave it in.

(done) Less than 2% of the time was spent in _PlaceMovers for this run. 40% of the entire time is spent in the _SingletonOptimizer._scoreAtom() function, which is the vast bulk of the time spent in FastOptimizer._scoreAtom(), so the caching is not taking a lot of time. 20% of the time is spent in each of BruteForceOptimizer._optimizeCliqueCoarse() and _CliqueOptimizer._optimizeCliqueCoarse(), accounting for another 40% of the total time. A total of 93% of the time was spent in _optimizeCliqueCoarse().

After moving the InteractionGraph calculations into C++ and using i_seq dictionary lookups, and after switching the PositionReturn type to C++: 95.5% of the atom scores were cached, 79% of the time was spent in the non-cached _scoreAtom call, and 6.5% in the cached one. The next-largest functions were _scorePosition (2.5%) and _setMoverState (2%). Total wall-clock time dropped to 7 minutes, 49 seconds. The bulk of that time is spent optimizing the size-7 and size-6 cliques, which contain irreducible size-5 and size-6 cliques.

(done) Why are the size-6 cliques irreducible? These are the NCOs. The NH3 hydrogens for opposite-side Movers should not overlap, but we may have the two far nitrogens overlapping when we have a large enough probe radius. The problem was a mistaken attempt to reduce the maximum-clique-size test. After fixing this, the wall-clock time is 7 minutes, 22 seconds and the fraction of cached scores was 66%; 37% of the time was in the non-cached _scoreAtom and 1% in the cached one. 27% was in the optimized _optimizeCliqueCoarse and 27% in the brute-force _optimizeCliqueCoarse (most of its extra work being in the function body).

When we make the score_dots() C++ function return a valid result with all 0's, the size-4 cut-graph calls dominate the run time, followed by the size-3 cut-graph calls. Command-line wall-clock time is 4 minutes, 58 seconds, and profiling shows the recursive and brute-force _optimizeCliqueCoarse sharing the bulk of the run time, with 43% in the function body of the optimized version and 43% in the body of the brute-force version. This is because we're basically terminating the recursion at clique size 1 almost all of the time.

(done) When we implement the brute-force and vertex-cut clique optimization in C++, we get a speedup of about 2.5X, with a total run time of 3 minutes, 6 seconds. (This code is still using some Python data structures internally, but the timing shows this is fairly insignificant.) 34% of the atom interactions are calculated, so caching is not helping us as much for this model, with 83% of the total run time spent in _scoreAtom. Only 1% of the time is spent exclusively in _optimizeCliqueCoarse (less than the inclusive time for adding hydrogens, reinterpreting the model, or building the interaction graph, and much less than rotatable_hd_selection). Around 2% is spent in the caching _scoreAtom and 83% in the main one.
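For reference, a Python sketch of the brute-force coarse clique search described above (the production version is the C++ code just mentioned, and these callback names are placeholders, not the real API):

  import itertools

  def brute_force_coarse(movers, num_coarse_positions, set_coarse_state, score_clique):
    # Try every combination of coarse positions for the Movers in one clique
    # and keep the best-scoring combination. The cost grows as the product of
    # the per-Mover position counts, which is why large cliques dominate.
    best_score = None
    best_states = None
    for states in itertools.product(
        *(range(num_coarse_positions(m)) for m in movers)):
      for m, s in zip(movers, states):
        set_coarse_state(m, s)          # move this Mover's atoms to position s
      score = score_clique(movers)      # sum of scores over the affected atoms
      if best_score is None or score > best_score:
        best_score, best_states = score, states
    return best_states, best_score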

(done) Removing the bonded heavy-atom neighbor from rotatable hydrogens (it had been added in reduce2 and is not in reduce) got the speed down to 56 seconds, but we realized that these must be put back in. Improvements through 8/25/2023 made this take 3 minutes, 3 seconds of wall-clock time when run from the command line.

Improvements made through 8/30/2023 made this take 2 minutes, 11 seconds of wall-clock time when run from the command line, with almost all of the time spent optimizing cliques.

Improvements made through 9/4/2023 made this take 2 minutes, 8 seconds of wall-clock time when run from the command line, with almost all of the time spent optimizing cliques.

Improvements made through 9/6/2023 made this take 2 minutes, 4 seconds of wall-clock time when run from the command line, with almost all of the time spent optimizing cliques.

Improvements made through 9/8/2023 made this take 48 seconds (31 seconds optimizing cliques).

Improvements made through 9/11/2023 made this take 30 seconds (22 seconds optimizing cliques, mostly in the Python code).

russell-taylor commented 1 year ago

(done) Model 1rrr takes 54 seconds wall-clock, so we're using it to try to optimize for medium-sized models. It took 70 seconds when run in the profiler.

There are twenty single-hydrogen rotators, all singletons.

From Dorothee: I added checks for two select() calls. With this, 1RRR runs 5s faster for me. Unfortunately, it won't speed up things for 3j4p, as this example requires the two select() calls. It may be possible to consolidate two of the select calls, but it needs testing. Will keep you posted.

This dropped 1rrr runtime to 39 seconds on the command line.

Improvements through 8/25/2023 made this take 35 seconds of wall-clock time when run from the command line.

Improvements made through 8/30/2023 also take 35 seconds of wall-clock time from the command line; the profiler output from these runs is not reproduced here.

Speeding up the determination of single-hydrogen rotators sped up the optimization and reduced the overall time to 18 seconds. We'll be removing _ReinterpretModel, at which point it should become even faster. Indeed, on 9/26/2023 we get 13.45 seconds without reinterpretation (11 seconds to add hydrogens, 2 to optimize: 0.13 to place on each model).

Model 3j4p also takes the bulk of its time in optimization, most of it in coarse clique optimization (mostly in _scorePosition, which is mostly in _scoreAtom). When we modify the mmtbx_probe_ext DotScorer::score_dots() function to return a valid result of all zeroes, the command-line time drops to 30 seconds from 104, and profiling shows 2% of the time in _scoreAtom (down from 50+%). There is still 8% in _scorePosition, so we may have some room there, but the vast bulk is in the actual C++ calculations.

russell-taylor commented 1 year ago

@todo: Removing the unrecognized ligands from 7k00 to see how long Reduce2 takes to run on a large model. Reduce takes around 20 seconds to optimize 7k00, but it does not have cliques that are as large.

7k00 takes a long time in hydrogen addition (presumably because it is so large). Slightly more than half of the time is in mmtbx.model.model.process, mostly in setting up restraints. 17% is in mmtbx.model.model.select (and another 9% calling this inside another function). Cryo-EM structures are usually bigger.

When the unrecognized ligands are taken out of 7k00, it takes more than 12 hours, with the vast majority of the time spent in clique optimization. Original Reduce had one clique of size 5, nine of size 4, and most of the rest of size 2, but Reduce2 had one of size 11, one of size 10, and twenty-five larger than 4. Reduce had 1302 Movers in cliques and 5068 singletons (6370 total Movers). Reduce2 had 1853 Movers in cliques and 4442 singletons (6295 total Movers). Reduce2 was unable to place a large number of hydrogens due to insufficient restraints, which could explain why there are fewer Movers.

Replaced the internal code of DotScorer::score_dots() with a simple return of 0 scores and a valid response, to remove all of the time spent in the C++ scoring code so that we can count just the Python time, including wrapping/unwrapping as part of the function calls. The clique solvers are taking several seconds for the larger-than-size-4 cliques, and even some of the size-3 and size-4 ones are a bit slow; one of the 7-cliques took many minutes. Running on 1xso verified that the optimization results are consistent with scoring not being done. This implies that for large cliques, the atom shuffling and Python bookkeeping are a significant portion of the solver time above and beyond the C++ code. Clique optimization for 1xso with the C++ scoring jumpered out took 0.094 wall-clock seconds; putting the code back in took 1.148 seconds, so the time for small cliques is dominated by the C++ code. For 4fen (several 6-7-Mover cliques), optimization time with jumpered-out C++ was 262.563 seconds and with all calculations was 428.386, indicating that the Python code is taking ~60% of the time for medium-sized cliques.

(done) Consider pulling the clique-optimizing code into C++. Pulling the PositionReturn class into C++ seemed to speed this up tremendously.

(done) Constructing the interaction graph takes minutes. This made it worth checking whether just the atom shuffling is slowing us down a lot (consistent with clique optimization being slow even without the C++ calculations being done). Moving _PairsOverlap into C++ and using seq_id rather than the atom as the dictionary lookup key made this much faster, on the order of the re-interpretation time and faster than the hydrogen-placement time.

With optimizations made through 8/11/2023, the total time dropped to around 81 minutes (down from 12+ hours). The fraction of atom scores that were calculated is 0.02 (98% cached). The functions taking the most time were not properly reported by the profiler; the vast majority is in clique optimization.

With optimizations made through 8/21/2023, the total time dropped to 34 minutes. The time is evenly split between selection of rotatable hydrogens, time to place Movers, and time to optimize cliques. The fraction of calculated atom interactions was 0.21.

With optimizations made through 8/25/2023 (including putting back the tests for heavy neighbors of rotating hydrogens), it completes in under 42 minutes. The fraction of calculated atom interactions was 0.21. Times in seconds: add hydrogens 194, interpret 132, optimize 2212.

With optimizations made through 8/30/2023 (including pulling objects, methods, and caches into C++ and caching the interacting dots per atom and position), it completes in under 34 minutes. Add hydrogens 184, interpret 132, optimize 1731. Another run's optimization: time to select rotatable hydrogens 580, place Movers 506, interaction graph 128, singletons 36, cliques 615, fine 13, fixup 4, total optimize 1916. With changes made through 8/31/2023, optimization was select rotatable hydrogens 408, place Movers 440, interaction graph 124, singletons 36, cliques 596, total 1655. (Speeding up the atom rotation did not seem to speed up Place Movers vs. select rotatable hydrogens.) Run on 9/4 after new opts: Add Hydrogen 196, Interpret 140, Optimize 1632 (select rotatable 403, place Movers 441, interaction graph 120, singletons 35, cliques 584, fine 12, fixup 5).

With optimizations made through 9/6/2023 (including faster single-hydrogen rotator finding), it completes in under 21 minutes: Add hydrogen 188, interpret 136, optimize 890 (select rotatable 0.2, place Movers 63, interaction graph 100, singletons 37, cliques 613, fine 12, fixup 5). The bulk of the time is in C++ Mover optimization. Fraction calculated is 0.29, so we're probably seeing a bottleneck inside score_dots(). Making that function return immediately dropped singleton time to 12 seconds and clique time to 20 seconds. Making the inner-loop check_dot() function return immediately dropped singleton time to 21 seconds and clique time to 124, so most of the time is in the inner function but a significant fraction is in the outer.
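Schematically, the structure being timed here is a tight two-level loop; the sketch below is only the shape of it in Python, not the actual mmtbx_probe C++ implementation, and it shows why the inner check dominates while the outer loop still carries a significant fraction:

  def score_dots(surface_dots, neighbors, check_dot):
    # Outer level: iterate over the precomputed dots on one atom's surface.
    total = 0.0
    for dot in surface_dots:
      # Inner level: classify this dot against nearby atoms (contact, clash,
      # hydrogen bond) and return its contribution to the score.
      total += check_dot(dot, neighbors)
    return total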

With optimizations made through 9/11/2023 (including putting more data into ExtraAtomInfo and using vector storage in ExtraAtomInfoMap), it completes in under 12 minutes: add hydrogen 191, interpret 135, optimize 352 (place Movers 61, interaction graph 97, singletons 13, cliques 140, fine 4, fixup 5).

We have the same results for 7k00 on 9/20/2023 after the Interaction Graph speedups. Wall-clock time under 11 minutes: add hydrogen 190, interpret 125, optimize 262 (place Movers 61, interaction graph 31, singletons 12, cliques 114, fine 3, fixup 5).

On 9/21/2023 after making the AtomMoverLists C++ class, wall-clock time was around 10 minutes: add hydrogen 200, interpret 135, optimize 242 (place Movers 62, interaction graph 2.8, singletons 13, cliques 119, fine 3, fixup 6).

On 9/26/2023 after removing reinterpretation, wall-clock time of 7 minutes, 30 seconds: add hydrogen 197, optimize 249 (place Movers 65, interaction graph 2.9, singletons 12, cliques 120, fine 3, fixup 5).

On 10/2/2023 after making much sparser single-hydrogen orientation checks, wall-clock time is 5 minutes, 24 seconds: add hydrogen 197, optimize 99 (place Movers 42, interaction graph 0.3, singletons 4, cliques 5.5, fine 4, fixup 5).

russell-taylor commented 1 year ago

(done) Both 7k00 and 1zz0 have much larger clique sizes for Reduce2 than Reduce, even when there are fewer total Movers.

russell-taylor commented 1 year ago

(nope) Try only creating the fine positions when they are asked for in the MoverRotator, rather than pre-computing all of them.

russell-taylor commented 1 year ago

(nope) Dropping the coarse resolution for NH3 rotators back to the original (30 vs. 15) will speed up cliques, but will be less precise -- there were differences between Reduce2 and Reduce at Reduce's original stride. Single-hydrogen rotators have a value of 10 in both Reduce and Reduce2.

Switching this for 1zz0 changed the time from 110 seconds total (76.4 opt) to 110 (77.2 opt), so we're not seeing an impact on speed for this molecule.

russell-taylor commented 12 months ago

(done) Pass references rather than copies into InteractionGraph.h routine.

russell-taylor commented 12 months ago

(done) Can we find a faster way to locate rotatable hydrogens for our single-hydrogen rotators?

russell-taylor commented 12 months ago

(done) Single-hydrogen rotators go all the way around rather than only pointing at potential acceptors and one other location; the original Reduce did not include the full coarse set.

1yk4 is a good test model for single-hydrogen optimization. It took under 43 seconds of wall-clock time when I ran on this file. Almost all of that time was optimization, and almost all of that was clique optimization. There is a single 3-way clique of single-hydrogen rotators in each of 3 alternate conformations:

  Set of 3 Movers:  Totals: initial score 7.68, final score 17.35
   SingleHydrogenRotator at chain A CYS 6 HG Initial score: 0.64 final score: 4.45 pose Angle 135.0 deg . .
   SingleHydrogenRotator at chain A CYS 39 HG Initial score: 3.35 final score: 7.43 pose Angle 131.0 deg . .
   SingleHydrogenRotator at chain A CYS 42 HG Initial score: 3.69 final score: 5.47 pose Angle -37.0 deg . .

(done) Consider looking for the coarse orientation that has the largest gap to the nearest atom at that orientation (an estimate of the least-clashing angle) and that is not near an acceptor. Fine rotations will happen around this point, out to half of the coarse step size in each direction.

(nope) Consider increasing the coarse spacing to 60 degrees (up from 10) on the assumption that it will probably find a good non-colliding direction that way while still always finding a close acceptor when needed because of the explicit search. The fine search will cover a range of +/-30 degrees once the coarse position is chosen (including if an acceptor angle is chosen).

(done) Rerun new_reduce_regression script to ensure we're getting good results.

Modified the code to pick the orientation that has the best contact (closest to just touching) with any nearby atom, even an Acceptor, rather than picking the orientation that is furthest from contact. This improves the scores compared to the furthest-contact approach, and made some of them better than Reduce scores, but some are still worse.
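The selection rule amounts to a one-liner; in the sketch below, gap_at(angle) is a hypothetical helper returning the signed gap to the nearest non-bonded atom at a given coarse angle:

  def best_contact_angle(coarse_angles, gap_at):
    # Prefer the orientation whose nearest neighbor is closest to just
    # touching (gap nearest zero) rather than the one farthest from contact.
    return min(coarse_angles, key=lambda a: abs(gap_at(a)))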

(done) Run a comparison using 1xso and perhaps other models against original Reduce to check Mover-by-Mover behavior and find out which are worse. (done) How can this be, even though the Mover-by-Mover comparison shows better summed scores for reduce2 compared to the output saved by running original reduce on my laptop (some are worse by up to 1 part in 6, some better by this much or more)? The new_reduce_regression script is checking for 4-long bonded-neighbor chains in all cases (as expected). Rerunning to compare against the output PDB generated by that script to see if the same thing happens shows that it does. It looks like the improvements made by the Mover hydrogens are offset by the better C and N contacts from the original model, which must be from non-rotatable hydrogens? However, the hydrogen bond lengths are the same for the old and new models and there are only very slight shifts in angles between them for the non-rotatable hydrogens. The C and N contacts and overlaps are pretty much the same for both new approaches (total check and single check), with the O contacts being better for both new approaches (and even better for total check). The differences are about 1 part in 100 of the total scores, so they are very close to each other in all cases.

Spot check of single rotator OG1 THR A 98 shows better contacts with reduce2. (There are several amide flips in different orientations and some NH3 groups at different orientations.)

(done) Consider adding more tests at the original coarse orientations, but only when the new orientation is at least N degrees away from a previous choice (including best touch and any potential acceptors and already-added ones). This would only add more tests to cover additional areas and may not take too much extra time.

russell-taylor commented 12 months ago

(done) Calls to score_dots(), with their own calls to check_dot(), are taking the bulk of the time in 7k00, with check_dot() taking the larger fraction. Looking for a model system that runs faster to try to optimize this, so using 1zz0 (which takes 96 seconds total, with initially 1.5 seconds for singletons and 21 seconds for cliques).

With the above changes, 4fen goes from 2 minutes 4 seconds wall-clock time (mostly optimizing cliques) to 48 seconds (31 seconds optimizing cliques, of which 2 is Python) while getting (as expected) the same answer.

With the above changes, 7k00_trim goes from under 21 minutes wall-clock time (613 optimizing cliques) to 14 minutes 17 seconds (257 optimizing cliques).

russell-taylor commented 12 months ago

Much of the remaining time in coarse optimization is in check_dot() when it calls getMappingFor() on the atoms to get their extra atom info.

(done) Consider whether the problem is that we're copying the structures out rather than passing references.

(done) Consider storing this mapping internally in a vector indexed by i_seq rather than in a map keyed by the atom's raw pointer, to speed up the lookup. See if we still need to keep the pointers to the atom data in that case; I think not.

The coarse-optimization code takes 21-22 seconds on 4fen when using a vector rather than a map.
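An illustrative Python version of the map-to-vector change (the real ExtraAtomInfoMap is C++; the class and helper names here are made up for illustration):

  class ExtraAtomInfoByISeq(object):
    def __init__(self, atoms, extra_info_for):
      # Flat list indexed by i_seq, so each lookup is a constant-time array
      # access with no hashing of atom pointers.
      self._info = [None] * (max(a.i_seq for a in atoms) + 1)
      for a in atoms:
        self._info[a.i_seq] = extra_info_for(a)

    def getMappingFor(self, atom):
      return self._info[atom.i_seq]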

russell-taylor commented 11 months ago

(done) Further speeding up of the interaction graph, using 1zz0 as a fast optimization case.

We're calling CoarsePositions() and FinePositions() on each Mover, so we could do this in Python and call C++ code that receives const references to af::shared<PositionReturn> vectors. Both methods return a graph, and one also returns a dictionary of sets of interacting Movers looked up by atom.

Do we want to pull the calculation of the ranges into C++?

After pulling the contents of the AABB construction into C++, we get the same answers and it takes 1.0 seconds to compute the interaction graph for 1zz0. This is the expected reduction if we took almost all of the time out of running the function.
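The heart of the AABB approach is a cheap box-overlap test; a hedged Python sketch follows (the real code builds each Mover's box over all of its possible atom positions, padded by the interaction distance):

  def boxes_overlap(box_a, box_b):
    # Each box is ((xmin, ymin, zmin), (xmax, ymax, zmax)). Two Movers can
    # only interact if their padded bounding boxes overlap on every axis.
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    return all(a_lo[i] <= b_hi[i] and b_lo[i] <= a_hi[i] for i in range(3))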

Pulling the lookup codes from InteractionGraphAllPairs() into C++ didn't change the runtime, indicating that the bulk of that function's work is already being done in C++.

Making the AtomMoverLists into a C++ data structure got the time to compute the interaction graph down to 0.1 seconds.

russell-taylor commented 10 months ago

@todo: Very high memory use and long run time. Tom gave us a file that he was trying to run refinement on; it has 142,140 Movers. With Reduce, this file takes a few minutes and uses around 4GB of RAM. With the hydrogen-placement part of Reduce2 it uses more than 180GB of peak RAM; in total it runs for 20.5 hours (times in seconds: 10,978 to add hydrogens, 61,898 to optimize: 44,963 place Movers, 5,663 dump atom info to string, 1,248 optimize singletons, 1,413 optimize cliques, 2,348 fixup), at least partly due to thrashing on a 32GB machine. The input file is at https://drive.google.com/file/d/1jRCUXHfC25kMbRM7MrK1gkNArGGKW7Cg/view?usp=sharing and is 250MB in size.

Running on the single instance of 3kz4 shows 69 seconds to add hydrogens and 48 to optimize (33 place Movers, 3 for excluded atoms, 3 in Helpers.writeAtomInfoToString [dropped to 2.3 by reorganizing], 1 opt singletons, 1.4 opt cliques, 0.7 fine, 1.4 fixup); we’ll use it to try and find the hot paths.

russell-taylor commented 9 months ago

Speeding up _PlaceMovers:

russell-taylor commented 9 months ago

(done) Allow the command line to disable the generation of the atom-dump string, which is taking a large fraction of the total run time for this file.

Removing this drops the run time by 94.5 minutes.

russell-taylor commented 9 months ago

(done) Re-running Tom's file after the above changes resulted in 30,758 seconds to add hydrogens and 9,908 seconds to optimize (4298 place Movers, 1141 singletons, 1300 cliques, 2446 fixup). The profiler's hot-function list is not reproduced here.

The call tree has most of the time in reduce_hydrogen.run.

(nope) Replacing the Rotate*DegreesAroundAxisDir() calls with the fast sin/cos routines makes 3kz4 Mover placement take 17.8 seconds; when using plain sin/cos it takes 16.7, so we're not winning by swapping out this call. Inlining the setup function and computing parameters at compile time made the speed 16.3, so not worth the trouble. Switching to an approach that precomputes everything, including the references, and uses static double arrays gets it to 15.9 seconds, so still probably not worth the trouble; in fact, the variation seems to be in the noise. Calling the rotate_point_around_axis() function twice to make sure it is what is taking the time: nope, it looks like it took 17.559 seconds, so it is not the issue? Running this again inside the Python profiler to check showed 6.88% in the hot path with NH3 rotators when called once and 8.23% when called twice; wall-clock time goes from 16.9 to 18.9 seconds (about 1/8th more). On the other hand, returning just the atom coordinates without transforming takes 16.7 seconds (about the same), so it looks like the rotation function itself is not taking the time.
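For reference, a Python sketch of the precomputed sin/cos table idea being tested here (the actual experiment above was in the C++ placement code; the one-degree step size is an assumption):

  import math

  STEP_DEGREES = 1.0  # assumed table granularity
  _COS = [math.cos(math.radians(i * STEP_DEGREES)) for i in range(int(round(360 / STEP_DEGREES)))]
  _SIN = [math.sin(math.radians(i * STEP_DEGREES)) for i in range(int(round(360 / STEP_DEGREES)))]

  def fast_cos_sin(degrees):
    # Round to the nearest table entry; adequate when rotations only ever
    # occur at fixed angular steps, which is what the table approach assumes.
    i = int(round(degrees / STEP_DEGREES)) % len(_COS)
    return _COS[i], _SIN[i]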

(done) Consider writing _posesFor() in C++. Replacing the body of _posesFor() with a call to a C++ FindPosesFor() function gives the same answer on 1xso. For 3kz4 it takes 6.576 seconds to place Movers the new way and 16.227 the original way; the resulting PDB file is the same in both cases. Running inside the performance profiler shows that with the original approach, 9% of the total time during optimization is spent in _posesFor; with the C++ function it was 0.87%.

russell-taylor commented 3 months ago

Looking into speeding up mmtbx.conformation_dependent_library.mcl.update: