Factor of 2x speed up increase

moeyensj commented 2 years ago

On-going work to address #17.

stevenstetzler commented 2 years ago

I've added several new functions that implement a "vectorized" optimization i.e. functions are evaluated using arrays of ra/dec/mjd instead of scalars. These new functions are:

distance_opt in frame_db.py
radec_to_thetaphi_opt and radec_to_healpixel_opt in healpix_geom.py (I defined these for numpy typing reasons, but don't use them since their non-opt counterparts work find with array inputs, so they can likely be removed)
compute_ephemeris_opt and approximately_propagate_opt in orbit.py
precover_opt, _check_windows_opt, _check_window_opt, _check_frames_opt and _check_frames_opt_2 in precovery_db.py
propagate_linearly_opt in spherical_geom.py

I've added an --opt argument to scripts/precovery-test.py that will use precovery_opt instead of precovery to make before/after comparisons easier. Timing results can be compared with:

$ source scripts/profile.sh # before opt
$ source scripts/profile.sh --opt # after opt

There were a few levels of optimization, so I'll describe them in order I implemented. The code as written does something like:

precover
- calls _check_window(mjd, obscode, orbit, tolerance)
  - calls compute_ephemeris(obscode, mjd)
  - calls radec_to_healpixel(ra, dec, nside)
  - calls approximately_propagate(obscode, orbit, timedelta)
  - calls radec_to_healpixel(approx_ra, approx_dec, nside)
  - calls _check_frames(orbit, healpix, obscode, mjd, tolerance)
    - calls compute_ephemeris(obscode, mjd)
    - calls obs.distance(ephem)

Which results in ~100k calls to compute_ephemeris/radec_to_healpixel/distance.

First, optimizing ephemeris computation:

compute_ephemeris_opt: the function is changed to accept an array of epochs and passes those to pyoorb as an array. an list of ephemerides are returned instead of a single one.
approximately_propagate_opt: the function is changed to accept an array of epochs. The check for timedelta small is converted to an np.where, which determines if linear or 2-body motion is used. FYI: the current check as implemented is np.abs(timedeltas) <= -1.0 based on the previous abs(timedelta) <= -1.0. Note that this never occurs, so linear motion is never used to approximately propagate. Which might not be the desired behavior?

Most of the ephermeris computations were happening in _check_frames, so I started there and made the changes:

_check_window calls _check_frames with single mjd and healpixel, so I produced _check_frames_opt which accepts an array of mjds and healpixels. The changes are (roughly)

# _check_frames_opt(orbit, healpixels, obscode, mjds, tolerance)
exact_ephems = orbit.compute_ephemeris_opt(obscode, mjds)
for exact_ephem, mjd, healpix in zip(exact_ephems, mjds, healpixels):
frames = self.frames.idx.get_frames(obscode, mjd, healpix)
for f in frames:

instead of:

# _check_frames_opt(orbit, healpixel, obscode, mjd, tolerance)
exact_ephem = orbit.compute_ephemeris(obscode, mjd)
frames = self.frames.idx.get_frames(obscode, mjd, healpix)
for f in frames:

this requires a corresponding change in _check_window, producing _check_window_opt, where the mjd and healpixel values are stored in a list that is passed to _check_frames_opt:

# _check_window_opt(window_midpoint, obscode, orbit, tolerance)
timedeltas = []
test_mjds = []
test_healpixels = []
for mjd, healpixels in self.frames.idx.propagation_targets(
start_mjd, end_mjd, obscode
):
logger.debug("mjd=%.6f:\thealpixels with data: %r", mjd, healpixels)
timedelta = mjd - window_midpoint
timedeltas.append(timedelta)
test_mjds.append(mjd)
test_healpixels.append(healpixels)

the timedeltas are passed to approximately_propagate_opt as an array:

# _check_window_opt(window_midpoint, obscode, orbit, tolerance)
approx_ras, approx_decs = window_ephem.approximately_propagate_opt(
obscode,
orbit,
timedeltas,
)

and the ra/dec to healpixel computation uses the array as well:

# _check_window_opt(window_midpoint, obscode, orbit, tolerance)
approx_healpixels = radec_to_healpixel(
approx_ras, approx_decs, self.frames.healpix_nside
).astype(int)

Finally mjd/healpixel values are filtered per _check_window

# _check_window_opt(window_midpoint, obscode, orbit, tolerance)
keep_mjds = []
keep_approx_healpixels = []
for mjd, healpixels, approx_ra, approx_dec, approx_healpix in zip(
test_mjds, test_healpixels, approx_ras, approx_decs, approx_healpixels
):
logger.debug("mjd=%.6f:\thealpixels with data: %r", mjd, healpixels)
logger.debug(
    "mjd=%.6f:\tephemeris at ra=%.3f\tdec=%.3f\thealpix=%d",
    mjd,
    approx_ra,
    approx_dec,
    approx_healpix,
)

if approx_healpix not in healpixels:
    # No exposures anywhere near the ephem, so move on.
    continue
logger.debug("mjd=%.6f: healpixel collision, checking frames", mjd)
keep_mjds.append(mjd)
keep_approx_healpixels.append(approx_healpix)

and then passed to _check_frames_opt:

# _check_window_opt(window_midpoint, obscode, orbit, tolerance)
if len(keep_mjds) > 0:
matches = self._check_frames_opt(
    orbit, keep_approx_healpixels, obscode, keep_mjds, tolerance
)
for m in matches:
    yield m

In actuality, the code uses _check_frames_opt_2 which vectorizes distance computation as well by grabbing all observations and using the haversine_distance_deg function instead of obs.distance to compute the distance between all observations and an orbit ephemeris:

# _check_frames_opt_2(orbit, healpixel, obscode, mjd, tolerance)
for f in frames:
    logger.info("checking frame: %s", f)
    obs = np.array(list(self.frames.iterate_observations(f)))
    n = len(obs)
    obs_ras = np.array([o.ra for o in obs])
    obs_decs = np.array([o.dec for o in obs])
    distances = haversine_distance_deg(
        exact_ephem.ra,
        obs_ras,
        exact_ephem.dec,
        obs_decs,
    )
    dras = exact_ephem.ra - obs_ras
    ddecs = exact_ephem.dec - obs_decs
    # filter to observations with distance below tolerance
    idx = distances < tolerance
    distances = distances[idx]
    dras = dras[idx]
    ddecs = ddecs[idx]
    obs = obs[idx]
    for o, distance, dra, ddec in zip(obs, distances, dras, ddecs):
        candidate = PrecoveryCandidate(
            ra=o.ra,
            dec=o.dec,
            ra_sigma=o.ra_sigma,
            dec_sigma=o.dec_sigma,
            mag=o.mag,
            mag_sigma=o.mag_sigma,
            filter=f.filter,
            obscode=f.obscode,
            mjd=f.mjd,
            catalog_id=f.catalog_id,
            id=o.id.decode(),
            dra=dra,
            ddec=ddec,
            distance=distance,
        )
        yield candidate

which uses numpy indexing to determine for which observations distance < tolerance.

Finally, the _check_window_opt function is changed to _check_windows_opt, which modifies _check_window_opt to accept and array of window midpoints and use compute_ephemeris_opt instead of compute_ephemeris to propagate the orbit to window midpoint:

# _check_windows_opt(window_midpoints, obscode, orbit, tolerance)
window_ephems = orbit.compute_ephemeris_opt(obscode, window_midpoints)
window_healpixels = radec_to_healpixel(
    np.array([w.ra for w in window_ephems]),
    np.array([w.dec for w in window_ephems]),
    self.frames.healpix_nside,
).astype(int)

for window_midpoint, window_ephem, window_healpixel in zip(
    window_midpoints, window_ephems, window_healpixels
):
    ...

This also requires a change precover_opt to account for the fact that we can only propagate times that have the same corresponding obscode. To solve this I do a groupby:

# precover_opt(orbit, tolerance, max_matches, start_mjd, end_mjd)
windows = self.frames.idx.window_centers(start_mjd, end_mjd, self.window_size)
for obscode, obs_windows in itertools.groupby(
    windows, key=lambda pair: pair[1]
):
    mjds = [window[0] for window in obs_windows]
    matches = self._check_windows_opt(mjds, obscode, orbit, tolerance)

As written, the code structure is now:

precover_opt
- calls _check_windows_opt(mjds, obscode, orbit, tolerance)
  - calls compute_ephemeris_opt(obscode, window_midpoints)
  - calls radec_to_healpixel(ras, decs, nside)
  - calls approximately_propagate_opt(timdeltas)
  - calls radec_to_healpixel(approx_ras, approx_decs)
  - calls _check_frames_opt_2(orbit, keep_approx_healpixels, obscode, keep_mjds, tolerance)
    - calls compute_ephemeris_opt(obscode, mjds)
    - calls haversine_distance_deg(exact_ephem.ra, obs_ras, exact_ephem.dec, obs_decs)

and calls compute_ephemeris_opt 683 times.

Profiling results with the current code are:

Found 27 potential matches for orbit ID: 330784
    orbit_id                   id  ...          ddec  distance
0     330784     c4d.230114.7.517  ...  5.349504e-05  0.000084
1     330784    c4d.230826.13.872  ...  6.738136e-05  0.000091
2     330784   c4d.231472.47.1005  ...  6.280554e-05  0.000089
3     330784    c4d.232238.26.124  ...  2.968479e-05  0.000030
4     330784    c4d.232344.26.468  ...  5.881602e-05  0.000076
5     330784    c4d.232345.26.457  ...  6.842665e-05  0.000087
6     330784   c4d.232751.48.1401  ...  6.014635e-05  0.000079
7     330784   c4d.232752.48.1507  ...  6.366128e-05  0.000080
8     330784    c4d.232951.46.266  ... -4.628035e-05  0.000083
9     330784     c4d.233015.30.38  ...  3.616820e-05  0.000042
10    330784     c4d.233134.30.94  ...  5.665683e-05  0.000065
11    330784    c4d.233364.51.272  ...  3.803622e-05  0.000043
12    330784    c4d.233365.51.226  ...  4.792459e-05  0.000048
13    330784    c4d.233473.9.1618  ...  6.756319e-05  0.000080
14    330784    c4d.233475.9.1517  ...  6.549203e-05  0.000080
15    330784    c4d.237663.60.588  ...  5.767154e-05  0.000060
16    330784    c4d.237968.59.683  ...  4.574217e-05  0.000062
17    330784    c4d.238654.38.733  ...  4.623282e-05  0.000046
18    330784    c4d.238655.44.500  ...  1.607676e-04  0.000172
19    330784    c4d.238656.44.583  ...  1.769334e-04  0.000189
20    330784    c4d.238657.44.699  ...  1.943913e-04  0.000210
21    330784     c4d.420636.47.31  ... -3.123492e-05  0.000154
22    330784    c4d.420637.47.114  ... -4.783249e-05  0.000152
23    330784    c4d.500315.25.291  ...  2.380609e-05  0.000025
24    330784    c4d.500316.25.296  ... -6.973607e-07  0.000027
25    330784  c4d.777125.15.13724  ...  2.949515e-06  0.000127
26    330784   c4d.777435.15.9269  ...  4.504595e-05  0.000121

[27 rows x 15 columns]
         23941170 function calls (23777916 primitive calls) in 51.888 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      683   23.616    0.035   26.131    0.038 orbit.py:225(compute_ephemeris_opt)
  1543349    3.140    0.000    3.140    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
   524453    2.082    0.000    2.885    0.000 frame_db.py:529(iterate_observations)
  1954135    1.496    0.000    7.802    0.000 frame_db.py:220(<genexpr>)
      219    1.371    0.006   44.634    0.204 precovery_db.py:208(_check_windows_opt)
     4190    1.324    0.000    1.324    0.000 {built-in method numpy.array}
  1543347    1.216    0.000    5.995    0.000 result.py:381(iterrows)
      667    1.041    0.002    1.041    0.002 {method 'execute' of 'sqlite3.Cursor' objects}
    36932    0.999    0.000    1.546    0.000 inspect.py:744(cleandoc)
   412778    0.874    0.000    0.874    0.000 orbit.py:263(__init__)
  1543347    0.864    0.000    4.779    0.000 cursor.py:1791(_fetchiter_impl)
  1543348    0.769    0.000    3.915    0.000 cursor.py:953(fetchone)
   412591    0.743    0.000    8.962    0.000 frame_db.py:191(propagation_targets)

stevenstetzler commented 2 years ago

Based on my explanation and review of the code, the distance_opt method of an Observation is not used in this implementation and instead the haversine_distance_deg function is used with the ras/decs of many observations against a single orbit ephemeris.

stevenstetzler commented 2 years ago

Additionally, I made a script scripts/sq-execute-test.py and scripts/profile-sql.sh that tests whether many small SQL queries are slower that one large one. The answer is they are roughly the same. The comparison is:

for i in range(len(mjds) - 1):
    stmt = sq.select(
        db.frames.idx.frames.c.mjd, db.frames.idx.frames.c.obscode
    ).where(
        db.frames.idx.frames.c.mjd <= mjds[i + 1],
        db.frames.idx.frames.c.mjd >= mjds[i],
    )
    rows = db.frames.idx.dbconn.execute(stmt)
    for row in rows:
       ...

vs constructing a big 'or' where clause:

stmt = sq.select(
    db.frames.idx.frames.c.mjd, db.frames.idx.frames.c.obscode
).where(
    or_(
        *[
            and_(
                db.frames.idx.frames.c.mjd <= mjds[i + 1],
                db.frames.idx.frames.c.mjd >= mjds[i],
            )
            for i in range(len(mjds) - 1)
        ]
    )
)
rows = db.frames.idx.dbconn.execute(stmt).fetchall()

The profiling:

$ source profile-sql.sh --sequential
99 queries got 86682 rows in 0.6292250156402588s
         5071008 function calls (5011474 primitive calls) in 5.197 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    36932    0.435    0.000    0.671    0.000 inspect.py:744(cleandoc)
        1    0.306    0.306    5.203    5.203 sq-execute-test.py:1(<module>)
  184/181    0.224    0.001    0.229    0.001 {built-in method _imp.create_dynamic}
     1496    0.212    0.000    0.212    0.000 {built-in method marshal.loads}
 4028/498    0.139    0.000    0.381    0.001 sre_parse.py:493(_parse)
    20643    0.129    0.000    0.132    0.000 sre_parse.py:172(append)
    86782    0.112    0.000    0.112    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}

fetchone is called 86782 in 0.112s. In comparison:

$ source profile-sql.sh
1 queries got 86682 rows in 0.6342263221740723s
         4639521 function calls (4578209 primitive calls) in 6.157 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    36932    0.532    0.000    0.821    0.000 inspect.py:744(cleandoc)
     1496    0.340    0.000    0.340    0.000 {built-in method marshal.loads}
        2    0.274    0.137    0.274    0.137 {method 'fetchall' of 'sqlite3.Cursor' objects}

fetchall is called twice (why not once?) in 0.274s. Maybe the difference is larger once the number of queries gets big, but it doesn't seem it will make a large difference in runtime for ~100-1000 queries, which I think in effect is the maximum "pooling" size we could expect given the window/frame code structure.

moeyensj commented 2 years ago

Note that this never occurs, so linear motion is never used to approximately propagate. Which might not be the desired behavior?

This was disabled since we found that linear propagation didn't work as well as intended. Two-body is thankfully fast enough as an approximate method.

moeyensj commented 2 years ago

@stevenstetzler Thanks for the fantastic work, I've corrected a few minor issues and accepted all the optimized functions as the new default in precovery.

B612-Asteroid-Institute / precovery

Factor of 2x speed up increase #18