B612-Asteroid-Institute / precovery

Fast precovery of small body observations at scale
BSD 3-Clause "New" or "Revised" License
5 stars 2 forks source link

Proposal: Switch to non-dataframe tupled results #85

Closed akoumjian closed 1 year ago

akoumjian commented 1 year ago

This PR switches the primary interface for precovery.precover from returning a pd.DataFrame of unioned precovery candidates and frame candidates into two separate return values. The new format takes the form Tuple[List[PrecoveryCandidate, FrameCandidate]]. The second major change eliminates the include_frame_candidates option, opting instead to always return frame candidates.

Old

>>> df = precovery.precover(...)
>>> df.columns
Index(['mjd', 'ra_deg', 'dec_deg', 'ra_sigma_arcsec', 'dec_sigma_arcsec',
       'mag', 'mag_sigma', 'filter', 'obscode', 'exposure_id',
       'exposure_mjd_start', 'exposure_mjd_mid', 'exposure_duration',
       'observation_id', 'healpix_id', 'pred_ra_deg', 'pred_dec_deg',
       'pred_vra_degpday', 'pred_vdec_degpday', 'delta_ra_arcsec',
       'delta_dec_arcsec', 'distance_arcsec', 'dataset_id', 'orbit_id'],
      dtype='object')

New

>>> precovery_candidates, frame_candidates = precovery.precover()
>>> print(precovery_candidates)
[PrecoveryCandidate(mjd=50000.0, ra_deg=205.88903218124665, dec_deg=-15.381163692732548, ra_sigma_arcsec=10800.0, dec_sigma_arcsec=14400.0, mag=5.0, mag_sigma=6.0, exposure_mjd_start=50000.0, exposure_mjd_mid=50000.0, filter='filter', obscode='I41', exposure_id='exposure', exposure_duration=30.0, observation_id='bjewnltujtgbffge', healpix_id=6715579247, pred_ra_deg=205.88903218124662, pred_dec_deg=-15.381163692732573, pred_vra_degpday=1.5288441441416185, pred_vdec_degpday=-0.8115017070879296, delta_ra_arcsec=1.0231815394945443e-10, delta_dec_arcsec=8.952838470577262e-11, distance_arcsec=1.9897641567037682e-10, dataset_id='test_dataset_2'), ...]

"why two values instead of a single value?"

The two objects being returned have distinct schemas and represent distinct types. When performing work on the results of a precovery search, most downstream tasks involve only the PrecoveryCandidate matches. Returning the results as a unioned iterable requires every downstream consumer to filter them out with type checking. Even the cases where both matches and misses are wanted, typically the sets are separated when additional work is performed.

A unioned dataframe is doubly difficult because now the filtering must be done heuristically by analyzing column values (e.g. df.dropna(subset=["mjd"])) and hoping that the resulting subset is valid. This unioned schema also causes problems when attempting to cast column types, but some fields are null, such as casting observation_id to str which in turn causes the frame candidate row NaNs to become literal "nan".

Why not two dataframes?

Dataframes have very limited typing semantics. It is difficult to make guarantees about nullability. By returning more specific types, downstream consumers can more easily inspect and react to individual rows. I've added two utility functions to initialize dataframes from the dataclasses when wanted (as_df = candidates_to_dataframes(precovery_candidates)).

Why always return frame candidates?

There is almost no perceived performance benefit in avoiding returning the frame candidates. By always returning a tuple (matches, misses), the contents of the return values are never surprising. If a downstream consumer does not care about the frame candidates, it can simply discard them.

Other changes

I moved the sorting logic further up into the main PrecoveryDatabase.precover function (precovery.precovery_db.sift_candidates). This ensures that the precovery.main.precover function is just syntactic sugar for the db.precover function and that the two return identical results. Previously the functional version performed separate sorting and type casting (observation_id -> str) that differed from the streaming style results of the db method.