This PR changes the primary interface of precovery.precover: instead of returning a single pd.DataFrame that unions precovery candidates and frame candidates, it now returns two separate values, typed as Tuple[List[PrecoveryCandidate], List[FrameCandidate]]. The second major change removes the include_frame_candidates option; frame candidates are now always returned.
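As a rough illustration of the new return shape, the sketch below uses trimmed-down stand-ins rather than the real classes. The field exposure_id, the sample values, and the function precover_stub are assumptions made for the example; only mjd and observation_id are named in this description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PrecoveryCandidate:
    """Hypothetical stand-in for a match; the real class has more fields."""
    mjd: float
    observation_id: str


@dataclass
class FrameCandidate:
    """Hypothetical stand-in for a miss; exposure_id is an assumed field."""
    exposure_id: str


def precover_stub() -> Tuple[List[PrecoveryCandidate], List[FrameCandidate]]:
    """Stand-in for precovery.precover that returns (matches, misses)."""
    matches = [PrecoveryCandidate(mjd=59000.25, observation_id="obs-1")]
    misses = [FrameCandidate(exposure_id="exp-7")]
    return matches, misses


# Unpack the two lists directly; no filtering step is needed.
precovery_candidates, frame_candidates = precover_stub()

# A consumer that only needs the matches can discard the second value.
matches_only, _ = precover_stub()
```

Each list holds a single type, so a consumer can iterate over it without checking what kind of row it has.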
Why two values instead of a single value?
The two kinds of object being returned have distinct schemas and represent distinct types. Most downstream work on precovery search results involves only the PrecoveryCandidate matches, so returning a unioned iterable forces every consumer to filter out the frame candidates with type checks. Even when both matches and misses are wanted, the two sets are typically separated before any further work is done.
A unioned dataframe is doubly difficult, because the filtering must then be done heuristically by inspecting column values (e.g. df.dropna(subset=["mjd"])) and hoping that the resulting subset is valid. The unioned schema also causes problems when casting column types while some fields are null: casting observation_id to str, for example, turns the NaNs in the frame candidate rows into the literal string "nan".
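The "nan" problem can be reproduced on a single value in plain Python, without pandas. This is only a sketch of the effect, not the dataframe code path itself; the variable names are made up for the example.

```python
import math

# In a unioned frame, a frame candidate row has no observation, so its
# observation_id is a float NaN.
missing_id = float("nan")
was_null = math.isnan(missing_id)

# Casting to str produces an ordinary, non-empty string.
as_text = str(missing_id)

print(was_null)       # True
print(repr(as_text))  # 'nan'
print(bool(as_text))  # True: the value no longer looks missing
```

Once the value is a string, a later null check such as dropna will not catch it.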
Why not two dataframes?
Dataframes have very limited typing semantics, and it is difficult to make guarantees about nullability. By returning more specific types, downstream consumers can more easily inspect and react to individual rows. I've added two utility functions to initialize dataframes from the dataclasses when a dataframe is wanted (e.g. as_df = candidates_to_dataframes(precovery_candidates)).
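For intuition, converting dataclass instances into tabular records can be sketched with the standard library alone. The helper name candidates_to_records and the trimmed-down PrecoveryCandidate below are hypothetical; the actual candidates_to_dataframes implementation is not shown in this description and may differ.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class PrecoveryCandidate:
    """Hypothetical stand-in with only two of the real fields."""
    mjd: float
    observation_id: str


def candidates_to_records(candidates: List[PrecoveryCandidate]) -> List[Dict[str, Any]]:
    """Turn each dataclass instance into a plain dict keyed by field name."""
    return [asdict(candidate) for candidate in candidates]


records = candidates_to_records(
    [
        PrecoveryCandidate(mjd=59000.25, observation_id="obs-1"),
        PrecoveryCandidate(mjd=59001.50, observation_id="obs-2"),
    ]
)
print(records)
```

A list of dicts like this can be passed to the pd.DataFrame constructor, which creates one column per field. Because every record comes from the same dataclass, every column is populated for every row.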
Why always return frame candidates?
Skipping the frame candidates brings almost no perceptible performance benefit. Always returning a (matches, misses) tuple means the shape of the return value never depends on an option, so it is never surprising. If a downstream consumer does not care about the frame candidates, it can simply discard them.
Other changes
I moved the sorting logic further up into the main PrecoveryDatabase.precover function (precovery.precovery_db.sift_candidates). This ensures that precovery.main.precover is just syntactic sugar for db.precover and that the two return identical results. Previously, the functional version performed its own sorting and type casting (observation_id -> str), so its results differed from the streaming-style results of the db method.
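The delegation pattern described here can be sketched as follows. The class PrecoveryDatabaseSketch, the sort key (mjd), and the simplified return value (matches only, rather than the full tuple) are assumptions made to keep the example short; they are not the project's actual code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PrecoveryCandidate:
    """Hypothetical stand-in with only two of the real fields."""
    mjd: float
    observation_id: str


class PrecoveryDatabaseSketch:
    """Hypothetical stand-in for PrecoveryDatabase."""

    def __init__(self, stored: Iterable[PrecoveryCandidate]) -> None:
        self._stored = list(stored)

    def precover(self) -> List[PrecoveryCandidate]:
        # Sorting happens here, once, so every caller sees the same order.
        # Sorting by mjd is an assumption made for this sketch.
        return sorted(self._stored, key=lambda candidate: candidate.mjd)


def precover(db: PrecoveryDatabaseSketch) -> List[PrecoveryCandidate]:
    """Functional wrapper: delegates without re-sorting or re-casting."""
    return db.precover()


db = PrecoveryDatabaseSketch(
    [
        PrecoveryCandidate(mjd=59002.0, observation_id="obs-b"),
        PrecoveryCandidate(mjd=59001.0, observation_id="obs-a"),
    ]
)
via_function = precover(db)
via_method = db.precover()
```

Since the wrapper adds no processing of its own, the two call paths cannot drift apart.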