Handle case when database doesn't have exposure_id

B612-Asteroid-Institute / precovery

Fast precovery of small body observations at scale

Handle case when database doesn't have exposure_id #36

Open paulobarrera14 opened 2 years ago

paulobarrera14 commented 2 years ago

The ZTF database had no exposure IDs, so we turned that column into a column of empty strings. This allowed precovery to index the HDF5 file but caused other issues when running the precovery search: when we looked for a specific asteroid, all the "mjd_utc" values fell on the same night. We need a solution for datasets that don't have exposure IDs.

    # Use the string form of each observation time as its exposure ID
    df5["exposure_id"] = df5["mjd_utc"].apply(str)
    df5.to_hdf('ztf_observations_610_624.h5', key='data', mode='w', format='table', encoding='utf-8')

This is what we ended up doing so that "exposure_id" could have values based on "mjd_utc".

ntellis commented 2 years ago

To clarify, how much data was indexed for this? Just one night? Were there unique values of mjd_utc for separate exposures, and they were just all falling on the same night?

Otherwise I don't see any problem with using the midpoint time of the exposure as a unique identifier for that exposure. I can't see a situation where there would be overlap there, unless there's some truncation on the mjd_utc field you are referring to.

paulobarrera14 commented 2 years ago

The indexed data spans two weeks. Yes, there were unique values of mjd_utc for that two-week period in the .hdf5 file, but not in the indexed file.

Having the exposure_id column as an empty string caused an error in the indexing, so that when I did a search with an extremely high tolerance, all the mjd_utc values were 58364.1304861.

Joachim fixed this by adding the code I put above and re-indexing the database, and wanted me to file this issue so that it's documented and fixed later.

moeyensj commented 2 years ago

The issue is in this function: https://github.com/B612-Asteroid-Institute/precovery/blob/c7ac7d5df3c27a5424090aae7365f46f45298e32/precovery/sourcecatalog.py#L78

The code assumes each exposure has a unique ID, so when we pass an empty string for all exposures, it behaves as if we only loaded one exposure. The first observation time read is then the only observation time mapped to all indexed observations.
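A minimal sketch of that failure mode, for anyone reading along. This is not the code in sourcecatalog.py; `index_exposure_times` is a hypothetical stand-in that keys observation times by exposure ID, and only the first time value comes from this thread (the other two are made up for illustration).

```python
# Hypothetical illustration; NOT the actual precovery implementation.
observations = [
    {"exposure_id": "", "mjd_utc": 58364.1304861},  # value reported in this issue
    {"exposure_id": "", "mjd_utc": 58365.2011111},  # made-up value
    {"exposure_id": "", "mjd_utc": 58370.3122222},  # made-up value
]


def index_exposure_times(rows):
    """Map each exposure ID to the first observation time seen for it."""
    times = {}
    for row in rows:
        # setdefault keeps the first time and ignores later ones for the same ID
        times.setdefault(row["exposure_id"], row["mjd_utc"])
    return times


exposure_times = index_exposure_times(observations)

# Looking every observation back up by exposure ID returns the same time,
# because all rows share the single ID "".
indexed_times = [exposure_times[row["exposure_id"]] for row in observations]
print(indexed_times)
```

With every ID equal to the empty string there is only one key, so all three observations come back with the first time read.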

We need to support the LSST use case, where each observation in each exposure will have a slightly varying time of observation; implementing that solution could conceivably also fix this issue. Regardless, we will need to come up with an in-house schema to assign exposure IDs when the dataset has none, or delegate that to the user.
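One possible shape for such a schema, sketched under assumptions: `fill_missing_exposure_ids` is a hypothetical helper (not part of precovery) that fills empty or missing exposure IDs from the observation time, in the same spirit as the workaround above. It only makes sense for datasets where all observations in an exposure share one time, so it would not cover the LSST case.

```python
def fill_missing_exposure_ids(rows, time_key="mjd_utc", id_key="exposure_id"):
    """Return copies of rows with empty or missing exposure IDs filled in.

    Hypothetical helper: the ID is derived from the observation time, so
    observations sharing a time are treated as one exposure.
    """
    filled = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's data
        if not new_row.get(id_key):
            # repr() of a float round-trips exactly, so no precision is
            # lost and distinct times cannot collapse into one ID
            new_row[id_key] = repr(float(new_row[time_key]))
        filled.append(new_row)
    return filled


rows = [
    {"exposure_id": "", "mjd_utc": 58364.1304861},
    {"mjd_utc": 58365.2011111},                          # made-up value, no ID column
    {"exposure_id": "exp_42", "mjd_utc": 58370.3122222},  # made-up existing ID
]
filled_rows = fill_missing_exposure_ids(rows)
print([row["exposure_id"] for row in filled_rows])
```

Existing IDs are left untouched, so the same call can be applied to datasets that only partially lack them.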