B612-Asteroid-Institute / precovery

Fast precovery of small body observations at scale
BSD 3-Clause "New" or "Revised" License

Add dataset_id to frames, add datasets table to index.db #27

Closed moeyensj closed 2 years ago

moeyensj commented 2 years ago

This PR adds a dataset_id column to the frames table:

id | dataset_id | obscode | exposure_id | filter | mjd | healpixel | data_uri | data_offset | data_length
1 | NSC_DR2 | W84 | c4d_130802_044515_ooi_z_a1 | z | 56506.198093 | 4217 | frames_00000000.data | 0 | 12829
2 | NSC_DR2 | W84 | c4d_130802_044515_ooi_z_a1 | z | 56506.198093 | 4216 | frames_00000000.data | 12829 | 690
3 | NSC_DR2 | W84 | c4d_130802_044515_ooi_z_a1 | z | 56506.198093 | 4210 | frames_00000000.data | 13519 | 412
4 | NSC_DR2 | W84 | c4d_130802_044515_ooi_z_a1 | z | 56506.198093 | 4211 | frames_00000000.data | 13931 | 7146
5 | NSC_DR2 | W84 | c4d_130802_044615_ooi_z_a1 | z | 56506.198794 | 4217 | frames_00000000.data | 21077 | 16415
6 | NSC_DR2 | W84 | c4d_130802_044615_ooi_z_a1 | z | 56506.198794 | 4216 | frames_00000000.data | 37492 | 828
7 | NSC_DR2 | W84 | c4d_130802_044615_ooi_z_a1 | z | 56506.198794 | 4210 | frames_00000000.data | 38320 | 826
8 | NSC_DR2 | W84 | c4d_130802_044615_ooi_z_a1 | z | 56506.198794 | 4211 | frames_00000000.data | 39146 | 9012
9 | NSC_DR2 | W84 | c4d_130802_044713_ooi_z_a1 | z | 56506.199465 | 4217 | frames_00000000.data | 48158 | 14546
10 | NSC_DR2 | W84 | c4d_130802_044713_ooi_z_a1 | z | 56506.199465 | 4216 | frames_00000000.data | 62704 | 828

The column is also exposed on the PrecoveryCandidate and FrameCandidate classes, so users can tell which dataset each observation came from. This will be useful as we start indexing more data.
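As a rough illustration of what the new field enables downstream, here is a minimal sketch that groups candidates by dataset. CandidateSketch is a hypothetical stand-in with only a few fields; it is not the real PrecoveryCandidate, whose full set of attributes is not shown in this PR.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class CandidateSketch:
    """Hypothetical stand-in for PrecoveryCandidate (the real class has more fields)."""

    dataset_id: str
    obscode: str
    exposure_id: str
    mjd: float


def group_by_dataset(
    candidates: Iterable[CandidateSketch],
) -> Dict[str, List[CandidateSketch]]:
    """Bucket candidates by the dataset they were indexed from."""
    grouped: Dict[str, List[CandidateSketch]] = defaultdict(list)
    for candidate in candidates:
        grouped[candidate.dataset_id].append(candidate)
    return dict(grouped)


# Values taken from the frames table shown in this PR.
candidates = [
    CandidateSketch("NSC_DR2", "W84", "c4d_130802_044515_ooi_z_a1", 56506.198093),
    CandidateSketch("NSC_DR2", "W84", "c4d_130802_044615_ooi_z_a1", 56506.198794),
]
by_dataset = group_by_dataset(candidates)
```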

A datasets table is also added to track metadata for the input dataset. All columns except id are optional and nullable.

id | name | reference_doi | documentation_url | sia_url
NSC_DR2 | NOIRLab Source Catalog (DR2) | https://doi.org/10.3847/1538-3881/abd6e1 | https://datalab.noirlab.edu/nscdr2/index.php | https://datalab.noirlab.edu/sia/nsc_dr2
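The "optional and nullable" layout could be expressed as a SQLite schema along these lines. This is a sketch only: the column types and the exact DDL are assumptions, since the PR description does not show how the table is actually declared.

```python
import sqlite3

# Assumed DDL: only id is required; every metadata column may be NULL.
DATASETS_DDL = """
CREATE TABLE IF NOT EXISTS datasets (
    id TEXT PRIMARY KEY NOT NULL,
    name TEXT,
    reference_doi TEXT,
    documentation_url TEXT,
    sia_url TEXT
)
"""


def create_datasets_table(con: sqlite3.Connection) -> None:
    """Create the datasets table if it does not exist yet."""
    con.execute(DATASETS_DDL)
    con.commit()


con = sqlite3.connect(":memory:")
create_datasets_table(con)

# A dataset can be registered with its id alone; the rest stays NULL.
con.execute("INSERT INTO datasets (id) VALUES (?)", ("NSC_DR2",))
row = con.execute("SELECT id, name, sia_url FROM datasets").fetchone()
```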

If we merge this PR, existing indexed observation databases will need to be updated with the new column and table. This can be accomplished with the following snippet:

import os
import shutil
import sqlite3 as sql

import pandas as pd

# Precovery database and indexed observations directory location
DB_DIR = "/epyc/ssd/users/moeyensj/precovery/precovery_data/nsc/precovery_defrag_db_032"
DB = os.path.join(DB_DIR, "index.db")

# Back up the index before touching it. The migration below runs only if no
# backup exists yet, so re-running the snippet does not modify the database
# a second time.
DB_COPY = os.path.join(DB_DIR, "index.db_backup")
if not os.path.exists(DB_COPY):
    shutil.copyfile(DB, DB_COPY)

    con = sql.connect(DB)

    # Insert dataset_id as the second column of frames and rewrite the table.
    # Note: if_exists="replace" drops and recreates the table, so any primary
    # key or indexes defined on the original frames table are not carried over.
    frames = pd.read_sql("""SELECT * FROM frames""", con)
    frames.insert(1, "dataset_id", "NSC_DR2")
    frames.to_sql("frames", con, if_exists="replace", index=False)

    # Create the datasets table with a single row describing NSC DR2.
    datasets = {
        "id" : ["NSC_DR2"],
        "name" : ["NOIRLab Source Catalog (DR2)"],
        "reference_doi" : ["https://doi.org/10.3847/1538-3881/abd6e1"],
        "documentation_url" : ["https://datalab.noirlab.edu/nscdr2/index.php"],
        "sia_url" : ["https://datalab.noirlab.edu/sia/nsc_dr2"],
    }
    datasets = pd.DataFrame(datasets)
    datasets.to_sql("datasets", con, if_exists="replace", index=False)
    con.close()
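
The pandas round trip above rewrites the whole frames table, and `to_sql(..., if_exists="replace")` drops and recreates it without the original primary key or indexes. A possible alternative, sketched here under assumptions, is to add the column in place with plain SQL. The helper name add_dataset_id_column is hypothetical, and unlike the snippet above it appends dataset_id as the last column rather than the second; whether precovery relies on column order is not stated in this PR.

```python
import sqlite3


def add_dataset_id_column(con: sqlite3.Connection, dataset_id: str) -> None:
    """Add frames.dataset_id in place and backfill rows that lack a value."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = [row[1] for row in con.execute("PRAGMA table_info(frames)")]
    if "dataset_id" not in columns:
        con.execute("ALTER TABLE frames ADD COLUMN dataset_id TEXT")
    # Only fill rows that have no dataset yet, so the call is safe to repeat.
    con.execute(
        "UPDATE frames SET dataset_id = ? WHERE dataset_id IS NULL",
        (dataset_id,),
    )
    con.commit()
```

It would be called as `add_dataset_id_column(sqlite3.connect(DB), "NSC_DR2")` after taking the same backup as above.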

moeyensj commented 2 years ago

An additional commit adds support for the dataset metadata fields in index_observations.py. Usage is as follows:

python index_observations.py \
    /mnt/data/projects/thor/thor_data/nsc/preprocessed \
    /mnt/data/projects/precovery/precovery_data/nsc/precovery_month_db_32_test \
    NSC_DR2 \
    --nside 32 \
    --dataset_name "NOIRLab Source Catalog (DR2)" \
    --reference_doi https://doi.org/10.3847/1538-3881/abd6e1 \
    --documentation_url https://datalab.noirlab.edu/nscdr2/index.php \
    --sia_url https://datalab.noirlab.edu/sia/nsc_dr2
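
For readers unfamiliar with the script, the command line above could be parsed with something like the following argparse sketch. The flag names come from the usage shown here; the positional argument names (input_dir, database_dir, dataset_id), the help strings, and the defaults are assumptions, and the real index_observations.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the usage shown in this comment (sketch only)."""
    parser = argparse.ArgumentParser(
        description="Index observations into a precovery database (sketch)."
    )
    # Positional names are hypothetical; only their order is taken from the usage.
    parser.add_argument("input_dir", help="directory of preprocessed observations")
    parser.add_argument("database_dir", help="directory holding index.db and frame data")
    parser.add_argument("dataset_id", help="short dataset identifier, e.g. NSC_DR2")
    parser.add_argument("--nside", type=int, help="HEALPix nside used for indexing")
    # Metadata fields are optional, matching the nullable datasets columns.
    parser.add_argument("--dataset_name", default=None)
    parser.add_argument("--reference_doi", default=None)
    parser.add_argument("--documentation_url", default=None)
    parser.add_argument("--sia_url", default=None)
    return parser


args = build_parser().parse_args(
    [
        "/mnt/data/projects/thor/thor_data/nsc/preprocessed",
        "/mnt/data/projects/precovery/precovery_data/nsc/precovery_month_db_32_test",
        "NSC_DR2",
        "--nside", "32",
        "--dataset_name", "NOIRLab Source Catalog (DR2)",
    ]
)
```

Leaving the metadata flags unset yields None for each, which lines up with the nullable columns of the datasets table.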