File Tracking Database Needs

villarrealas commented 5 years ago

This can currently be found at: /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/run201903/outputs/00201989to00262897/prototype_transfer_database.db

Using Python and sqlite3, you can pull the relevant data as follows:

import sqlite3 conn = sqlite3.connect('prototype_transfer_database.db') c = conn.cursor() data = c.execute('SELECT * FROM VISIT').fetchall() visit = data[0]

This issue is to help us track what changes would be useful for the community, so that it can be better integrated and automated by the workflow.

katrinheitmann commented 5 years ago

Is it possible to rename a github repo? It's pretty confusing that this is called ALCF_1.2i (which is long over). Thanks!

villarrealas commented 5 years ago

I think Adrian needs to rename the repository, but we could definitely rename it.

airnandez commented 5 years ago

I would like to make some suggestions, assuming the purpose of this database is to track the progress of the simulation campaign for a given run.

Structure the database around two tables, visits and files, instead of a single table
The visits table would contain the information about each visit (one visit per row) and the files table would be used to record the set of files belonging to a given visit (one file per row).
The visits table could have the following attributes:
- visit_id: identifier of this visit (e.g. 123456)
- created_on: time stamp this visit was recorded in the database, in UTC, format RFC3339, eg. 2019-04-10T15:04:05Z. Useful to retrieve the newly created visits since a given date.
- run_id: identifier of the run this visit belongs to (e.g. "Run2.1i")
- status: status of this visit (e.g. "prod", "dev", "test",...). Only data associated to visits with status prod are considered for the image processing pipelines and therefore relevant for the inter-site data transfer system.
- filter: the filter used for the images of this visit (e.g. u, g, r, ...)
- simulator_id: the identifier of the software used for simulating this visit (e.g. imsim v0.5.0). Question: do we need to record also the configuration (or command line options) used for producing this visit?
- year: the simulation campaign year this visit belongs to
- field: e.g. WFD, UDF
- completed: boolean flag to mark a visit as completed. A completed visit is one for which all the files have been simulated and quality controlled.
The files table could have the following attributes:
- visit_id: the identifier of the visit (in the visits table), this file belongs to
- production_site: identifier of the site this file was produced (e.g. NERSC)
- path: the absolute path of this file where this file is located (e.g. /global/projecta/projectdirs/lsst/...)
- raft_id: identifier of the LSST camera raft (e.g. (3,2)). See here
- sensor_id: identifier of the LSST camera sensor, within the raft, e.g. (2,1). See here
- quality_controlled: boolean flag to record that this file has already passed all the quality control procedures and is therefore considered good to be used for the image processing pipelines.
- created_on: time stamp this file was recorded in the database, in UTC, format RFC3339, eg. 2019-04-10T15:04:05Z. Useful to retrieve the newly created files since a given date.

The data transfer system would read the production database to extract the relevant information (e.g. recently created visits, recently added files) necessary to trigger the regular inter-site data transfers. The data transfer system would use a different database to track which files (and visits) have already been transferred.

I would also like to mention some concern of using sqlite which is not designed to support concurrent write access. Although it is a very robust and convenient tool, in our use case there may be a relatively high number of concurrent updates to the database, which requires an exclusive write lock to the sqlite file (this is internally done by the sqlite library). This may be a costly operation in a shared file system.

Hope this helps.

Edits: use created_on instead of timestamp as the name of the column for both visits and files tables. Use raft_id, sensor_id, visit_id, etc. instead of raftID, sensorID, visitID, etc. for consistency.

villarrealas commented 5 years ago

@airnandez Thanks for the detailed thoughts! If we think that there is going to be concurrent updates to the database, there is certainly merit to avoiding using sqlite3 for these purposes. Would you have a particular database tool in mind to replace this?

airnandez commented 5 years ago

Hosting the database in a central server (e.g. PostgreSQL) could be an alternative to consider but it would actually depends on which center would host this database (e.g. NERSC, ANL, etc.).

I would suggest to continue with sqlite for this development phase. If we decide/need to host the database on a server, the structure of the database should not change too much, although the Python code to populate and read the database would change. You may want to consider SQLalchemy to mitigate this risk.

villarrealas commented 5 years ago

@airnandez Changing to SQLalchemy has been a lifesaver, so thank you for pointing that out to me.

Anyway, the requested changes have been made and a new prototype is ready for you to play with. There are some mild changes worth noting:

1) Year is currently not a field that is utilized. However, that is something that should be a trivial cut based on visit_id.

2) Similarly, field is currently unimplemented. Right now I do not know of any distinction in the pipeline that would allow for an automated distinguishing between WDF and DDF fields. I imagine this might change in practice, and then I'll work something in accordingly.

3) raft_id + sensor_id are stored as a string of two numbers corresponding to the classic scheme. This simply matches file output format a little better.

Anyway, the same basic structure works, with tables named 'VISITS' and 'FILES' to pull from. It is located at: /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/run201903/outputs/00201989to00262897/sqlite_tracking_prototype.db.

airnandez commented 5 years ago

Thanks @villarrealas. The database contents looks good and seems to contain all the information we need to identify what files are ready to transfer and when to trigger the transfers.

As I see things, the file transfer system would periodically scan this database for new completed visits since the last scan and queue those files for transfer. I will build a prototype of such as system and start testing it.

I propose to leave this issue open until I have something functional, in case we need to do some modifications to the database.

airnandez commented 5 years ago

Looking at the contents of the prototype database, I noticed that there are visits marked as completed but the files belonging to them are not marked as quality-controlled.

I would like to suggest we adopt a convention on the semantics of a completed visit. From the point of view of the inter-site data transfer, I would propose we consider a visit completed once all its files are ready for processing, therefore can be transferred: that would mean, once all (or the appropriate subset of) the files belonging to that visit have been checked and considered good for image processing.

Any other convention would be good, but the important thing in my opinion is that we agree on the meaning of those flags in the database.

There was a discussion about this in today's DC2 meeting: @jchiang87 or @heather999 may want to comment.

heather999 commented 5 years ago

I finally took a look at the schema for the two tables: files and visits:

sqlite> .schema files
CREATE TABLE files (
    file_path VARCHAR(250) NOT NULL, 
    visit_id INTEGER, 
    production_site VARCHAR(250) NOT NULL, 
    raft_id VARCHAR(250) NOT NULL, 
    sensor_id VARCHAR(250) NOT NULL, 
    quality_controlled BOOLEAN NOT NULL, 
    created_on VARCHAR(250) NOT NULL, 
    PRIMARY KEY (file_path), 
    FOREIGN KEY(visit_id) REFERENCES visits (visit_id), 
    CHECK (quality_controlled IN (0, 1))
);
sqlite> .schema visits
CREATE TABLE visits (
    visit_id INTEGER NOT NULL, 
    created_on VARCHAR(250) NOT NULL, 
    run_id VARCHAR(250) NOT NULL, 
    status VARCHAR(250) NOT NULL, 
    filter_id VARCHAR(250) NOT NULL, 
    simulator_id VARCHAR(250) NOT NULL, 
    completed BOOLEAN NOT NULL, 
    PRIMARY KEY (visit_id) ON CONFLICT IGNORE, 
    CHECK (completed IN (0, 1))
);

I think it is still useful to have a field that denotes that a visit is done in terms of processing, which is what I believe the "completed" field indicates. But I am also wondering how we treat visits that are "completed", but one or more of the visit's files fail "quality controlled". Recently we have been skipping such visits entirely in our DM processing. Would we want IN2P3 to be checking for "completed" visits and check on the "quality_controlled" status of each file in that visit? Or should we have a field in the visits table that indicates whether all files passed quality_controlled? And if not all files pass "quality_controlled", do we skip that visit in the transfer?

We also discussed the possibility of adding a field to the visit table that indicates whether this visit is in the DDF or not. @jchiang87 mentioned he has code that can be used to make that determination.

Once the tables are settled, who will be in charge of filling the quality_controlled field and any DDF field? The descim collaboration user owns the sqlite db - so this can be handled by anyone with access to that collab account. I'm planning to do a check on the exiting 2 years of data to see if they generally pass our "quality_controllled" criteria (but will not be updating this sqlite DB at this point):

No checkpoint and lsst_a file pairs
16 HDUs in the lsst_a*.fits files
lsst_a*.fits files pass fverify

jchiang87 commented 5 years ago

Is there some reason we aren't processing a visit unless it has all 189 sensors? As far as I can tell that's only problematic when making sky flats and maybe sky frames, but we aren't using sky exposures for either of those purposes. I think we should transfer and run through DM image processing all raw files that have 16 complete image HDUs.

heather999 commented 5 years ago

That's not what I meant exactly.. we are not processing visits where one or more of the FITs files associated with that visit was found to be incomplete.. meaning that we found a checkpoint and lsst_a*.fits pair that indicated something went wrong with the FITs writing and it terminated abnormally. We certainly would process visits with less than 189 sensors, if that is all that visit was meant to include. In that case, I would expect both "completed" to be True and all the associated fits files to pass "quality_controlled"... so that visit would be transferred and processed.

It may be that we should just go ahead and process completed visits, where some of the files do not pass "quality_controllled.. in that case, I would expect IN2P3 to transfer all files that do pass "quality_controlled", and ignore any "quality_controlled"=False files. However we would need to keep track of the fact that we do not have the complete set of files associated with that visit - I think...at least somewhere... so that the next time IN2P3 does a transfer, it could pick up any "fixed" files that now do pass "quality_controlled"... if we plan to re-run those sensor visits.

villarrealas commented 5 years ago

Apologies for being a bit late on this - I was a bit out of commission yesterday with a nasty cold. Now I can think again though!

I took the "completed" field under visit to signify that all sensors for that visit have been confirmed as generated. That seemed to make sense to me in terms of automation. I am inclined to agree along the lines of Heather, where I think the behavior would be to pull from completed visits (in terms of sensors generated) with all quality controlled files.

That being said, it would be fairly trivial to add an additional "quality_completed" flag for the visit to show that all files have went under quality control as well. Maybe rename "completed" to "sensors_completed". Would that be good for everyone?

heather999 commented 5 years ago

Someone should probably respond to your question @villarrealas.. sorry.. I think adding a "quality_completed" flag to the visit table and renaming "completed" to maybe "visit_completed" would be good. But that's semantics whether it should be "sensors_completed" or "visit_completed" - my little brain understands "visit_completed" better.

airnandez commented 5 years ago

I would suggest:

to add a column to the visits table named quality_controlled of boolean type. The semantics of this column is that when its value is true, all the files belonging to this visit have been both generated and quality-controlled.
to rename the column completed of the visits table to visit_completed. When the value of this column is true, all the files belonging to this visit have been generated, but not necessarily quality-controlled yet.

The intention is to trigger the transfer of files to IN2P3 as soon as they are quality-controlled. Marking the visits as completed is convenient (although not strictly necessary) to understand that no more files are expected to be added to a visit, so if all its files have already been transferred, as far as the transfer process is concerned there is nothing more to do with that visit.

heather999 commented 5 years ago

I've taken a fresh look at the contents of the most recent DB, where @airnandez's suggestions have been applied:

/global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/Run2.1i-y2-wfd/outputs> sqlite3 sqlite_tracking_prototype.db 
SQLite version 3.8.10.2 2015-05-20 18:17:19
Enter ".help" for usage hints.
sqlite> .tables
files   visits
sqlite> .schema files
CREATE TABLE files (
    file_path VARCHAR(250) NOT NULL, 
    visit_id INTEGER, 
    production_site VARCHAR(250) NOT NULL, 
    raft_id VARCHAR(250) NOT NULL, 
    sensor_id VARCHAR(250) NOT NULL, 
    quality_controlled BOOLEAN NOT NULL, 
    created_on VARCHAR(250) NOT NULL, 
    PRIMARY KEY (file_path), 
    FOREIGN KEY(visit_id) REFERENCES visits (visit_id), 
    CHECK (quality_controlled IN (0, 1))
);
sqlite> .schema visits
CREATE TABLE visits (
    visit_id INTEGER NOT NULL, 
    created_on VARCHAR(250) NOT NULL, 
    run_id VARCHAR(250) NOT NULL, 
    status VARCHAR(250) NOT NULL, 
    filter_id VARCHAR(5) NOT NULL, 
    simulator_id VARCHAR(250) NOT NULL, 
    field VARCHAR(5) NOT NULL, 
    quality_completed BOOLEAN NOT NULL, 
    visit_completed BOOLEAN NOT NULL, 
    PRIMARY KEY (visit_id) ON CONFLICT IGNORE, 
    CHECK (quality_completed IN (0, 1)), 
    CHECK (visit_completed IN (0, 1))
);

@villarrealas are there still plans to add a column to the visits table to indicate WFD vs DDF?

I am taking the quality_controlled column in the files table to be set True when the file has been checked AND passes those checks, otherwise, quality_controlled is left set to False.
I have completed the checks on the existing y2 files, and will create a backup of the current DB and updating to set quality_controlled appropriately in both the files and visits tables.

villarrealas commented 5 years ago

You can actually see that already in - field refers to if it is WFD or DDF. Currently this is put in by hand, but in the future I'll be automating that as something pulled from our Parsl database (which should be able to determine this using the RA/DEC during image trimming).

airnandez commented 5 years ago

The prototype file tracking database was moved. It is now at NERSC at the path below:

/global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/Run2.1i-y2-wfd/outputs/sqlite_tracking_prototype.db

heather999 commented 5 years ago

@airnandez I'm trying to recall where we left things on Monday - do you have everything you need in the DB? I set the quality_controlled and quality_completed flags. Is it possible to start transferring all the Y2-WFD data or do we need to work any other details? @villarrealas I think you have a list of visits that overlap batch 1? Where is that file?

airnandez commented 5 years ago

As far as I have seen, we have in the tracking database all the information we need for triggering the transfers.

As I write this, we are doing some cleaning of the storage at IN2P3 because we were reaching our allocation.

In addition, I would like us to make a decision on the namespace we are going to use for Run2.1i. I just commented on a related issue here. Since my intention is that we follow the same namespace conventions at IN2P3 (under a different top directory), it would be helpful to have the data organised at NERSC following the agreed conventions, before triggering the transfers.

In addition, we should make sure that the reference document in the wiki is kept up-to-date and consistent to whatever is decided regarding the naming conventions.

airnandez commented 5 years ago

This is some feedback on the usage of the simulation campaign tracking database for driving the transfer of simulated images from NERSC to CC-IN2P3. The first transfer campaign that used the database located at NERSC at

/global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/sim/sqlite_tracking_v1.db

is documented here.

The things I see we can improve in the database are documented below, even if some of them do not directly affect the file transfer system:

`visits` table

1️⃣ All the rows have the same timestamp in the created_on column. This is likely due to the fact that the database was populated post-facto, for including the reorganisation of the namespace on disk. It is important that new entries in the visits table use the actual creation timestamp of the visit, otherwise, for future transfer campaigns, this would prevent from selecting visits based on the creation date of the visit:

   select distinct created_on from visits;
   2019-05-17T14:50:31.487270+00:00

2️⃣ The values of the column filter_id are not consistent:

   select distinct filter_id from visits limit 10;
   z
   node630
    u
   node174
   y
   i
   node570
   node3769
   node340
   node493

3️⃣ The value of the column quality_completed is zero (0) for all the rows:

   select distinct quality_completed from visits;
   0

Again, very likely due to the re-population of the database.

4️⃣ I would suggest that for consistency, the value of the run_id column match the directory where the visit is actually stored:

   select distinct run_id from visits limit 10;
   Run2.1i_y2wfd

I would propose this value to be Run2.1i_y1-y2-wfd, to be consistent with the name of the directory where the visit is stored which is named y1-y2-wfd.

`files` table

:one: The values of the raft_id column are not consistent. This column seems to have a invalid value for files of type checkpoint:

   select distinct raft_id  from files limit 10;
   21
   32
   01
   22
   un2.1i/sim/y1-y2-wfd/00201989to00262897/00243020/ckpt
   41
   23
   30
   03
   33

2️⃣ The values of the sensor_id column are not consistent:

   select distinct sensor_id from files limit 10;
   00
   12
   21
   im/Run2.1i/sim/y1-y2-wfd/00201989to00262897/00243020/ckpt
   10
   22
   01
   11
   02
   20

3️⃣ The values of the quality_controlled column are all set to zero (0):

   select distinct quality_controlled from files;
   0

4️⃣ The column created_on should be the date each individual file was last modified:

   select distinct created_on from files;
   2019-05-17T14:50:31.487270+00:00

When re-populating the database, one can use the date of last modification of the file, as reported by the file system.

missing features

If for production purposes we need to use the database to keep track of the centroid and checkpoint files (in addition to the images), we would need to add a type to each file, e.g. image, checkpoint, centroid, etc.

This would help the transfer system to select what files to avoid transferring: for instance, checkpoint files are not worth transferring. Although it would be possible to select the files to transfer based on the name of the file, doing so would make a coupling between the transfer system and the tool populating the database: any change in the paths (i.e .fits --> .fits.gz) would require modifications to the transfer system.

If we have the type of each file recorded in the database, the transfer system would select only the image files which have been quality controlled since the date of the last transfer campaign.
To avoid big modifications of the database when we need to move the simulation data to another location in the file system, we could record relative paths instead of absolute paths.

We could add a column to the visits table to record the absolute path of the directory where all the files belonging to each visit are located, e.g.
```
/global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/sim/y1-y2-wfd/00000000to00071840/00068022
```
and for each file in that visit we would only record the path of the file relative to the top directory of the visit, e.g.
```
lsst_a_68022_R30_S00_z.fits
```
When moving the data around, the database would need to be updated by updating the top directory of each visit only, instead of modifying the paths of all the files.
To prepare for the unavoidable future modifications in the database schema, we could record the schema version number in the database itself. That would allow the transfer system (and any other consumer of the data in the database) to adapt to the the evolutions in the schema. For instance, we could have a settings table with columns key and value (both strings) to store key-value pairs such as key: schema_version, value: 5.

heather999 commented 5 years ago

I have populated the quality_controlled field in the files table and the quality_completed field in the visits table. This version of the database is now named: /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/sim/sqlite_tracking_v2.db

We should sort out how to clean up the discrepancies that Fabio pointed out in the preceding comment.

villarrealas commented 5 years ago

Noo! It should have been v1_1! Nitpicking though.

I'll look into some of these changes and fix them for future runs. I will say that quality_controlled/quality_complete were reset due to having to repopulate the database postfact, while the timestamps are in a similar issue. I'll look into what is going on with filter_id, raft_id, and sensor_id and work on improving visit_id. A file type column should be easy enough to implement. I also see the advantage of storing relative file paths as opposed to absolute file paths.

I'll also add a table for database settings. This may take a little bit of time for me to get back to, but I think I should be able to handle all the modifications without resetting the whole table (I hope).

heather999 commented 5 years ago

Rename the file then or I can.. just so long as we have a naming convention.

How are the timestamps determined?

villarrealas commented 5 years ago

Timestamps are chosen by system time when the initial entry is created currently.

I was mostly joking about the name, by the by. It is good habit though to keep it at v1_x if we're not breaking anything for previous users and move it to v2 if we do something that would break it.

heather999 commented 5 years ago

now renamed sqlite_tracking_v1_1.db

heather999 commented 5 years ago

For Run2.1.1i agn-test while we were originally planning to use the same sqlite_tracking DB that has been in use for Run2.1i, due to technical reasons, there is a new tracking database which only contains the Run2.1.1i agn-test data: /global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1.1i/sim/sqlite_tracking_agn-test.db Antonio noted the reason for a new DB: I identified an issue in my old schema where we have issues handling files with identical base names, so adding on with a different run_id is not an option at this time. I’ll need to think of how to improve that handling. I will be adding a comment to https://github.com/LSSTDESC/ImageProcessingPipelines/issues/86 to request the data transfer to CC.

LSSTDESC / DESC_DC2_imSim_Workflow