Proposal: Store a summary of every test run in a dedicated log file

jwodder commented 1 month ago

Proposal: Make the check and test-files subcommands of dandisets-healthstatus append one summary line per test run to one or more record file(s). Unlike the status.yaml files, entries are never overwritten or removed.

The record file(s) will be stored in the same repository used for #72. Possible paths for the file(s) include:

records.jsonl — one file for everything
{dandiset_id}/.dandi/records.jsonl — one record file per Dandiset (placed under .dandi/ so as not to collide with any assets)
{dandiset_id}/.dandi/records/{year}.jsonl — one record file per Dandiset and year
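
As a rough illustration of the third option, the record file for a given Dandiset and test start time could be derived as below. The function name `record_path` and the `root` parameter are hypothetical, not part of dandisets-healthstatus, and bucketing by UTC year is an assumption.

```python
from datetime import datetime, timezone
from pathlib import Path


def record_path(root: Path, dandiset_id: str, when: datetime) -> Path:
    """Return the per-Dandiset, per-year record file for a test started at `when`."""
    # Assumption: bucket by the UTC year so that runs near New Year
    # land in a predictable file regardless of the local timezone.
    year = when.astimezone(timezone.utc).year
    return root / dandiset_id / ".dandi" / "records" / f"{year}.jsonl"
```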

The record file(s) will be in JSON Lines format, with each line being a JSON object containing the following fields:

asset (object) — details on the asset the test was run on; fields:
- dandiset_id (string)
- dandiset_version (string) — For now, this will always be "draft"
- path (string) — the asset path
- size (integer) — the asset's blob size
- (maybe?) asset_id (string)
- (maybe?) blob_id (string)
- (maybe?) modified (timestamp) — the asset's modified property from the Archive
- (maybe?) blob_modified (timestamp) — the asset's blobDateModified metadata field
test (string) — the name of the test ("pynwb_open_load_ns" or "matnwb_nwbRead")
timestamp (timestamp) — starting time of the test process
duration (float) — duration of the test process in seconds
outcome (string) — "pass", "fail", or "timeout"
environment — an object listing the versions of software used by the tests, the same as the versions mappings currently present in the status.yaml files
- (maybe) environment could also include a dandisets-healthstatus field whose value equals the current (short?) Git commit hash of the code
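
For concreteness, here is a minimal sketch of what one record and the append step might look like. The helper `append_record` is hypothetical, all field values are made up for illustration, and the optional "(maybe?)" fields are left out.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_record(path: Path, record: dict) -> None:
    """Append one JSON object as a single line; existing lines are never touched."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fp:
        fp.write(json.dumps(record, sort_keys=True) + "\n")


# All values below are invented for illustration only.
example_record = {
    "asset": {
        "dandiset_id": "000000",
        "dandiset_version": "draft",
        "path": "sub-01/sub-01_ses-01.nwb",
        "size": 1048576,
    },
    "test": "pynwb_open_load_ns",
    "timestamp": datetime(2024, 7, 25, 9, 25, 54, tzinfo=timezone.utc).isoformat(),
    "duration": 12.5,
    "outcome": "pass",
    "environment": {"hdmf": "3.14.2", "matnwb": "2.6.0.2", "pynwb": "2.8.1"},
}
```

Calling `append_record(path, example_record)` twice leaves two lines in the file, since the file is only ever opened in append mode.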

This will allow looking back through records of past test runs (e.g., to compare past and present runtimes) without having to muck about with Git history.
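
As a sketch of that kind of lookback, the following reads one record file and computes the median duration per asset and test. The function name `median_durations` is hypothetical, and it assumes every line has the `asset.path`, `test`, and `duration` fields described above.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import median


def median_durations(path: Path) -> dict:
    """Map (asset path, test name) to the median duration over all recorded runs."""
    durations = defaultdict(list)
    with path.open(encoding="utf-8") as fp:
        for line in fp:
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            durations[(rec["asset"]["path"], rec["test"])].append(rec["duration"])
    return {key: median(values) for key, values in durations.items()}
```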

(Compare #83, which proposes storing the above information in the status.yaml files.)

@yarikoptic: Your thoughts?

Additional items added based on discussion below:

Also emit records for new, untested assets
- To implement this, whenever a new asset is found, add an entry for it to status.yaml under each test with status "untested" (or "new"?).
Replace both environment above and the versions field in status.yaml with env_id, an ID into a separate JSON (or JSON Lines?) file
- Environment info should (in addition to current software versions) include at least uname -a, numpy version, and h5py version.
- Possibilities for environment IDs:
  - Timestamp at which entry was added to environments file
  - Combination of versions & date, e.g., 20240725-092554_hdmf-3.14.2_matnwb-2.6.0.2_pynwb-2.8.1
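
A possible way to build the second kind of ID is sketched below. The function name `make_env_id` is hypothetical, and sorting the packages by name is an assumption made so that the same versions always produce the same suffix.

```python
from datetime import datetime


def make_env_id(versions: dict, when: datetime) -> str:
    """Build an ID from a date stamp followed by name-version pairs."""
    stamp = when.strftime("%Y%m%d-%H%M%S")
    # Sort by package name so the ordering is deterministic.
    parts = [f"{name}-{version}" for name, version in sorted(versions.items())]
    return "_".join([stamp, *parts])
```

With the versions from the example above, `make_env_id({"pynwb": "2.8.1", "hdmf": "3.14.2", "matnwb": "2.6.0.2"}, datetime(2024, 7, 25, 9, 25, 54))` yields `20240725-092554_hdmf-3.14.2_matnwb-2.6.0.2_pynwb-2.8.1`.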

yarikoptic commented 1 month ago

Thank you for this and #83 -- I love the idea of storing more metadata about test runs and the properties of the underlying assets/blobs. I think I lean more toward this over #83.

I still think that grouping by pass/fail is very useful, so I dislike that #83 would remove that
- On the other hand, grouping is a pain if I have an asset in mind and need to find out whether it passed or failed: scrolling up to the group name is tedious. Maybe status.yaml could just "duplicate" the outcome field to mitigate.
I love the idea of an "audit of runs" (records) proposed in this issue, since it could indeed be handy to see differences in test run times, etc.
- We seem to have no ability to record the full list of assets as in #83. What if we add a status untested and emit a record for an asset when a new asset is detected but not yet scheduled to be tested in the run? Then we could also deduce how long it took from detection of a new asset to actually getting it tested.
- The current grouped status.yaml files could then simply be "reproduced" from the records at any point.
- Keeping records per dandiset_id would be helpful
- {dandiset_id}/.dandi/records/{year}.jsonl sounds good to me since it would allow for easier cleaning
- Given that we have thousands of assets, I have concerns similar to those in #72 about the growing history, but we might "avoid" the problem with that last option, which I think of as analogous to git-annex forget (and maybe git replace grafting, to keep some archive with the full history).
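
To illustrate how a grouped view could be regenerated from the records, here is a small sketch. The function name `latest_outcomes` is hypothetical, and the returned mapping (test name to outcome to asset paths) is a simplification, not the actual status.yaml schema. It also assumes all timestamps share the same UTC offset and precision, so they compare correctly as plain strings.

```python
import json
from pathlib import Path


def latest_outcomes(path: Path) -> dict:
    """Group asset paths by the outcome of their most recent run, per test."""
    latest = {}  # (test, asset path) -> (timestamp, outcome)
    with path.open(encoding="utf-8") as fp:
        for line in fp:
            if not line.strip():
                continue
            rec = json.loads(line)
            key = (rec["test"], rec["asset"]["path"])
            # Keep only the newest record for each (test, asset) pair.
            if key not in latest or rec["timestamp"] >= latest[key][0]:
                latest[key] = (rec["timestamp"], rec["outcome"])
    grouped = {}
    for (test, asset_path), (_, outcome) in sorted(latest.items()):
        grouped.setdefault(test, {}).setdefault(outcome, []).append(asset_path)
    return grouped
```
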
With code separation (as discussed in #72) keeping only summary/extract statuses in a repo sounds odd but would allow anyone to quickly clone it, so I still think that we might want to keep such a detailed log and outputs (#72) in a separate repo which we might "forget" etc, and just link it as a submodule.
environment: If we are to retain all such good provenance, I wonder if we should also get top level singular environments.jsonl where we would create records (each identified by an env_id, an ISO date stamp taken when a change in setup is detected) providing a more detailed description of the system, beyond only the versions of top-level modules (output of uname -a for kernel info, versions of numpy, h5py, etc.). We would collect this information upon initiation of a run, compare it to the last record, and possibly record a new one. The test run record would then refer to that env_id (instead of the versions of top-level modules). This would allow us to notice changes in setup, identify test runs for the same setup, etc. I believe this is somewhat reminiscent of what asv does in its machine.json or similar.
In the above, I assume timestamp is more of a datetime stamp, i.e., a fully fledged ISO datetime stamp, right?
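
The collect, compare, and possibly-record step described above might look roughly like this. Both function names and the layout of each environments.jsonl line (`env_id` plus `details`) are hypothetical, and `platform.uname()` is used as a portable stand-in for the output of uname -a.

```python
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def describe_environment(versions: dict) -> dict:
    """Collect system details alongside the given package versions."""
    # Assumption: the caller supplies package versions (numpy, h5py, ...).
    return {"uname": " ".join(platform.uname()), "versions": dict(versions)}


def current_env_id(path: Path, details: dict) -> str:
    """Return the last entry's env_id if `details` match it, else append a new entry."""
    last = None
    if path.exists():
        with path.open(encoding="utf-8") as fp:
            for line in fp:
                if line.strip():
                    last = json.loads(line)
    if last is not None and last["details"] == details:
        return last["env_id"]
    # Setup changed (or first run): record a new environment entry.
    env_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S.%fZ")
    with path.open("a", encoding="utf-8") as fp:
        fp.write(json.dumps({"env_id": env_id, "details": details}) + "\n")
    return env_id
```

Comparing only against the last entry means that switching back to an earlier setup creates a new entry rather than reusing the old ID; whether that is desirable is an open design choice.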

jwodder commented 1 month ago

@yarikoptic

I think I lean more toward this over https://github.com/dandi/dandisets-healthstatus/issues/83

So should #83 be closed as "not planned"?

May be just status.yaml could "duplicate" outcome field to mitigate.

If you'd like that, please file a dedicated issue.

what if to add status untested and add a record for an asset when new asset detected but was not scheduled to be tested yet in the run etc.

Currently, if dandisets-healthstatus detects an asset that exists on the Archive but not in status.yaml, the asset won't be added to status.yaml unless it's selected for testing. Thus, if we emit a record for new assets, the same records would be emitted every time the program is run until the respective assets are finally tested.

With code separation (as discussed in #72) keeping only summary/extract statuses in a repo sounds odd but would allow anyone to quickly clone it, so I still think that we might want to keep such a detailed log and outputs (#72) in a separate repo which we might "forget" etc, and just link it as a submodule.

To be clear, here you're just agreeing with me that we should have one repository that stores the code and a separate repository that stores all of the status.yaml files, issue 72 output records, and issue 82 event logs, correct? (However, I think making the logs+records repository a submodule of the code repository — if that's the direction you're suggesting — isn't that great of an idea, as the code isn't dependent on the logs, and updating the code repository whenever there's a commit to the logs repo would just introduce a lot of churn.)

I wonder if we should also get top level singular environments.jsonl ... and then referring to that env_id in the test record run

Would this env_id also be used to replace the versions mappings in status.yaml?

in above I assume timestamp is more of datetimestamp, i.e. would be fully fledged iso datetime stamp, right?

Yes.

jwodder commented 1 month ago

@yarikoptic Ping.

yarikoptic commented 1 month ago

May be just status.yaml could "duplicate" outcome field to mitigate.

If you'd like that, please file a dedicated issue.

https://github.com/dandi/dandisets-healthstatus/issues/86

what if to add status untested and add a record for an asset when new asset detected but was not scheduled to be tested yet in the run etc.

Currently, if dandisets-healthstatus detects an asset that exists on the Archive but not in status.yaml, the asset won't be added to status.yaml unless it's selected for testing. Thus, if we emit a record for new assets, the same records would be emitted every time the program is run until the respective assets are finally tested.

I see we already have some untested entries, but they are not test-specific (e.g. here). We can add records with status untested (relates to #86) and provide untested for each test as soon as we encounter a new asset.
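
A rough sketch of how this would avoid the repeated-record problem raised above: once a new asset is registered as untested, later runs see it as known and emit nothing further. The function name `register_new_assets` is hypothetical, and the `known` mapping is a simplified stand-in for the contents of status.yaml.

```python
def register_new_assets(known: dict, archive_paths: list, tests: list, now: str) -> list:
    """Mark unseen assets as "untested" and return the records to emit for them."""
    records = []
    for path in archive_paths:
        if path in known:
            continue  # already registered, so no duplicate record on later runs
        known[path] = {test: "untested" for test in tests}
        for test in tests:
            records.append(
                {"asset": {"path": path}, "test": test, "timestamp": now, "outcome": "untested"}
            )
    return records
```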

With code separation (as discussed in #72) keeping only summary/extract statuses in a repo sounds odd but would allow anyone to quickly clone it, so I still think that we might want to keep such a detailed log and outputs (#72) in a separate repo which we might "forget" etc, and just link it as a submodule.

To be clear, here you're just agreeing with me that we should have one repository that stores the code and a separate repository that stores all of the status.yaml files, issue 72 output records, and issue 82 event logs, correct?

I meant to have a separate repo only for (heavy) logs (this #82) as a submodule of the (lean) repo with status.yaml files. If you would like code/ to be a separate repo (and submodule), it could also be separated out.

(However, I think making the logs+records repository a submodule of the code repository — if that's the direction you're suggesting — isn't that great of an idea, as the code isn't dependent on the logs, and updating the code repository whenever there's a commit to the logs repo would just introduce a lot of churn.)

I had in mind making logs/ a submodule of this dandisets-healthstatus repo containing status.yaml files. code/ could also be a submodule if so desired, not the other way around (no logs submodule within code).

I wonder if we should also get top level singular environments.jsonl ... and then referring to that env_id in the test record run

Would this env_id also be used to replace the versions mappings in status.yaml?

Yes, I think it could; hopefully environments would not change so often as to complicate things... Maybe the environment ID could be a composition of versions + date or similar, e.g.

20240725-092554_hdmf-3.14.2_matnwb-2.6.0.2_pynwb-2.8.1

so this way we could immediately see which versions of the main libraries of interest were in use, while the actual record would provide the more gory details.

jwodder commented 1 month ago

@yarikoptic

I meant to have separate repo only for (heavy) logs (this https://github.com/dandi/dandisets-healthstatus/issues/82) as a submodule of (lean) repo status.yaml files.

Do you mean that the event logs should be in a separate submodule from that used for the issue 72 output logs? Which one gets the logs/ name?

yarikoptic commented 1 month ago

I cited this 82, although indeed it should have been 72,
so it is a repo with stdout/stderr outputs of test runs (if you mean something else by an event, please correct me).
I think it is this repo which should have a logs/ folder, but it should be a submodule pointing to dandisets-healthstatus-logs, which would carry the actual logs.

jwodder commented 1 month ago

@yarikoptic By "event logs," I mean the JSON Lines files storing a summary of each test run as described by this issue. Where do you want those files to be stored?

jwodder commented 1 month ago

@yarikoptic Ping.

yarikoptic commented 1 month ago

OK, what about having this repo keep records.jsonl next to the status.yaml (as rendered from records.jsonl)? Then outputs/ would be a submodule collecting the stdout/stderr of each test run, corresponding to the records in records.jsonl in this repo.

jwodder commented 1 month ago

@yarikoptic

You stated above that you preferred dividing the event logs up by year, so should records.jsonl in your last comment be replaced by records/{year}.jsonl?
I don't really like keeping the status.yaml and event records in this repository when we have another one for the output logs, but OK.

yarikoptic commented 1 week ago

You stated above that you preferred dividing the event logs up by year, so should records.jsonl in your last comment be replaced by records/{year}.jsonl?

sounds good to me if we are to do per year.
