dandi / dandisets-healthstatus

Healthchecks of dandisets and support libraries (pynwb and matnwb)

Proposal: Store a summary of every test run in a dedicated log file #82

Open jwodder opened 1 month ago

jwodder commented 1 month ago

Proposal: Make the check and test-files subcommands of dandisets-healthstatus append one summary line per test run to one or more record file(s). Unlike entries in the status.yaml files, these entries are never overwritten or removed.

The record file(s) will be stored in the same repository used for #72. Possible paths for the file(s) include:

The record file(s) will be in JSON Lines format, with each line being a JSON object containing the following fields:

This will allow looking back through records of past test runs (e.g., to compare past and present runtimes) without having to muck about with Git history.
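As a rough illustration of the append-only idea, here is a minimal sketch in Python. The field list from the proposal is not reproduced in this thread, so the field names below (`asset`, `test`, `outcome`, `elapsed`) and the helper names are assumptions made purely for illustration, not the project's actual schema or API.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path


def make_record(asset: str, test: str, outcome: str, elapsed: float) -> dict:
    """Build one summary record (hypothetical fields, not the issue's list)."""
    return {
        # Full ISO 8601 datetime stamp, in UTC
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "asset": asset,
        "test": test,
        "outcome": outcome,
        "elapsed": elapsed,
    }


def append_record(path: Path, record: dict) -> None:
    """Append a record as a single JSON line; existing lines are never touched."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fp:
        fp.write(json.dumps(record, sort_keys=True) + "\n")


def read_records(path: Path) -> list[dict]:
    """Load all records from a JSON Lines file, skipping blank lines."""
    with path.open(encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]
```

With something like this, comparing past and present runtimes becomes a matter of reading the file and filtering on the asset and test of interest, with no need to walk the Git history.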

(Compare #83, which proposes storing the above information in the status.yaml files.)

@yarikoptic: Your thoughts?


Additional items added based on discussion below:

yarikoptic commented 1 month ago

Thank you for this and #83 -- I love the idea of storing more metadata about test runs and the properties of the underlying assets/blobs. I think I lean more toward this over #83.

jwodder commented 1 month ago

@yarikoptic

I think I lean more toward this over https://github.com/dandi/dandisets-healthstatus/issues/83

So should #83 be closed as "not planned"?

Maybe status.yaml could just "duplicate" the outcome field to mitigate.

If you'd like that, please file a dedicated issue.

What if we add an untested status and add a record for an asset when a new asset is detected but has not yet been scheduled for testing in the run, etc.?

Currently, if dandisets-healthstatus detects an asset that exists on the Archive but not in status.yaml, the asset won't be added to status.yaml unless it's selected for testing. Thus, if we emit a record for new assets, the same records would be emitted every time the program is run until the respective assets are finally tested.

With code separation (as discussed in #72), keeping only summary/extract statuses in a repo sounds odd, but it would allow anyone to clone it quickly. So I still think we might want to keep such a detailed log and the outputs (#72) in a separate repo, which we might "forget" etc., and just link it as a submodule.

To be clear, here you're just agreeing with me that we should have one repository that stores the code and a separate repository that stores all of the status.yaml files, issue 72 output records, and issue 82 event logs, correct? (However, I think making the logs+records repository a submodule of the code repository — if that's the direction you're suggesting — isn't that great of an idea, as the code isn't dependent on the logs, and updating the code repository whenever there's a commit to the logs repo would just introduce a lot of churn.)

I wonder if we should also get a single top-level environments.jsonl ... and then refer to that env_id in the test run record

Would this env_id also be used to replace the versions mappings in status.yaml?

In the above, I assume timestamp is more of a datetime stamp, i.e., it would be a full-fledged ISO datetime stamp, right?

Yes.

jwodder commented 1 month ago

@yarikoptic Ping.

yarikoptic commented 1 month ago

Maybe status.yaml could just "duplicate" the outcome field to mitigate.

If you'd like that, please file a dedicated issue.

I see we already have some untested status, but it is not test-specific (e.g. here). We can add records with status untested (relates to #86) and provide untested for each test as soon as we encounter a new asset.

With code separation (as discussed in #72), keeping only summary/extract statuses in a repo sounds odd, but it would allow anyone to clone it quickly. So I still think we might want to keep such a detailed log and the outputs (#72) in a separate repo, which we might "forget" etc., and just link it as a submodule.

To be clear, here you're just agreeing with me that we should have one repository that stores the code and a separate repository that stores all of the status.yaml files, issue 72 output records, and issue 82 event logs, correct?

I meant to have a separate repo only for the (heavy) logs (this #82), as a submodule of the (lean) repo holding the status.yaml files. If you would like code/ to be a separate repo (and submodule), it could also be separated out.

(However, I think making the logs+records repository a submodule of the code repository — if that's the direction you're suggesting — isn't that great of an idea, as the code isn't dependent on the logs, and updating the code repository whenever there's a commit to the logs repo would just introduce a lot of churn.)

I had in mind making logs/ a submodule of this dandisets-healthstatus repo, which contains the status.yaml files. code/ could also be a submodule if so desired, but not the other way around (no logs submodule within code).

I wonder if we should also get a single top-level environments.jsonl ... and then refer to that env_id in the test run record

Would this env_id also be used to replace the versions mappings in status.yaml?

Yes, I think it could. Hopefully environments would not change too often, which would complicate things... Maybe the environment ID could be a composition of versions + date or the like, e.g.

20240725-092554_hdmf-3.14.2_matnwb-2.6.0.2_pynwb-2.8.1

This way we could immediately see which versions of the main libraries of interest were in use, while the actual record would provide the gorier details.
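One way such an ID could be composed is sketched below. The function name `make_env_id` and the choice to sort the library names alphabetically are assumptions for illustration; the thread only gives the example string, not a construction rule.

```python
from __future__ import annotations

from datetime import datetime


def make_env_id(when: datetime, versions: dict[str, str]) -> str:
    """Compose an environment ID from a datetime and library versions.

    Hypothetical helper: the date comes first, then "name-version" pairs
    sorted by library name and joined with underscores.
    """
    stamp = when.strftime("%Y%m%d-%H%M%S")
    parts = [f"{name}-{versions[name]}" for name in sorted(versions)]
    return "_".join([stamp, *parts])


env_id = make_env_id(
    datetime(2024, 7, 25, 9, 25, 54),
    {"pynwb": "2.8.1", "hdmf": "3.14.2", "matnwb": "2.6.0.2"},
)
print(env_id)  # 20240725-092554_hdmf-3.14.2_matnwb-2.6.0.2_pynwb-2.8.1
```

Sorting the names keeps the ID stable regardless of the order in which versions were collected.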

jwodder commented 1 month ago

@yarikoptic

I meant to have a separate repo only for the (heavy) logs (this https://github.com/dandi/dandisets-healthstatus/issues/82), as a submodule of the (lean) repo holding the status.yaml files.

Do you mean that the event logs should be in a separate submodule from that used for the issue 72 output logs? Which one gets the logs/ name?

yarikoptic commented 1 month ago
jwodder commented 1 month ago

@yarikoptic By "event logs," I mean the JSON Lines files storing a summary of each test run as described by this issue. Where do you want those files to be stored?

jwodder commented 1 month ago

@yarikoptic Ping.

yarikoptic commented 1 month ago

OK, what about keeping records.jsonl in this repo next to status.yaml (with status.yaml rendered from records.jsonl)? Then outputs/ would be a submodule that collects the stdout/stderr from running each test, corresponding to the records in records.jsonl in this repo.

jwodder commented 1 month ago

@yarikoptic

yarikoptic commented 1 week ago
  • You stated above that you preferred dividing the event logs up by year, so should records.jsonl in your last comment be replaced by records/{year}.jsonl?

Sounds good to me if we are to go per year.
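For the per-year layout, the target file can be derived from each record's timestamp. A minimal sketch, assuming the timestamp is an ISO 8601 string with an explicit offset; the helper name `records_path` and the `base` argument are hypothetical:

```python
from datetime import datetime
from pathlib import Path


def records_path(base: Path, timestamp: str) -> Path:
    """Return base/records/{year}.jsonl for an ISO 8601 timestamp string."""
    # The year is taken as written in the timestamp's own offset, so
    # records should use one consistent timezone (e.g. UTC) throughout.
    year = datetime.fromisoformat(timestamp).year
    return base / "records" / f"{year}.jsonl"
```

For example, `records_path(Path("repo"), "2024-07-25T09:25:54+00:00")` resolves to `repo/records/2024.jsonl`.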