databio / pypiper

Python toolkit for building restartable pipelines
http://pypiper.databio.org
BSD 2-Clause "Simplified" License
45 stars 9 forks source link

Critical bugs with reporting results #209

Open nsheff opened 5 months ago

nsheff commented 5 months ago

I just found 2 critical bugs with pypiper result reporting. Actually it may just be 2 outcomes from the same critical bug.

When you report a result it doesn't overwrite the result if there's already one there. This means there's no way to update something if an incorrect result was entered. In the past, results would be reported as many times as you ran the pipeline. All of these results are recorded. Now, the pipeline only reports the first result. If you re-report that result, it's simply lost.

When I'm re-running a pipeline to correct something, I get these:

These results exist for 'demo4': seqcol_digest
These results exist for 'demo4': Time
These results exist for 'demo4': Success

And they're not updated. So, that's the first problem: reporting of results is limited to a single value now, and multiple reports is not recorded.

The second issue is related to it:

The result is not only not reported to the results.yaml file, it's also not reported to screen! So it doesn't even show up in the log file anymore. Results should be reported to screen every time a result report is requested.

Solutions

The second issue should be easy to solve, just make sure pypiper is always printing the result. I'm not sure if this should happen in pypiper or in pipestat, but somewhere it has to happen.

The first problem is harder to solve, because it's a consequence of the switch from a simple tab-delimited stats.tsv file pypiper used to use, to the pipestat file. In pipestat, results are recorded per sample, not per pipeline run, which means there's going to be issues when you re-run a sample. The best I can think of is this:

  1. In the short term, at the very least, the last reported result should be the one used, not the first reported result, as is currently happening. I guess this is as simple as forcing updates whenever you report an existing result.
  2. In the longer term, pipestat should offer the option to include a history of results, and these should be stored somehow in the file (and database). This may not actually be too hard to implement; just add a 'history' function, and when something is overwritten, just move the old values into the history in a way that is an array, rather than a single value. Then, pipestat could offer a clear history function to remove old stuff, if desired, but otherwise, repeated reports of the same result will simply add to the history.
donaldcampbelljr commented 5 months ago

I believe the first issue is related to https://github.com/databio/pypiper/issues/201 I solved it by adding a force_overwrite flag to both report_result and report_object

donaldcampbelljr commented 4 months ago

I switched force_overwrite to default to True and confirmed that it will overwrite the results and these interactions are saved in log.md and only the most recent is in the results.yaml file (as expected).

import pypiper

pm = pypiper.PipelineManager(name="TEST_OVERWRITING",outfolder="/home/drc/PythonProjects/pipestat/opt_dependencies",
                             pipestat_record_identifier="pypiperRecordIdentifier1", pipestat_schema="sample_output_schema.yaml", 
                             pipestat_pipeline_type="sample", pipestat_results_file="results.yaml")
print("pm made")
pm.report_result(key="number_of_things", value=100)
print("\ndone reporting a result 1st time")
pm.report_result(key="number_of_things", value=200)
print("\ndone reporting a result 2nd time")
pm.report_result(key="number_of_things", value=300)
print("\ndone reporting a result 3rd time")
pm.complete()

Log excerpt:

pm made

> `number_of_things`    100 _RES_

> `pipestat_created_time`   2024-04-04 14:16:54 _RES_

> `pipestat_modified_time`  2024-04-04 14:16:54 _RES_

done reporting a result 1st time
These results exist for 'pypiperRecordIdentifier1': number_of_things
Overwriting existing results: number_of_things

> `number_of_things`    200 _RES_

> `pipestat_modified_time`  2024-04-04 14:16:54 _RES_

done reporting a result 2nd time
These results exist for 'pypiperRecordIdentifier1': number_of_things
Overwriting existing results: number_of_things

> `number_of_things`    300 _RES_

> `pipestat_modified_time`  2024-04-04 14:16:54 _RES_

done reporting a result 3rd time

Results file:

test_pipe:
  project: {}
  sample:
    pypiperRecordIdentifier1:
      number_of_things: 300
      pipestat_created_time: '2024-04-04 14:16:54'
      pipestat_modified_time: '2024-04-04 14:16:54'

Next steps will be to add history enhancements to Pipestat via https://github.com/pepkit/pipestat/issues/177

donaldcampbelljr commented 3 months ago

Closed with 0.14.1 release and pipestats 0.9.0 release.