databio / pypiper

Python toolkit for building restartable pipelines
http://pypiper.databio.org
BSD 2-Clause "Simplified" License

KeyError: 'Time' when using pipestat via pypiper #207

Open nsheff opened 6 months ago

nsheff commented 6 months ago

When I'm trying to switch from a normal pypiper pipeline to one that configures pipestat, I'm getting this error:

Traceback (most recent call last):
  File "/home/nsheff/code/seqcolapi/analysis/pipeline/add_to_seqcol_server.py", line 92, in <module>
    pm.stop_pipeline()
  File "/home/nsheff/.local/lib/python3.11/site-packages/pypiper/manager.py", line 2106, in stop_pipeline
    self.report_result("Time", elapsed_time_this_run, nolog=True)
  File "/home/nsheff/.local/lib/python3.11/site-packages/pypiper/manager.py", line 1616, in report_result
    reported_result = self.pipestat.report(
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nsheff/.local/lib/python3.11/site-packages/pipestat/pipestat.py", line 99, in inner
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nsheff/.local/lib/python3.11/site-packages/pipestat/pipestat.py", line 571, in report
    schema=self.result_schemas[r],
           ~~~~~~~~~~~~~~~~~~~^^^
KeyError: 'Time'

I can't trace this, because I'm not doing anything related to Time, so it must be coming from pypiper or pipestat somehow.

nsheff commented 6 months ago

One hint is this message:

These results exist for 'DEFAULT_SAMPLE_NAME': Time
These results exist for 'DEFAULT_SAMPLE_NAME': Success

It looks like there might be a bug somewhere with a constant that is getting stored as a string instead.

nsheff commented 6 months ago

I think pipestat_sample_name is not being passed through to pipestat

nsheff commented 6 months ago

Actually, I think it's pipestat_results_file that's not working correctly...

nsheff commented 6 months ago

I figured it out.

Pypiper automatically reports results for Time and Success. If those keys aren't defined in your output schema, reporting them fails with the KeyError above. So you have to add this to the output schema:

  Time:
    type: "string"
    description: "Elapsed time for the pipeline run as reported by pypiper"
  Success:
    type: "string"
    description: "Timestamp for when the pipeline completed"

I think this is suboptimal: I'm not the one reporting those results, they're automatic. Maybe pypiper should be the one adding them to the output schema, since it's the one reporting them.
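A minimal sketch of that idea, in plain Python with no pypiper or pipestat calls. It assumes the parsed schema behaves like a flat mapping from result key to its spec, which is a simplification of the real schema format. The names `PYPIPER_AUTO_RESULTS`, `with_auto_results`, and the `n_sequences` result are hypothetical and exist only for illustration:

```python
from copy import deepcopy

# Entries pypiper reports on its own, copied from the schema fragment above.
# The constant name is hypothetical, not something pypiper defines.
PYPIPER_AUTO_RESULTS = {
    "Time": {
        "type": "string",
        "description": "Elapsed time for the pipeline run as reported by pypiper",
    },
    "Success": {
        "type": "string",
        "description": "Timestamp for when the pipeline completed",
    },
}


def with_auto_results(result_schemas):
    """Return a copy of the schema mapping with Time and Success filled in.

    Hypothetical helper: entries the user already defined are left untouched.
    """
    merged = deepcopy(result_schemas)
    for key, spec in PYPIPER_AUTO_RESULTS.items():
        merged.setdefault(key, deepcopy(spec))
    return merged


# A user schema that says nothing about Time or Success.
user_schema = {
    "n_sequences": {"type": "integer", "description": "hypothetical result"},
}

try:
    user_schema["Time"]  # same kind of lookup as self.result_schemas[r]
except KeyError as err:
    print(f"without defaults: KeyError {err}")

merged = with_auto_results(user_schema)
print(sorted(merged))
```

Using `setdefault` means a pipeline author who already declared Time or Success keeps their own definition, and the original mapping is not modified.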

nsheff commented 6 months ago

I made a more informative error message in pipestat to address this here: https://github.com/pepkit/pipestat/commit/0d511b5960d460b4dda701379f6a982e3f407a0c

This at least solves the immediate issue, but going forward:

donaldcampbelljr commented 6 months ago

I also confirmed this by passing an output schema to the PipelineManager in the test_pipeline_manager.py test (I was initially surprised our tests didn't catch this):

        self.pp = pypiper.PipelineManager(
            "sample_pipeline",
            outfolder=self.OUTFOLDER,
            multi=True,
            pipestat_schema="/home/drc/GITHUB/pypiper/pypiper/tests/Data/sample_output_schema.yaml",
        )

It does indeed fail with a KeyError:

tests/pipeline_manager/test_pipeline_manager.py::PipelineManagerTests::test_me - KeyError: 'Time'