databio / pypiper

Python toolkit for building restartable pipelines
http://pypiper.databio.org
BSD 2-Clause "Simplified" License

unifying pipestat config with pipestat constructor #206

Open nsheff opened 6 months ago

nsheff commented 6 months ago

Right now, the docs suggest configuring pipestat via pypiper like this:

pm = pypiper.PipelineManager(
  ...,
  pipestat_schema="custom_results_schema.yaml",
  pipestat_results_file="custom_results_file.yaml",
  pipestat_sample_name="my_record",
  pipestat_project_name="my_namespace",
  pipestat_config="custom_pipestat_config.yaml",
) 

Meanwhile, pipestat itself is configured like this:

psm = pipestat.PipestatManager(
    record_identifier="sample1",
    results_file_path=temp_file,
    schema_path="../tests/data/sample_output_schema.yaml",
)

I would like these to be uniform. So, I want to do:


pipestat_config = {
    "record_identifier": sample["sample_name"],
    "schema_path": "pipeline/output_schema.yaml",
    "results_file_path": "results.yaml",
    "pipeline_type": "sample",
}

And use this for either, like:

psm = pipestat.PipestatManager(**pipestat_config)

or:

pm = pypiper.PipelineManager(
  ...,  # pypiper options
  pipestat_config=pipestat_config,
)

This way, PipelineManager takes a single pipestat argument: a dict of pipestat config options, which pypiper passes to the PipestatManager constructor as **kwargs. This seems cleaner than specifying a separate argument for each pipestat config option. It also keeps the option names in sync. Right now they're out of sync: pypiper wants pipestat_sample_name, which it then passes to pipestat as record_identifier. So it would eliminate the need to maintain a parallel set of pypiper argument names.
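A minimal sketch of the pass-through idea, not pypiper's actual implementation. FakePipestatManager and SketchPipelineManager are hypothetical stand-ins (so the snippet runs without pypiper or pipestat installed), and "sample1" and "my_pipeline" are placeholder values:

```python
class FakePipestatManager:
    """Hypothetical stand-in for pipestat.PipestatManager.

    It only records the keyword arguments it was constructed with.
    """

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class SketchPipelineManager:
    """Hypothetical pypiper-like manager taking one pipestat_config dict."""

    def __init__(self, name, pipestat_config=None,
                 pipestat_factory=FakePipestatManager):
        self.name = name
        # Copy the dict so later changes by the caller don't leak in.
        self.pipestat_config = dict(pipestat_config or {})
        # Forward every option unchanged: no renaming layer to maintain.
        self.pipestat = pipestat_factory(**self.pipestat_config)


pipestat_config = {
    "record_identifier": "sample1",
    "schema_path": "pipeline/output_schema.yaml",
    "results_file_path": "results.yaml",
    "pipeline_type": "sample",
}

# The same dict configures either object.
psm = FakePipestatManager(**pipestat_config)
pm = SketchPipelineManager("my_pipeline", pipestat_config=pipestat_config)
```

Because the options are forwarded unchanged, any keyword the pipestat constructor accepts would work through pypiper without pypiper needing to know about it.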

nsheff commented 6 months ago

Another issue is that I can't figure out how to map the config options to configure pipestat the way I want it. I don't know what pipestat_project_name maps to, and I don't see how to set the pipeline_type through pypiper.

nsheff commented 6 months ago

Here's another example where this bit me.

I wanted to pass multi_pipelines=True to pipestat when constructing my pypiper.PipelineManager, but this is not documented. The way to do it is to pass multi=True to pypiper, which renames it and passes multi_pipelines=True on to pipestat. I had to read the code itself to figure this out.

This would be easier, and would not require additional documentation, if we instead used pipestat_config and passed its contents through as kwargs.
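If the old argument names had to keep working during a transition, one possible approach is a small shim that folds them into the dict. This is only a sketch: merge_legacy_pipestat_args and LEGACY_TO_PIPESTAT are hypothetical names, and the table contains only the two renames mentioned in this thread (pipestat_sample_name to record_identifier, multi to multi_pipelines). The other pipestat_* arguments are left out because their targets aren't established here.

```python
import warnings

# Only the renames stated in this thread; anything else is unknown here.
LEGACY_TO_PIPESTAT = {
    "pipestat_sample_name": "record_identifier",
    "multi": "multi_pipelines",
}


def merge_legacy_pipestat_args(pipestat_config=None, **legacy):
    """Fold legacy pypiper-style arguments into a pipestat config dict.

    Hypothetical helper. Values given explicitly in pipestat_config
    take precedence over legacy arguments.
    """
    merged = dict(pipestat_config or {})
    for old_name, value in legacy.items():
        if old_name not in LEGACY_TO_PIPESTAT:
            raise TypeError(f"unknown legacy pipestat argument: {old_name}")
        new_name = LEGACY_TO_PIPESTAT[old_name]
        warnings.warn(
            f"{old_name} is deprecated; use pipestat_config[{new_name!r}]",
            DeprecationWarning,
            stacklevel=2,
        )
        # setdefault keeps any value already present in pipestat_config.
        merged.setdefault(new_name, value)
    return merged
```

The result could then be unpacked into the pipestat constructor with **, so the renaming table would live in one place and could be removed once the old names are retired.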