LUMC / pytest-workflow

Configure workflow/pipeline tests using yaml files.
https://pytest-workflow.readthedocs.io/en/stable/
GNU Affero General Public License v3.0

Add a pytest-workflow generate command line function #193

Open rhpvorderman opened 5 months ago

rhpvorderman commented 5 months ago

Command line invocation:

pytest-workflow-generate tests/test_bla.yml my_name command --flag --another-flag settings/settings.json myworkflow.format

The first argument is the test file to generate, the second is the test name, and all remaining arguments are treated as the command.

Resulting yaml:

- name: my_name 
  command: "command --flag --another-flag settings/settings.json myworkflow.format"
  files:
    - path: relative/to/workflow/output/dir/my_file.txt
      md5sum: "abcdef0123456789"

Etc.

This is especially useful for generating all the file paths. stdout and stderr are omitted as these contain timestamps and dates.
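The file-path and md5sum generation described above could be sketched roughly as follows. This is a hypothetical helper, not part of pytest-workflow; the function name and signature are illustrative only:

```python
import hashlib
import os


def generate_file_entries(output_dir, with_md5=True):
    """Walk a workflow output directory and build the 'files' entries
    for a pytest-workflow test, with paths relative to the output dir."""
    entries = []
    for root, _dirs, names in os.walk(output_dir):
        for name in sorted(names):
            path = os.path.join(root, name)
            entry = {"path": os.path.relpath(path, output_dir)}
            if with_md5:
                with open(path, "rb") as handle:
                    entry["md5sum"] = hashlib.md5(handle.read()).hexdigest()
            entries.append(entry)
    return entries
```

The resulting list of dicts could then be dumped under the test's `files:` key with a YAML serializer.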

This feature is inspired by nf-test's snapshot function.

@DavyCats, @Redmar-van-den-Berg what do you think of this?

Redmar-van-den-Berg commented 5 months ago

It would be neat to have a way to generate tests, or at least the file paths that are produced. However, I'm not a big fan of testing for the checksum of the output files, since it is impossible to tell what went wrong when it changes.

How about adding a "contains" for the first line of the file? That way you can also include a test for stderr and stdout.

rhpvorderman commented 5 months ago

> However, I'm not a big fan of testing for the checksum of the output files, since it is impossible to tell what went wrong when it changes.

I agree. However, it is trivial to delete the md5sums afterwards if you don't need them. If you want a bit-for-bit reproducible workflow it is quite useful that this work is already done.

So, this should be a CLI option? pytest-workflow-generate --md5sum will get you all the md5sums? That sounds like an excellent idea.

> How about adding a "contains" for the first line of the file? That way you can also include a test for stderr and stdout.

That would be ##fileformat=VCFv4.4 or something, and for cutadapt: This is cutadapt 4.2. Not very informative, and also prone to breaking tests when program versions are upgraded. I like the idea of automating some of the tediousness out of creating contains tests, but it is really hard to come up with a good universal criterion.

Redmar-van-den-Berg commented 5 months ago

Ideally, we would only generate tests that we know will pass, but of course we cannot even know that the same files will be generated when you run the command a second time. Although I think it is sensible to assume that the paths will be the same, and those are also the most annoying to type in manually.

Additionally, we can let the user specify whether they want additional tests on the files, such as --md5sum, --contains, or --contains-regex.
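A minimal sketch of what such a command-line interface might look like, assuming an argparse-based CLI; the command name, flags, and argument layout are all still under discussion in this issue and are hypothetical here:

```python
import argparse


def build_parser():
    # Hypothetical parser for the proposed pytest-workflow-generate command.
    parser = argparse.ArgumentParser(prog="pytest-workflow-generate")
    parser.add_argument("test_file", help="YAML file to write the test to")
    parser.add_argument("name", help="name of the generated test")
    # REMAINDER captures the workflow command and all of its own flags.
    parser.add_argument("command", nargs=argparse.REMAINDER,
                        help="workflow command and its arguments")
    parser.add_argument("--md5sum", action="store_true",
                        help="record an md5sum for every output file")
    parser.add_argument("--contains", action="store_true",
                        help="record a 'contains' test per output file")
    return parser
```

With this layout, generator options go before the positionals, e.g. `pytest-workflow-generate --md5sum tests/test_bla.yml my_name command --flag`, so the workflow command's own flags are not swallowed by the generator.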