broadinstitute / palantir-workflows

Utility workflows for the DSP hydro.gen team (formerly palantir)
BSD 3-Clause "New" or "Revised" License

Add MatchFingerprints utility WDL #156

Closed rickymagner closed 12 months ago

rickymagner commented 12 months ago

This PR adds a utility for handling various applications of fingerprinting. In particular, it should have functionality to perform the following:

It should be simple to drop into WDLs that require matched files for the same samples (such as benchmarking query data against truth data). Use the task's resulting matched_pairs output to ensure you only act on fingerprint-matched pairs of files. See the README edits for more details.
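To illustrate the idea of acting only on fingerprint-matched pairs, here is a rough Python sketch of the selection logic. It is not the code in this PR. It assumes matching is decided by a per-pair LOD score (as fingerprint crosscheck tools typically report), and the function name `select_matched_pairs`, the file names, the scores, and the threshold are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical input: LOD scores keyed by (query_file, truth_file).
# A positive LOD suggests the two files come from the same sample.
LodTable = Dict[Tuple[str, str], float]


def select_matched_pairs(lod_scores: LodTable, threshold: float = 0.0) -> List[Tuple[str, str]]:
    """For each query file, keep its best-scoring truth file if the LOD exceeds threshold."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, truth), lod in lod_scores.items():
        if lod > threshold and (query not in best or lod > best[query][1]):
            best[query] = (truth, lod)
    # Queries with no truth file above the threshold are dropped entirely.
    return sorted((query, truth) for query, (truth, _) in best.items())


# Made-up scores for illustration only.
scores: LodTable = {
    ("query_A.vcf.gz", "truth_A.vcf.gz"): 35.2,
    ("query_A.vcf.gz", "truth_B.vcf.gz"): -48.0,
    ("query_B.vcf.gz", "truth_A.vcf.gz"): -51.3,
    ("query_B.vcf.gz", "truth_B.vcf.gz"): 29.7,
    ("query_C.vcf.gz", "truth_A.vcf.gz"): -40.1,
    ("query_C.vcf.gz", "truth_B.vcf.gz"): -44.9,
}

matched_pairs = select_matched_pairs(scores)
for query, truth in matched_pairs:
    # Downstream work (e.g. benchmarking) runs only on matched pairs.
    print(f"benchmark {query} against {truth}")
```

In this sketch, query_C matches neither truth file, so it is left out of the downstream step rather than being benchmarked against the wrong sample.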

rickymagner commented 12 months ago

Just made some changes to simplify the code and to require index files. I tried making index files optional, but one particular implementation ran into a Cromwell bug (issue opened in their repo), so I abandoned that for now.

There has been some ongoing discussion about a possible Terra bug where GATK cannot stream from a requester-pays bucket. This matters for this workflow because it is meant to be used for fingerprinting samples against our NIST requester-pays mirror. I'll follow up once the Terra support ticket resolves, which may or may not require changes to this code. For now, this workflow should work fine streaming files from normal (non-requester-pays) GCP buckets.

rickymagner commented 12 months ago

Good catch. Just updated the dockstore yml now