DataBiosphere / analysis_pipeline_WDL

Collection of WDL workflows based off the University of Washington TOPMed DCC Best Practices for GWAS. The WDL structure was based upon CWLs written by the Seven Bridges development team.

Association Aggregate #57

Closed aofarrel closed 2 years ago

aofarrel commented 3 years ago

This is not ready for a release, as it lacks a checker workflow, but it's complicated enough that it should at least be quickly reviewed in its current form.

aofarrel commented 2 years ago

The checker has revealed a potentially large issue. I misunderstood the implications of sbg_group_segments_1's output structures.

The CWL's return JSON for sbg_group_segments_1 looks like this:

[Screenshot: 2021-10-25, 2:27:22 PM]

This means that the next task (assoc_combine_r) scatters on the top level of grouped_assoc_files, i.e., assoc_combine_r will be scattered into two tasks if you are running on two chromosomes. assoc_combine_r then returns one combined file per chromosome.

[Screenshot: 2021-10-25, 2:37:49 PM]

This is not how the WDL currently works. In the WDL, assoc_combine_r instead scatters once per segment, so it returns one "combined" file per segment rather than one per chromosome.

[Screenshot: 2021-10-25, 2:31:00 PM]
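
To make the difference concrete, here is a minimal Python sketch of the two scatter behaviors described above. It is not the pipeline's code: the file names and the exact shape of grouped_assoc_files are assumptions for illustration, and the two helper functions are hypothetical.

```python
# Hypothetical file names and nesting; the real pipeline's output may differ.
grouped_assoc_files = [
    ["assoc_chr1_seg1.RData", "assoc_chr1_seg2.RData"],  # chromosome 1
    ["assoc_chr2_seg3.RData"],                           # chromosome 2
]


def scatter_top_level(nested):
    """One job per inner list (per chromosome), as described for the CWL."""
    return [list(group) for group in nested]


def scatter_per_segment(nested):
    """One job per file (per segment), as described for the current WDL."""
    return [[path] for group in nested for path in group]


cwl_jobs = scatter_top_level(grouped_assoc_files)
wdl_jobs = scatter_per_segment(grouped_assoc_files)

# Each job yields one "combined" file, so the job count is the output count.
print(len(cwl_jobs), "combined files when scattering per chromosome")
print(len(wdl_jobs), "combined files when scattering per segment")
```

With two chromosomes and three segments, the first approach yields two combined files and the second yields three, which is the mismatch the checker surfaced.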

There seem to be three possible ways to resolve this in the WDL:

  1. Combine the output manually at the very end, by having either a task at the end of assoc_combine_r or the plotting task do the combining
  2. Attempt to make the WDL version of sbg_group_segments_1 work more like the CWL version in hopes of it giving the expected output, which may also require implementing the preceding sbg_flatten_lists (which is supposedly just a relic of older CWL versions)
  3. Combine the output of sbg_group_segments_1 in a new task

Number 1 does not sit well with me, as it means previous tasks receive the wrong input, but the plotting step does actually seem consistent across Terra and SB as-is, so it could be valid. Number 2 would be ideal, although it didn't work last time. Number 3 would complicate the flow even further.
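
For reference, the regrouping that a new task (number 3) would have to do could look roughly like the Python sketch below. This is only an illustration under assumptions: the function name group_by_chromosome is hypothetical, and it assumes each file name contains a chrN token, which may not match the real naming scheme.

```python
import re
from collections import defaultdict

# Assumption: file names embed the chromosome as "chr<number>", "chrX" or "chrY".
CHR_PATTERN = re.compile(r"chr(\d+|X|Y)")


def group_by_chromosome(files):
    """Turn a flat list of per-segment files into one list per chromosome."""
    groups = defaultdict(list)
    for name in files:
        match = CHR_PATTERN.search(name)
        if match is None:
            raise ValueError(f"no chromosome found in file name: {name}")
        groups[match.group(1)].append(name)
    # Dicts keep insertion order, so chromosomes appear in first-seen order.
    return list(groups.values())


# Hypothetical per-segment files, deliberately out of order.
flat = ["assoc_chr1_seg1.RData", "assoc_chr2_seg3.RData", "assoc_chr1_seg2.RData"]
regrouped = group_by_chromosome(flat)
print(regrouped)
```

Scattering assoc_combine_r over a structure like regrouped would give one combined file per chromosome, at the cost of an extra step in the flow.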