DataBiosphere / analysis_pipeline_WDL

Collection of WDL workflows based off the University of Washington TOPMed DCC Best Practices for GWAS. The WDL structure was based upon CWLs written by the Seven Bridges development team.

[king] md5 mismatch for 2/3 outputs #51

Open avani-k opened 3 years ago

avani-k commented 3 years ago

king.wdl has 3 output files:

  1. Output of Task 3: a .seg file containing kinship estimates. This file passes the md5 check.
  2. Output of Task 4: a .RData file containing a matrix of kinship estimates. Task 4 takes the .seg file as input and transforms the data into a symmetric matrix. The row and column names of this matrix are ordered differently in the outputs generated from the WDL and the CWL, so the file fails the md5 check. After the row and column names were sorted alphabetically, the two outputs were found to be identical. Previous issues have shown that the configuration of a run can affect the ordering of samples within files.
  3. Output of Task 5: a .pdf file containing a graph of the kinship estimates. Visual inspection shows the WDL and CWL files to be identical, but because of the file format we always expect the md5 check to fail.
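
To illustrate why the Task 4 output fails md5 even though the data agree, here is a minimal Python sketch. It is not the pipeline's code: the sample names, the kinship values, and the helpers `md5_of_matrix` and `sort_by_name` are all made up for illustration, and the real output is an R matrix inside a .RData file, not a tab-separated string.

```python
import hashlib


def md5_of_matrix(names, matrix):
    """Hash a labelled matrix exactly as stored, so the result depends on order."""
    lines = ["\t".join(names)]
    for name, row in zip(names, matrix):
        lines.append(name + "\t" + "\t".join(repr(v) for v in row))
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


def sort_by_name(names, matrix):
    """Reorder rows and columns so the sample names are alphabetical."""
    order = sorted(range(len(names)), key=lambda i: names[i])
    sorted_names = [names[i] for i in order]
    sorted_matrix = [[matrix[i][j] for j in order] for i in order]
    return sorted_names, sorted_matrix


# Hypothetical kinship estimates for three invented samples.
names_a = ["S1", "S2", "S3"]
matrix_a = [
    [0.5, 0.25, 0.0],
    [0.25, 0.5, 0.125],
    [0.0, 0.125, 0.5],
]

# The same estimates with the samples in a different order (S3, S1, S2).
names_b = ["S3", "S1", "S2"]
matrix_b = [
    [0.5, 0.0, 0.125],
    [0.0, 0.5, 0.25],
    [0.125, 0.25, 0.5],
]

# Hashing the matrices as stored: the differing order changes the digest.
raw_match = md5_of_matrix(names_a, matrix_a) == md5_of_matrix(names_b, matrix_b)

# Hashing after sorting by sample name: the digests agree.
sorted_match = (
    md5_of_matrix(*sort_by_name(names_a, matrix_a))
    == md5_of_matrix(*sort_by_name(names_b, matrix_b))
)

print("md5 match as stored:", raw_match)
print("md5 match after sorting:", sorted_match)
```

A checker built on this idea would need to load the matrix and normalize the sample order before hashing, rather than hashing the file bytes directly.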

king-checker.wdl will md5 check ONLY the .seg file (first output).

aofarrel commented 2 years ago

Not urgent, but it would be worth investigating if task 4's outputs pass all.equal() at default tolerance.
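
For readers unfamiliar with R, the following Python sketch approximates the kind of comparison `all.equal()` makes on numeric data: it checks the mean relative difference against a tolerance of about 1.5e-8 (R's documented default). This is only a rough analogue under stated assumptions, not R's implementation; the helper names and the example values are hypothetical, and the two matrices are assumed to already share the same sample order.

```python
# Approximately R's default all.equal() tolerance, sqrt(.Machine$double.eps).
TOLERANCE = 1.5e-8


def mean_relative_difference(target, current):
    """Sum of absolute differences, scaled by the sum of absolute target values."""
    flat_target = [v for row in target for v in row]
    flat_current = [v for row in current for v in row]
    if len(flat_target) != len(flat_current):
        raise ValueError("matrices have different sizes")
    if not flat_target:
        return 0.0
    abs_diff = sum(abs(t - c) for t, c in zip(flat_target, flat_current))
    scale = sum(abs(t) for t in flat_target)
    # Fall back to the mean absolute difference when the target is all zeros.
    return abs_diff / scale if scale > 0 else abs_diff / len(flat_target)


def roughly_equal(target, current, tolerance=TOLERANCE):
    """True when the matrices agree to within the given tolerance."""
    return mean_relative_difference(target, current) <= tolerance


# Hypothetical values: one pair differs only by floating-point noise,
# the other by a real change in a kinship estimate.
wdl_matrix = [[0.5, 0.25], [0.25, 0.5]]
cwl_noise = [[0.5, 0.25 + 1e-12], [0.25 + 1e-12, 0.5]]
cwl_changed = [[0.5, 0.26], [0.26, 0.5]]

print(roughly_equal(wdl_matrix, cwl_noise))
print(roughly_equal(wdl_matrix, cwl_changed))
```

Unlike an md5 check, a comparison of this kind tolerates tiny floating-point differences between runs, which is why it may be a better fit for the Task 4 output.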