genome / analysis-workflows

Open workflow definitions for genomic analysis from MGI at WUSM.
MIT License

handle unparseable GT entries by skipping those lines #1060

Closed johnmaruska closed 2 years ago

johnmaruska commented 2 years ago

Sometimes we encounter entries whose $gt_str contains characters that are not in the $ids map. This produces malformed GT entries such as /0, 1/, or /. When these entries occur, downstream calls like merge_vcf (in the case of the immuno workflow) print an error but then keep running instead of failing. I have no idea why they do this, but we noticed it while testing the conversion to WDL with the HCC1395 sample.

We address this issue by checking the contents of @gt_ids: if any element is undefined, we return early with undefined. At the call site we handle this case separately by printing a message to STDERR, dropping that line, and moving on to the remaining lines.
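To make the control flow concrete, here is a minimal sketch of the same idea in Python rather than the script's own language. It is not the code from this PR: the function names convert_gt and filter_lines, the tab-separated line layout, the gt_column parameter, and the shape of the ids mapping are all assumptions made for illustration.

```python
import sys
from typing import Dict, Iterable, List, Optional


def convert_gt(gt_str: str, ids: Dict[str, str]) -> Optional[str]:
    """Translate each allele in gt_str through ids (hypothetical helper).

    Returns None if any allele has no entry in ids, instead of emitting
    a malformed genotype such as "/0", "1/" or "/".
    """
    sep = "|" if "|" in gt_str else "/"
    gt_ids = [ids.get(allele) for allele in gt_str.split(sep)]
    if any(gt_id is None for gt_id in gt_ids):
        return None  # early return: at least one allele could not be mapped
    return sep.join(gt_ids)


def filter_lines(lines: Iterable[str], ids: Dict[str, str],
                 gt_column: int = 0, log=sys.stderr) -> List[str]:
    """Rewrite the GT field of each line, dropping lines that cannot be parsed.

    Assumes tab-separated lines with the genotype in gt_column; this layout
    is an illustration only, not the real record format.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        new_gt = convert_gt(fields[gt_column], ids)
        if new_gt is None:
            # Call site: report the problem, drop this line, keep going.
            print(f"skipping line with unparseable GT: {line.rstrip()}", file=log)
            continue
        fields[gt_column] = new_gt
        kept.append("\t".join(fields))
    return kept
```

With an assumed mapping of {"0": "0", "1": "2"}, a line whose genotype is 0/1 is rewritten to 0/2, while a line whose genotype is ./. is reported on STDERR and omitted from the output.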

A stricter alternative would be to fail outright, but after discussion with Chris Miller and Malachi Griffith we decided to just drop the line and move on.