chanzuckerberg / idseq-workflows

Portable WDL workflows for IDseq production pipelines
https://idseq.net/
MIT License

fix issue with clusters in phylotree #156

Closed morsecodist closed 2 years ago

morsecodist commented 2 years ago

😲 not sure how I missed this...

What happened?

This was a confluence of errors that combined in a way that made each of them hard to notice.

First, I want to get the first cluster file, which contains one file name per line. I want to move each of those files into a new directory that holds only the files in the cluster. Here is where I made my first mistake: I listed cluster_files to get the first file, but then I ran cat on the bare file name without the directory name:

CLUSTER_FILE=$(ls cluster_files | head -n 1)
mkdir cluster
for hash in `cat $CLUSTER_FILE`
do

OK, so this cat failed every time. Why didn't we notice? Well, I forgot to add set -euxo pipefail, so the script just continued on.

But the cluster directory is empty, so how could we be using it? It turns out we weren't... I was just using the entire ska_hashes directory:
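For illustration, here is a minimal sketch of the path fix in a scratch directory. The file and sample names are made up, and this is not the code in the PR. One caveat worth noting: in bash, a failed command substitution in a `for` word list does not trip `set -e` by itself, so the sketch assigns the output to a variable first, which does abort on failure.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch layout; the file and sample names are invented for this sketch.
workdir=$(mktemp -d)
cd "$workdir"
mkdir cluster_files
printf 'sample_a\nsample_b\n' > cluster_files/cluster_0

# `ls DIR` prints bare names, so re-attach the directory before reading.
first_name=$(ls cluster_files | head -n 1)
CLUSTER_FILE="cluster_files/$first_name"

# Assigning the output first means a failed cat aborts the script under `set -e`.
hashes=$(cat "$CLUSTER_FILE")

count=0
for hash in $hashes; do
  count=$((count + 1))
done
echo "read $count entries from $CLUSTER_FILE"

# For contrast: with the bare name the cat fails, but the loop form swallows
# the failure even in strict mode, and the script carries on.
swallowed=$(bash -c 'set -euo pipefail; for h in $(cat "$1" 2>/dev/null); do :; done; echo continued' _ "$first_name")
echo "loop form with bare name: $swallowed"
```

The assignment form is a design choice for the sketch: it keeps the loop body unchanged while making a missing file fatal.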

ska distance -o ska ska_hashes/*.skf
ska merge -o ska.merged ska_hashes/*.skf

This uses the hashes directory, which contains all of the hashes from all of the clusters. Since we do not produce a tree when there is more than one cluster, this is actually fine; the for loop is a carryover from when we were producing a tree for each cluster. I can remove it completely and just use ska_hashes directly.

Also, we decided that having too many clusters is no longer an error. I removed the erroring check from the previous distances step, but I still need to add the cluster check to this step. The check must now happen in this step because the previous step will succeed and produce output even if there is more than one cluster.
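A cluster check of this kind could look roughly like the sketch below. It assumes one file per cluster in cluster_files; `count_clusters` and `build_tree` are hypothetical names for illustration, not the ones used in the workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: counts the regular files directly inside a directory.
count_clusters() {
  find "$1" -mindepth 1 -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Scratch layout with two made-up cluster files.
workdir=$(mktemp -d)
mkdir "$workdir/cluster_files"
touch "$workdir/cluster_files/cluster_0" "$workdir/cluster_files/cluster_1"

n_clusters=$(count_clusters "$workdir/cluster_files")
if [ "$n_clusters" -gt 1 ]; then
  # More than one cluster: skip tree building instead of failing the step.
  build_tree=false
else
  build_tree=true
fi
echo "clusters=$n_clusters build_tree=$build_tree"
```

Running the check in the same step as tree building means it does not depend on the previous step failing.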

So I did all this and it failed. It turns out iqtree exits with a non-zero exit code even though the test asserts that it produced a valid tree. So I had to remove -e from the shell settings.
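An alternative to dropping -e for the whole script, which is not what this PR does, would be to tolerate the non-zero exit for that one command and judge success by the output file. A sketch under that assumption, where `flaky_tree_builder` is a made-up stand-in and not iqtree:

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Stand-in for a tool that writes its output but still exits non-zero.
flaky_tree_builder() {
  printf '(a,b);\n' > "$1"
  return 1
}

# `|| status=$?` records the failure without tripping `set -e`.
status=0
flaky_tree_builder "$workdir/tree.nwk" || status=$?

# Judge success by the output file instead of the exit code.
if [ -s "$workdir/tree.nwk" ]; then
  tree_ok=true
else
  tree_ok=false
fi
echo "exit status=$status tree_ok=$tree_ok"
```

This keeps strict mode on for every other command in the script, at the cost of one extra check per tolerated command.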