chanzuckerberg / idseq-workflows

Portable WDL workflows for IDseq production pipelines
https://idseq.net/
MIT License

replace python script with seqkit (Go) #172

Closed rzlim08 closed 2 years ago

rzlim08 commented 2 years ago

This should greatly improve performance. On a 1.4 GB test dataset, the step took 1m26s with seqkit, compared to 43m27s with the previous Python script (roughly a 30x speedup).

I still need to test seqkit's error conditions. @katrinakalantar do you think we may run into any issues with using seqkit? I'm mainly concerned that it may be more restrictive than our previous validation steps.

rzlim08 commented 2 years ago

I added some more robust error handling, including a FASTA check. The gzip files differ slightly from before because seqkit compresses its output in parallel, so the compressed bytes need not match. The decompressed files should be identical to those produced by the previous code.
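One way to check the claim about decompressed output is to compare digests of the decompressed streams rather than of the `.gz` files themselves. The following is a minimal sketch, not code from this PR. The helper names `decompressed_sha256` and `same_decompressed_content` are hypothetical, and it assumes both outputs are ordinary gzip files readable by Python's standard `gzip` module.

```python
import gzip
import hashlib


def decompressed_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a gzip file's decompressed contents.

    Hypothetical helper for illustration; reads in chunks so large
    FASTQ/FASTA files do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as handle:
        # iter(callable, sentinel) keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_decompressed_content(path_a, path_b):
    """True if two gzip files decompress to byte-identical content."""
    return decompressed_sha256(path_a) == decompressed_sha256(path_b)
```

Comparing the output of the old step and the new step with `same_decompressed_content(old_path, new_path)` would confirm that only the compression differs, even when the compressed files are not byte-identical.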