The performance here should be greatly improved. On a 1.4 GB test dataset, the step took 1m26s compared to 43m27s before, roughly a 30x speedup.
I still need to test seqkit's error conditions. @katrinakalantar do you foresee any issues with using seqkit? My main concern is that it may be stricter than our previous validation steps.
I added more robust error handling, including a FASTA check. The gzip files differ slightly at the byte level because seqkit compresses with a parallel gzip implementation. The decompressed contents should be identical to the output of the previous code.
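
Since the compressed bytes are expected to differ, a byte-for-byte diff of the `.gz` files is not a useful check. Below is a minimal sketch of how the decompressed outputs could be compared instead. The helper names (`decompressed_sha256`, `same_decompressed_content`) are hypothetical and not part of this PR; the sketch only assumes both outputs are ordinary gzip files.

```python
import gzip
import hashlib


def decompressed_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a gzip file's decompressed contents.

    Hypothetical helper: streams the file in chunks so large outputs
    do not need to fit in memory.
    """
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as handle:
        # Read until gzip returns an empty bytes object (end of stream).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_decompressed_content(path_a, path_b):
    """True if two gzip files decompress to identical bytes."""
    return decompressed_sha256(path_a) == decompressed_sha256(path_b)
```

Calling `same_decompressed_content(old_output, new_output)` on the old and new outputs for the same input would confirm the claim above without depending on how each tool compresses.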