chanzuckerberg / idseq-workflows

Portable WDL workflows for IDseq production pipelines
https://idseq.net/
MIT License

truncate consensus genome inputs #134

Closed morsecodist closed 3 years ago

morsecodist commented 3 years ago

This truncates consensus genome inputs following the same logic we use in short-read-mngs. It also throws an error if you submit Illumina data with reads longer than 300 base pairs.
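For readers who have not seen the diff, here is a minimal sketch of what this kind of truncate-and-validate step can look like. It is not the code in this PR. The function names, the `max_reads` and `technology` parameters, and the assumption of strict four-line FASTQ records are all hypothetical; only the 300 base pair limit for Illumina comes from the description above.

```python
import gzip
from itertools import islice

# Limit taken from the PR description; everything else here is illustrative.
MAX_ILLUMINA_READ_LENGTH = 300


class ReadTooLongError(ValueError):
    """Raised when an Illumina read exceeds the allowed length."""


def truncate_fastq(src, dst, max_reads, technology="Illumina"):
    """Copy at most `max_reads` FASTQ records from `src` to `dst`.

    `src` and `dst` are text streams. Assumes each record is exactly
    four lines (header, sequence, '+', quality). Returns the number of
    records written.
    """
    written = 0
    while written < max_reads:
        record = list(islice(src, 4))
        if not record:
            break  # end of input
        if len(record) < 4:
            raise ValueError("incomplete FASTQ record at end of input")
        sequence = record[1].rstrip("\r\n")
        if technology == "Illumina" and len(sequence) > MAX_ILLUMINA_READ_LENGTH:
            raise ReadTooLongError(
                f"read of {len(sequence)} bp exceeds the "
                f"{MAX_ILLUMINA_READ_LENGTH} bp limit for Illumina"
            )
        dst.writelines(record)
        written += 1
    return written


def truncate_fastq_gz(in_path, out_path, max_reads, technology="Illumina"):
    """Gzip-to-gzip wrapper around truncate_fastq (hypothetical helper)."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        return truncate_fastq(src, dst, max_reads, technology)
```

In this sketch only the reads that survive truncation are length-checked, so an over-long read past the cutoff would go unnoticed. Whether the real step behaves that way is not stated here.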

I held off on this because it is not as performant as it could be, and there are more improvements we could make to validation. Still, it meets the requirements, goes a bit beyond them, and is a good addition in my view. We can tweak it as needed. The biggest slowdown is actually gzipping and un-gzipping the inputs and outputs. I kept it as is for now, but we may want to revisit this. I tried a Rust version, but it was not much faster: gzipping is the time-consuming part, and both versions use performant gzip implementations (maybe even the same one, for all I know).

Added tests too.