CDCgov / phoenix

🔥🐦🔥PHoeNIx: A short-read pipeline for healthcare-associated and antimicrobial resistant pathogens
Apache License 2.0

[BUG] - Assembler identification can sometimes fail for flye #131

Closed jfnjdoh closed 5 months ago

jfnjdoh commented 5 months ago

When using the SCAFFOLDS entry point, Phoenix determines the assembler by looking at the name of the first entry in the assembly FASTA. For flye, it assumes that the name of the first contig is always contig_1. However, sometimes it is not, and flye's developer said this is normal behavior. When this happens, the pipeline fails at line 161 because it cannot determine the assembler.

For now I've just been manually changing the name to contig_1, but that's a bad solution. A better one might be to:

  1. Use a regex instead, perhaps contig_\d+. Complications might ensue if names are similar between assemblers, but you should be OK in this case as long as the match is case-sensitive on the leading c.
  2. Let the user pass the name of the assembler as an argument, --assembler, and skip those checks whenever --assembler is specified.
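
A minimal sketch of how the two suggestions could fit together, written in Python purely for illustration. This is not Phoenix's actual code: the function names, the `HEADER_PATTERNS` table, and the `"unknown"` fallback are all hypothetical, and the SPAdes pattern is only there to show how a second assembler would slot in. An explicit assembler name (what a hypothetical `--assembler` option would supply) skips the header check entirely; otherwise the first FASTA header is matched case-sensitively against anchored regexes.

```python
import re
from typing import Optional

# Hypothetical table of first-header patterns, one per assembler.
# Patterns are case-sensitive and must match the whole contig name.
HEADER_PATTERNS = {
    "flye": re.compile(r"contig_\d+"),
    "spades": re.compile(r"NODE_\d+_length_\d+_cov_\d+(\.\d+)?"),
}


def first_header(fasta_text: str) -> str:
    """Return the name of the first FASTA record (text after '>' up to whitespace)."""
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            return parts[0] if parts else ""
    raise ValueError("no FASTA header found")


def guess_assembler(fasta_text: str, assembler: Optional[str] = None) -> str:
    """Guess the assembler from the first contig name.

    If `assembler` is given (e.g. from a user-supplied option), it is
    returned as-is and the header is never inspected.
    """
    if assembler is not None:
        return assembler
    name = first_header(fasta_text)
    for label, pattern in HEADER_PATTERNS.items():
        # fullmatch anchors both ends, so 'contig_1_extra' does not match.
        if pattern.fullmatch(name):
            return label
    # Fall back to a sentinel instead of failing the run.
    return "unknown"
```

With this sketch, a first header of `>contig_224` is still recognized as flye, `>Contig_1` is not, and an unrecognized header yields `"unknown"` rather than an error.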

jvhagey commented 5 months ago

Hey @jfnjdoh, thanks for the excellent documentation. We are just about to release the new v2.1.0 version of phoenix this week, and I added handling for when it can't determine the assembler. Are you able to run the v2.1.0-dev branch with -entry SCAFFOLDS and let me know if it gets past that step now?

jfnjdoh commented 5 months ago

I took a sample that I knew already worked and had contig_1 as the first name, changed it to contig_224, and reran it. It worked fine on the dev version. Thanks for the fix.

jvhagey commented 5 months ago

whoo hoo, I love an easy fix. Keep an eye out for the new release at the end of the week. If you want to be included in release emails, then email and request to be added to the listserv. Happy sequencing!