[BUG] - Assembler identification can sometimes fail for flye

jfnjdoh commented 5 months ago

When using the SCAFFOLDS entry point, Phoenix determines the assembler by looking at the name of the first entry in the assembly FASTA (see https://github.com/CDCgov/phoenix/blob/main/bin/rename_fasta_headers.py#L124). For flye, it assumes that the first contig is always named contig_1. However, sometimes it is not, and flye's developer said this is normal behavior (https://github.com/fenderglass/Flye/issues/667). When this happens, the pipeline fails at line 161 because it cannot determine the assembler.

For now I've just been manually changing the name to contig_1, but that's a bad solution. A better one might be to:

Use a regex instead, perhaps contig_\d+. Complications might ensue if names are similar between assemblers, but it seems you'd be OK in this case as long as the match is case-sensitive on the c.
Let the user pass the name of the assembler as an argument, --assembler, and skip those checks when it is specified.
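
To make the two suggestions concrete, here is a rough sketch of how they could fit together. It is not Phoenix's actual code: the function names, the pattern table, and the --assembler flag are all hypothetical, and the SPAdes pattern is only there to show that a case-sensitive match keeps contig_N distinct from other naming schemes.

```python
import argparse
import re

# Hypothetical pattern table, NOT taken from rename_fasta_headers.py.
# Matching is case-sensitive, so "contig_224" matches flye but "Contig_224" does not.
ASSEMBLER_PATTERNS = {
    "flye": re.compile(r"^contig_\d+$"),
    "spades": re.compile(r"^NODE_\d+_length_\d+_cov_[\d.]+"),
}


def first_header(fasta_path):
    """Return the first FASTA header line without the leading '>'."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                return line[1:].strip()
    raise ValueError(f"No FASTA header found in {fasta_path}")


def guess_assembler(header, override=None):
    """Return the assembler name, or None if no pattern matches.

    If the user supplied an override (e.g. via --assembler), skip the checks.
    """
    if override:
        return override
    fields = header.split()
    name = fields[0] if fields else ""  # ignore any description after the ID
    for assembler, pattern in ASSEMBLER_PATTERNS.items():
        if pattern.match(name):
            return assembler
    return None


def build_parser():
    """Hypothetical CLI: a FASTA path plus an optional --assembler override."""
    parser = argparse.ArgumentParser()
    parser.add_argument("fasta")
    parser.add_argument("--assembler", default=None)
    return parser
```

With something like this, guess_assembler("contig_224") would identify flye even when the first contig is not contig_1, and passing --assembler would bypass the header check entirely.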

jvhagey commented 5 months ago

Hey @jfnjdoh, thanks for the excellent documentation. We are just about to release the new v2.1.0 version of phoenix this week, and I added handling for when it can't determine the assembler. Are you able to run the v2.1.0-dev branch with -entry SCAFFOLDS and let me know if it gets past that step now?

jfnjdoh commented 5 months ago

I took a sample that I knew already worked and had contig_1 as the first name, changed it to contig_224, and reran it. It worked fine on the dev version. Thanks for the fix.

jvhagey commented 5 months ago

Whoo hoo, I love an easy fix. Keep an eye out for the new release at the end of the week. If you want to be included in release emails, email HAISeq@cdc.gov and request to be added to the listserv. Happy sequencing!

CDCgov / phoenix

[BUG] - Assembler identification can sometimes fail for flye #131