Open austinhpatton opened 1 year ago
Okay, so I've made a number of changes, and this now works as anticipated.
The `sppid_protid_delim` parameter is now consistent throughout.

I haven't yet added a check at the start of the workflow to make sure the sequence headers are named properly. I have, however, added a check that the delimiter is actually present in the sequence IDs; if it isn't, the workflow stops and prints a useful error message. I think we can make these checks a fair bit more extensive, but that could be part of a larger effort to build in checks throughout the workflow.
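For illustration, here is a minimal sketch of what a delimiter check like this could look like. It is not the code in this PR: the function names are hypothetical, and only the `sppid_protid_delim` parameter name comes from the workflow.

```python
def find_ids_missing_delim(seq_ids, sppid_protid_delim="_"):
    """Return the sequence IDs that do not contain the delimiter."""
    return [sid for sid in seq_ids if sppid_protid_delim not in sid]


def require_delim(seq_ids, sppid_protid_delim="_"):
    """Stop with a readable message if any ID lacks the delimiter."""
    missing = find_ids_missing_delim(seq_ids, sppid_protid_delim)
    if missing:
        # Show only a few offending IDs so the message stays short.
        preview = ", ".join(missing[:5])
        raise SystemExit(
            f"ERROR: {len(missing)} sequence ID(s) lack the delimiter "
            f"{sppid_protid_delim!r} (e.g. {preview}). Expected IDs of the "
            f"form Genus-species{sppid_protid_delim}proteinID."
        )
```

A check like this only confirms that the delimiter is present. It would not catch headers that contain the delimiter but are otherwise misnamed, which is where the more extensive checks would come in.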
Okay, so as we briefly discussed, this is a (relatively) simple change to use an updated naming convention for protein IDs, made to be consistent with the snakemake preprocessing workflow.
The old convention was:
`Genus_species:proteinID`
OrthoFinder replaces the colon with an underscore, which made splitting the species and protein IDs more challenging, since the species ID itself already contains an underscore.
Now, the convention is:
`Genus-species_proteinID`
The changes I implemented here basically just parameterize the delimiter, with `_` as the default, and the annotation module now splits the two identifiers using the parameter value.

I haven't actually tested it yet (hence the draft PR), but I will make an updated version of the test dataset that follows this convention so that I can do so.
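As a rough sketch of the splitting step (not the actual annotation module code), something like the following would work under the new convention. The function name is hypothetical, and it assumes the species ID never contains the delimiter, so splitting on the first occurrence is safe even when the protein ID contains it.

```python
def split_seq_id(seq_id, sppid_protid_delim="_"):
    """Split 'Genus-species_proteinID' into (species_id, protein_id).

    Splits on the FIRST occurrence of the delimiter only, so protein IDs
    that themselves contain the delimiter are kept intact.
    """
    species_id, sep, protein_id = seq_id.partition(sppid_protid_delim)
    if not sep or not species_id or not protein_id:
        raise ValueError(
            f"Cannot split {seq_id!r} on delimiter {sppid_protid_delim!r}"
        )
    return species_id, protein_id


# Example with a made-up ID whose protein part contains an underscore.
species, protein = split_seq_id("Homo-sapiens_NP_000001")
print(species, protein)
```

Under the old convention, once the colon had become an underscore, the same first-occurrence split would have cut the ID between genus and species, which is the ambiguity the new convention avoids.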