Arcadia-Science / noveltree

NovelTree is a highly parallelized and computationally efficient phylogenomic workflow that infers gene families, gene family trees, species trees, and gene family evolutionary history.
GNU Affero General Public License v3.0

Ap/sppid protid delim #84

Open austinhpatton opened 1 year ago

austinhpatton commented 1 year ago

Okay, so as we briefly discussed, this is a (relatively) simple change that adopts an updated naming convention for protein IDs, consistent with the Snakemake preprocessing workflow.

Old convention was: Genus_species:proteinID

OrthoFinder replaced the colon with an underscore, which made splitting the species and protein IDs more challenging: the underscore already separates genus from species, so it no longer marks a unique boundary.

Now, the convention is: Genus-species_proteinID

The changes I implemented here parameterize the delimiter, with _ as the default. The annotation module then splits the two identifiers using the parameter value.
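As a rough illustration of the splitting logic described above (not the workflow's actual code, which lives in the annotation module and an R script), here is a minimal Python sketch. The function name `split_header` and the choice to split at the first occurrence of the delimiter are assumptions made for this example.

```python
def split_header(header: str, delim: str = "_") -> tuple[str, str]:
    """Split a sequence header into (species_id, protein_id).

    Hypothetical helper: splits at the FIRST occurrence of `delim`, so
    protein IDs may themselves contain the delimiter.
    """
    species_id, sep, protein_id = header.partition(delim)
    if not (sep and species_id and protein_id):
        raise ValueError(
            f"Header {header!r} has no {delim!r} separating a species ID "
            "from a protein ID"
        )
    return species_id, protein_id
```

With the new convention, `split_header("Genus-species_prot_001")` gives `("Genus-species", "prot_001")`. Passing `delim=":"` handles the old convention in the same way, provided the colon is still present in the header.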

I haven't actually tested it yet (hence the draft PR), but will make an updated version of the test dataset that follows this convention so that I can do so.

austinhpatton commented 1 year ago

Okay, so I've made a number of changes, and this now works as anticipated.

  1. I've made the naming of the sppid_protid_delim parameter consistent throughout.
  2. The delimiter is provided as input to the cogeqc R script, which uses it to split the sequence headers. This works with either naming convention.

I've not yet added a check at the start of the workflow to make sure that the sequence headers are named properly. I have, though, included a check that the delimiter is actually present in the sequence IDs; if it's not, the workflow stops and prints a useful error message. I think we can make these checks a fair bit more extensive, but doing so could be part of a larger effort to build in checks throughout the workflow.
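For anyone wanting to picture the kind of check described here, a minimal Python sketch follows. It is illustrative only: the function names `find_bad_headers` and `check_headers`, and the error wording, are made up for this example and are not taken from the workflow.

```python
import sys


def find_bad_headers(headers, delim="_"):
    """Return headers that lack `delim` or have an empty part on either side."""
    bad = []
    for header in headers:
        species_id, sep, protein_id = header.partition(delim)
        if not (sep and species_id and protein_id):
            bad.append(header)
    return bad


def check_headers(headers, delim="_"):
    """Stop with an error message if any header cannot be split on `delim`."""
    bad = find_bad_headers(headers, delim)
    if bad:
        preview = ", ".join(bad[:5])  # show only the first few offenders
        sys.exit(
            f"ERROR: {len(bad)} sequence ID(s) cannot be split on {delim!r} "
            f"(expected <species>{delim}<protein>), e.g.: {preview}"
        )
```

A presence check like this cannot tell a well-formed header from an old-style one that was rewritten to Genus_species_proteinID: with _ as the delimiter, that header still splits, just at the wrong place. Catching that case would probably need a stricter check, for example comparing the species part against the list of expected species IDs.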