corneliusroemer / pango_aliasor

Utility to alias and dealias pango lineages
MIT License

Functionality to Identify and Assign New Aliases #11

Open jmcbroome opened 1 year ago

jmcbroome commented 1 year ago

Proposed Changes

This pull request adds functionality for identifying and assigning the next available alias code to arbitrary lineages, which I wrote while working on my automated lineage designation pipeline. It may be a useful addition to your package, both for your own designation workflows and for other users, though I would understand if you feel this is outside the scope of this tool or are concerned that these methods could confuse users who are not interested in designating lineages.

In terms of implementation, it works by converting the Pango aliases into base-26 numbers, finding the maximum, and incrementing it by 1 to find the next available alias. Banned letters (I, O, and X) are handled by incrementing past them when the number is converted back into an alias string. Recombinant lineages (prefixed with X) are tracked as a separate group, but the same functions are available when the appropriate parameter is set.
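The following is a minimal sketch of that idea, not the code in this PR. It assumes a bijective base-26 encoding (A=1 ... Z=26, as in spreadsheet column names) so that "A" and "AA" map to different numbers, and it assumes the banned letters also apply to the part of a recombinant alias after the leading X. The function names `alias_to_int`, `int_to_alias`, and `next_alias` are hypothetical.

```python
BANNED = frozenset("IOX")  # letters that may not appear in a new alias


def alias_to_int(alias: str) -> int:
    """Read an alias as a bijective base-26 number (A=1, ..., Z=26)."""
    n = 0
    for ch in alias:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def int_to_alias(n: int) -> str:
    """Inverse of alias_to_int for n >= 1."""
    letters = []
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters.append(chr(ord("A") + rem))
    return "".join(reversed(letters))


def next_alias(existing, recombinant=False):
    """Return the alias one past the current maximum, skipping banned letters.

    Recombinant aliases (leading X) are treated as a separate group:
    the X is stripped before counting and restored on the result.
    """
    if recombinant:
        bodies = [a[1:] for a in existing if a.startswith("X") and len(a) > 1]
    else:
        bodies = [a for a in existing if not a.startswith("X")]
    n = max((alias_to_int(b) for b in bodies), default=0) + 1
    # Keep incrementing while the candidate contains a banned letter.
    while BANNED & set(int_to_alias(n)):
        n += 1
    candidate = int_to_alias(n)
    return "X" + candidate if recombinant else candidate


existing = ["A", "B", "C", "AY", "BA", "XA", "XB"]
print(next_alias(existing))                    # BB
print(next_alias(existing, recombinant=True))  # XC
print(next_alias(["H"]))                       # J (I is skipped)
```

Because banned letters are skipped one number at a time, a maximum such as "HZ" advances past the whole "I?" block to "JA".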

It includes two new methods and a small number of hidden helper functions:

  1. Aliasor().next_available_alias(recombinant): returns the next available alias string. Set the recombinant parameter to True to get the next available recombinant alias (prefixed with X).
  2. Aliasor().assign_alias(name, recombinant): assigns the input name to the next available alias string. Assigns it to the next available recombinant alias if the recombinant parameter is True.

Additionally, it adds a new parameter to compress(), which when True automatically assigns a new alias string in the case of a fourth suffix level with no accepted alias. The default behavior matches the current behavior (raises an error for unhandled fourth suffix levels).
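To illustrate how the two methods and the new compress() option could fit together, here is a toy stand-in class. It is a sketch under assumptions, not the Aliasor implementation: `ToyAliasor`, the `assign` parameter name, and the choice to enumerate candidate aliases in order are all hypothetical, and recombinants are left out to keep it short.

```python
from itertools import count, product
from string import ascii_uppercase

ALLOWED = [c for c in ascii_uppercase if c not in "IOX"]


def candidate_aliases():
    """Yield A, B, ..., Z, AA, AB, ... using only allowed letters."""
    for length in count(1):
        for letters in product(ALLOWED, repeat=length):
            yield "".join(letters)


class ToyAliasor:
    """Hypothetical stand-in; maps alias -> full (unaliased) lineage prefix."""

    def __init__(self, alias_dict):
        self.alias_dict = dict(alias_dict)
        self.realias = {full: alias for alias, full in self.alias_dict.items()}

    def next_available_alias(self):
        # Non-recombinant aliases only; order by (length, letters).
        used = {a for a in self.alias_dict if not a.startswith("X")}
        highest = max(used, key=lambda a: (len(a), a), default="")
        for cand in candidate_aliases():
            if (len(cand), cand) > (len(highest), highest):
                return cand

    def assign_alias(self, name):
        alias = self.next_available_alias()
        self.alias_dict[alias] = name
        self.realias[name] = alias
        return alias

    def compress(self, name, assign=False):
        parts = name.split(".")
        levels = len(parts) - 1
        indirections = (levels - 1) // 3
        if indirections <= 0:
            return name  # three or fewer suffix levels: nothing to alias
        cut = 3 * indirections + 1
        prefix, ending = ".".join(parts[:cut]), ".".join(parts[cut:])
        if prefix not in self.realias:
            if not assign:
                raise KeyError(f"no alias known for {prefix}")
            self.assign_alias(prefix)  # opt-in: create the missing alias
        return self.realias[prefix] + "." + ending


toy = ToyAliasor({"A": "A", "B": "B", "C": "B.1.1.1"})
print(toy.compress("B.1.1.1.2"))               # C.2
print(toy.compress("B.1.1.7.5", assign=True))  # D.5
```

Keeping the default at `assign=False` means existing callers still get an error for an unhandled fourth suffix level, and a new alias is only created when the caller opts in.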

It's worth noting that I did not write code to automatically export an updated alias_key.json, mostly because some of the information in alias_key.json is lost on loading: the multiple parent lineages of each recombinant are not stored, so a JSON rebuilt from the attributes of the Aliasor() object would be incomplete. This could be the subject of a future update.
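The information loss can be seen in a small sketch. It assumes, as the paragraph above describes, that the loader keeps no record of recombinant parents and maps such entries (and root lineages) to themselves; `load_aliases` and `rebuild_alias_key` are hypothetical helpers, and the sample entries are illustrative only, written in the shape of alias_key.json.

```python
import json

# Illustrative entries: strings for ordinary aliases,
# a list of parent lineages for a recombinant.
SAMPLE_ALIAS_KEY = json.loads("""
{
  "A": "",
  "B": "",
  "C": "B.1.1.1",
  "XA": ["B.1.1.7", "B.1.177"]
}
""")


def load_aliases(alias_key):
    """Assumed loading behavior: roots and recombinants map to themselves."""
    loaded = {}
    for alias, target in alias_key.items():
        if isinstance(target, list) or target == "":
            loaded[alias] = alias  # the parent list is discarded here
        else:
            loaded[alias] = target
    return loaded


def rebuild_alias_key(loaded):
    """Best-effort export from the loaded mapping alone."""
    return {a: ("" if t == a else t) for a, t in loaded.items()}


loaded = load_aliases(SAMPLE_ALIAS_KEY)
rebuilt = rebuild_alias_key(loaded)
lost = {k: v for k, v in SAMPLE_ALIAS_KEY.items() if rebuilt[k] != v}
print(lost)  # {'XA': ['B.1.1.7', 'B.1.177']}
```

Ordinary aliases survive the round trip, but the recombinant entry comes back as an empty string, so an export would need the parent lists to be kept at load time.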

I have followed the guidelines posted here and here in developing and testing this code. Please let me know if I missed any additional rules, if there are unhandled cases I am not covering, or if you notice any other problems with these changes.

Testing

I've updated the tests with:

  1. a simple check that fetches the next alias and asserts that setting recombinant=True yields an alias with the 'X' prefix while recombinant=False does not.
  2. a check that assigns new aliases for both standard and recombinant lineages and asserts that they were stored correctly and assigned to the correct value.