corneliusroemer / pango_aliasor

Utility to alias and dealias pango lineages
MIT License

Compressing and uncompressing name_split #6

Closed: dtlyfoung closed this issue 1 year ago

dtlyfoung commented 1 year ago

When using Aliasor() to uncompress/compress aliased/unaliased lineages, an unexpected lineage is sometimes returned. For example, when I run the following:

from pango_aliasor.aliasor import Aliasor

aliasor = Aliasor()
print(aliasor.uncompress("CC"))
print(aliasor.compress("B.1.1.529.5.3.1.1.1.2"))

I expected to see:

B.1.1.529.5.3.1.1.1.2
CC

But the code as written outputs the following:

CC
BE.1.1.2

Is this the expected behavior, and am I misunderstanding the code?

I suspect this is a result of how name_split is handled in the compress and uncompress functions. When I added "." to the end of each argument, the code produced the output I expected:

from pango_aliasor.aliasor import Aliasor

aliasor = Aliasor()
print(aliasor.uncompress("CC."))
print(aliasor.compress("B.1.1.529.5.3.1.1.1.2."))

Output:

B.1.1.529.5.3.1.1.1.2.
CC.

Note that these are the aliased/unaliased lineages with "." appended to the ends.

ciscorucinski commented 1 year ago

| PANGO | Partial PANGO | Unaliased PANGO |
| --- | --- | --- |
| BE.1.1.2 | BA.5.3.1.1.1.2 | B.1.1.529.5.3.1.1.1.2 |
| CC.1 | BA.5.3.1.1.1.2.1 | B.1.1.529.5.3.1.1.1.2.1 |

The value "B.1.1.529.5.3.1.1.1.2" will always compress to BE.1.1.2. There aren't enough subdivisions to re-alias it to CC: BE.1.1.2 has only three dots, and there need to be four (as in BE.1.1.2.1, which becomes CC.1) before the name is re-aliased.

That is why your additional-dot workaround "works". With the trailing dot, the name splits into five divisions (BE, 1, 1, 2, and an empty string), and that qualifies it to be re-aliased to CC followed by an empty division, i.e. "CC.".
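The effect of the trailing dot can be seen with plain Python string splitting, independent of the library. This is only an illustration of the division counts described above; `divisions` is a hypothetical helper name, not part of pango_aliasor.

```python
def divisions(name: str) -> list[str]:
    """Split a lineage name on dots, the way a name_split-style step would."""
    return name.split(".")


# Without the trailing dot there are four divisions.
print(divisions("BE.1.1.2"))   # ['BE', '1', '1', '2']

# With the trailing dot, str.split adds an empty fifth division.
print(divisions("BE.1.1.2."))  # ['BE', '1', '1', '2', '']
```

The empty string at the end is what makes the padded name look one level deeper than it really is.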

Inverse Property

As long as you pass in a valid PANGO lineage, the Inverse Property of uncompress(...)/compress(...) will hold; the uncompressed string can be compressed back to the original valid PANGO lineage. Please note that this library does not sanitize inputs, and validity is up to the user of this library.

The only valid PANGO lineages that don't have dots are:

1) One of the two original haplotypes (A or B)
2) Any original recombinant haplotype (XA, XBB, etc.)

So technically CC isn't a valid PANGO lineage, and as such you can't expect it to follow the "Inverse Property". However, all valid PANGO lineages should follow the "Inverse Property".
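Since the library does not sanitize inputs, a caller could pre-check names before relying on the Inverse Property. The sketch below encodes only the rule stated above (a name without dots must be A, B, or a recombinant starting with X) plus a basic shape check. The function name and the regular expression are assumptions for illustration, not part of pango_aliasor, and the check does not confirm that a lineage has actually been designated.

```python
import re

# Assumed shape: uppercase letters, then zero or more ".<digits>" levels.
_NAME_SHAPE = re.compile(r"[A-Z]+(?:\.\d+)*")


def is_plausible_lineage(name: str) -> bool:
    """Rough plausibility check; hypothetical helper, not a library function."""
    if not _NAME_SHAPE.fullmatch(name):
        return False  # rejects trailing dots, lowercase, empty levels
    if "." not in name:
        # Dot-free names are only valid for A, B, or recombinants (X...).
        return name in {"A", "B"} or name.startswith("X")
    return True


print(is_plausible_lineage("CC"))    # False
print(is_plausible_lineage("CC.1"))  # True
print(is_plausible_lineage("XBB"))   # True
```

A bare alias such as CC and the dot-padded workaround "CC." are both rejected, while CC.1 and BE.1.1.2 pass.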

Valid PANGO Lineage

CC.1 will uncompress to B.1.1.529.5.3.1.1.1.2.1. By the Inverse Property, B.1.1.529.5.3.1.1.1.2.1 will compress to CC.1.

CC.1 == CC.1 🟩 Inverse Property holds

Invalid PANGO Lineage

However, CC is an invalid PANGO lineage. It is the alias for B.1.1.529.5.3.1.1.1.2 (although, per the output reported above, uncompress("CC") returns CC unchanged). When compressing B.1.1.529.5.3.1.1.1.2, you will get BE.1.1.2 (a valid PANGO lineage), not CC.

CC != BE.1.1.2 🟥 Inverse Property doesn't hold
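The outputs reported in this thread can be reproduced with a small toy model. This is a sketch under stated assumptions, not the library's implementation: the alias table holds only the three aliases that can be derived from the table above, and the rule "take one alias step per three levels, as long as something is left over" is inferred from the observed outputs. All names here (`TOY_ALIASES`, `toy_compress`, `toy_uncompress`) are hypothetical.

```python
# Toy alias table derived from the table in this thread (assumption: the real
# library uses a much larger alias key).
TOY_ALIASES = {
    "BA": "B.1.1.529",
    "BE": "B.1.1.529.5.3.1",
    "CC": "B.1.1.529.5.3.1.1.1.2",
}
FULL_TO_ALIAS = {full: alias for alias, full in TOY_ALIASES.items()}


def toy_uncompress(name: str) -> str:
    """Expand the leading alias; names without a dot are returned unchanged."""
    head, sep, tail = name.partition(".")
    full = TOY_ALIASES.get(head)
    if full is None or not sep:
        return name
    return f"{full}.{tail}"


def toy_compress(name: str) -> str:
    """Replace the longest aliasable prefix, leaving at least one part over."""
    parts = name.split(".")
    # Each alias step absorbs three levels; one part must remain after the cut.
    indirections = (len(parts) - 2) // 3
    if indirections <= 0:
        return name
    cut = 3 * indirections + 1
    prefix = ".".join(parts[:cut])
    ending = ".".join(parts[cut:])
    # Raises KeyError for prefixes missing from the toy table.
    return f"{FULL_TO_ALIAS[prefix]}.{ending}"


print(toy_compress("B.1.1.529.5.3.1.1.1.2"))   # BE.1.1.2
print(toy_compress("B.1.1.529.5.3.1.1.1.2."))  # CC.
print(toy_compress(toy_uncompress("CC.1")))    # CC.1
```

In this model the bare name CC never appears as a compress result, because a prefix is only replaced when at least one division remains after it.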

dtlyfoung commented 1 year ago

Thank you @ciscorucinski - I think I understand this and your comment helps enough for me to close this out.