bebop / poly

A Go package for engineering organisms.
https://pkg.go.dev/github.com/bebop/poly
MIT License
659 stars 68 forks source link

Compressed Seqhash #397

Closed Koeng101 closed 7 months ago

Koeng101 commented 8 months ago

What I want

I would like a more compressed Seqhash. Here is a current seqhash: v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350

Here is the latter portion encoded in base58: HRWk6jLXJ3uvuKBnjyAhinEUsuzKbgpphDkrEcStX4AT - much shorter, 44 letters instead of 64. If we truncate to 16 bytes instead of 32 bytes, we get X8d1qRxANHFkdQM4kqKYWb. Much shorter! (base58 is nicer for encoding into various applications since it doesn't have any special characters)

If there are 8 bits in a byte, we can have a flag take up less space:

3 bit: seqhash version
1 bit: 15 byte version (vs 32 byte, which is default). 15 byte should be good enough for the majority of purposes, while being half the size, and the full seqhash nicely fits in 16 bytes.
1 bit: circularity
1 bit: double-strandedness
2 bit: DNA/RNA/PROTEIN (other left unspecified)

This would result in seqhashes that are 16 bytes, and would take up 22 characters of text rather than the current 71. This is far better than, say, 1000char when it comes to encoding a full protein.

I ask for 15 byte truncation rather than 16 byte truncation (120 bit vs 128 bit), so that the flags can fit, while still being a multiple of 2. (and to have a 50% chance of a collision in a 120-bit hash, you would need to generate approximately 1.357×10^18 hashes)

Why this is important

I am looking at building seqhashes into applications that interact with LLMs. LLMs are very bad at handling sequences, since they usually want to introspect inputs and outputs from APIs / interactive code. This introspection is also very useful for most users, since the AI can automatically improve if it can look at the full input / output of its code. However, the LLM then tries to look at sequences themselves, and then it gets kinda confused. This feature would greatly compress the amount of data needed to refer to genetic sequences.