bebop / poly

A Go package for engineering organisms.
https://pkg.go.dev/github.com/bebop/poly
MIT License
659 stars 68 forks source link

Add seqhash v2 #398

Closed Koeng101 closed 7 months ago

Koeng101 commented 8 months ago

Changes in this PR

Fixes #397 . This PR makes seqhash v2, which is both compressed (~3x smaller than seqhash v1) and has compatibility for encoding fragments as seqhashes.

This PR has 100% test coverage.

Why are you making these changes?

Seqhash v1 produces hashes that are too large, and cannot be used to encode genetic parts / fragments.

Are any changes breaking? (IMPORTANT)

No

Pre-merge checklist

All of these must be satisfied before this PR is considered ready for merging. Mergeable PRs will be prioritized for review.

Koeng101 commented 8 months ago

Don't merge this yet: I am unsatisfied with the fragment algorithm, and want to fix it up.

Koeng101 commented 8 months ago

Ready for check @TimothyStiles

Koeng101 commented 7 months ago

@TimothyStiles Feel good about it. Changed.

Koeng101 commented 7 months ago

Least Rotation is super applicable and allows people to hash plasmids reliably.

It doesn't though - you still have to do least comparison in order to get reverse complements.

Should the non-metadata string component be non-cryptographic and more similar to bwt for speed and better comparison?

I don't think they should be used for comparison at least, not many opinions on cryptography but it's hard to fix non cyptographic later if someone maliciously breaks things. Seqhash is used as an identifier - the kmer table or mash or blast index or minimap index or whatever can be on the other side of the index. Let's you change what your comparison algorithm is, which is much more important that your hashing algorithm.

Should V2 just be the metadata and we let users choose their own hashing method?

No, I don't think this is a good idea when trying to make relatively stable identifiers.

How many different hash algorithms should we intend to support and how should they be named?

However many are necessary! And I'm not really sure about naming - but might be good to think about.

Same could be said for the seqhash family but in this case there are some true semantic difference in both input and output. Should seqhashV2 be called FragHash?

I think I explain the use case for seqhashV2 pretty clearly in the package documentation. It shouldn't be called FragHash because it isn't just for fragments, it's for DNA/RNA/Proteins/Fragments. Fragment hashing is just a feature of v2 that v1 did not have.

"When should I use V1 rather than V2?"

This is a good question that I should add to the top of the package docs, but I do think it is answered if you read the v2 documentation. ie, V2 is much smaller but sacrifices length from 256 to 120 bits, increasing collision chance. But more people will probably have that question and not bother to read the full docs.

"how many versions will there be

"How many versions will there be" I think is literally unanswerable because you can't fully predict the future, and if it was predictable, that would be bad, because then it would be inflexible to the changing needs of users. The point of having versions is so that you can add more later, otherwise just do away with versioning at all.

how will I know which one I should be using if there's, V3, V4, etc?".

Well, presumably we can explain that to the user.

V2 gives the assumption that you should always use the most up to date hash version. Is there any case where that's not true and it would be better to use V1 instead of V2?

Rather than use 256 bits for encoding
the hash, we use 120 bits. Since seqhashes are not meant for security, this
is good enough (50% collision with 1.3x10^18 hashes), while making them
conveniently only 16 btyes long

Less collision chance. But generally speaking, I think you should probably always use V2 over V1.

TODO for me: