BjornFJohansson / pydna

Clone with Python! Data structures for double stranded DNA & simulation of homologous recombination, Gibson assembly, cut & paste cloning.

cSEGUID Collision #68

Closed ghost closed 3 years ago

ghost commented 4 years ago

The documentation says the following:

The cSEGUID checksum uniquely identifies a circular sequence regardless of where the origin is set.

I'm not sure: does this mean that collisions can never occur?

BjornFJohansson commented 4 years ago

Hi, the algorithm has two parts.

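First, the lexicographically smallest rotation of the DNA sequence string is found. I think there is a mathematical proof that this rotation starts at a unique position unless the string consists of repeated copies of a shorter substring (for example, two identical halves). In that case several start positions qualify, but they all give the identical string, so the result is still well defined.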

https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation
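
To make the rotation step concrete, here is a minimal Python sketch. It uses a naive comparison of all rotations rather than one of the faster algorithms described on the Wikipedia page, and the function names `smallest_rotation` and `smallest_rotation_starts` are chosen for illustration only; they are not claimed to match pydna's own code.

```python
def smallest_rotation(s: str) -> str:
    """Return the lexicographically smallest rotation of s.

    Naive O(n^2) illustration, not an optimized algorithm.
    """
    if not s:
        return s
    return min(s[i:] + s[:i] for i in range(len(s)))


def smallest_rotation_starts(s: str) -> list:
    """Return every start index whose rotation equals the smallest one."""
    smallest = smallest_rotation(s)
    return [i for i in range(len(s)) if s[i:] + s[:i] == smallest]


# The same circular sequence written from two different origins
# gives the same smallest rotation.
print(smallest_rotation("GATTACA") == smallest_rotation("TTACAGA"))

# A sequence made of two identical halves has two qualifying start
# positions, but both yield the identical string.
print(smallest_rotation_starts("ACGACG"))
```

The second call shows why repeated sequences are harmless here: the start position is ambiguous, but the rotated string itself is not.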

The second part is simply a URL-safe SHA-1 hash of the smallest string rotation. As far as I know, there have been no accidental SHA-1 collisions. Git uses SHA-1, although it seems to be transitioning to SHA-256.
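
The two parts combined could look roughly like the sketch below. Several details are assumptions made for illustration: the function name `circular_checksum` is hypothetical, the sequence is upper-cased before hashing, and the digest is encoded as URL-safe Base64 with the trailing padding removed. The sketch only follows the two steps described above and is not claimed to reproduce the exact output of pydna's cSEGUID.

```python
import base64
import hashlib


def smallest_rotation(s: str) -> str:
    """Lexicographically smallest rotation of s (naive O(n^2) version)."""
    if not s:
        return s
    return min(s[i:] + s[:i] for i in range(len(s)))


def circular_checksum(seq: str) -> str:
    """Hypothetical helper: URL-safe SHA-1 checksum of the smallest rotation.

    Assumptions: input is upper-cased, and the 20-byte digest is encoded
    with URL-safe Base64 with the '=' padding stripped.
    """
    rotated = smallest_rotation(seq.upper())
    digest = hashlib.sha1(rotated.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


# Two origins of the same circular sequence give the same checksum.
print(circular_checksum("GATTACA") == circular_checksum("TTACAGA"))

# A sequence that differs by one base gives a different checksum
# (barring a SHA-1 collision).
print(circular_checksum("GATTACA") == circular_checksum("GATTACC"))
```

With this construction, two different circular sequences can only share a checksum if their smallest rotations collide under SHA-1, which is the only source of collisions in the scheme as described.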

For my purposes, this seems to be accurate enough, and I have never experienced problems. It would be very easy to implement an upgraded version if needed.

Hope this helps, Björn