JohnLonginotto / ACGTrie

A custom Numpy Trie for storing counts of sequences
8 stars 0 forks source link

Order of characters in Up2bit #2

Open kloetzl opened 8 years ago

kloetzl commented 8 years ago

"Why use A-00, C-01, T-10, G-11? It's not in alphabetical order […] !!?‽"

Plus, it is confusing for humans.

Secondly, it's because there's a neat trick you can do to convert those four letters of ASCII directly into up2bit values. […] In some programming languages like C, we can cut out just those numbers very quickly.

I very much like your trick. :+1: To fix the order, another hack can be used.

    char n; // the nucleotide
    n &= 6; // mask
    n >>= 1; // shift
    n ^= (n >> 1); // fix lower bit

The lower bit is flipped only when the upper bit is 1 (i.e. G and T). In C this compiles down to five instructions, all bit-twiddling, making this method the fastest order-preserving "hash". (See also http://blog.kloetzl.info/hashing-nucleotides-to-two-bits/)

JohnLonginotto commented 8 years ago

I like it - its smart :) However, we should be careful to remember that there are two levels of knowledge anyone should need - the level of "absolutely nothing", ie, "i dont care how its encoded, just store my DNA", and the level of "I know enough about how this works to implement it myself.", which is obviously what you and I have. However, an intermediate-level of knowledge is dangerous, since you know just enough information to confuse yourself. The problem your code solves is for those intermediate people - who know that DNA can be converted to binary, and expect the binary order to match the ASCII order, but dont know enough about the implementation to know that not maintaining the order makes faster conversions.

So, I think it's great that you are thinking about your users -- this is not done enough in Bioinformatics -- however, the real users are the Biologists who dont care about the specifics of the implementation, or that their binary on disk is in a pleasing order. We should write software for biologists and engineer ourselves out of the problem entirely. This is also something that doesn't happen enough in Bioinformatics (often people write very confusing software to make sure they stay needed). So if it doesn't matter what the order is of the final encoded data is -- and all that matters is that encoding/decoding is fast and easy -- I think its best to stick to the method thats fastest. But please dont take that as a criticism - I still think you did something awesome to get that to work in the first place :)

kloetzl commented 8 years ago

I like it - its smart :)

Your method was much smarter than anything I came up with so far. When I realized it could be improved I just thought I'd “ping back”.

The problem your code solves is for those intermediate people […] dont know enough about the implementation to know that not maintaining the order makes faster conversions.

Well, it is faster by a move, an xor and a shift, which, given any GHz processor, should be really hard to measure.

I did not honestly expect you to change your up2bit spec, because that would be an incompatible change. Which would upset users more than the order of some bits. :smile:

(I suggest closing as won't fix)

JohnLonginotto commented 8 years ago

Oh, well, no one uses ACGTrie yet -- the project had to be put on hold until after my PhD, which ends in the next 6 months. After that I will continue developing this, and the format specifications will be finalized. The fastest version of this code, by Jim Bailey in the dev branch, makes some significant alterations to the data format so before the project proceeds, these sorts of things need to be figured out. Ironically there, he has thought of a much faster, albeit less simple, file format - and i'm the one thinking we should stick to a simpler format that others can understand - so its kind of like to opposite of whats going on here :P So anyway, once I can commit a month to this project to get it finished, it's perfectly plausible that I could use your code for the DNA encoding other than the current method :)