andy-thomason / genetics

A Boost library for genetic searches
Boost Software License 1.0
8 stars, 1 fork

Suffix Arrays #3

Closed: kloetzl closed this issue 8 years ago

kloetzl commented 9 years ago

The section in the Readme on suffix arrays is flawed.

Most genetic searches revolve around the creation of Suffix Arrays which are a sort of all the substrings of a sequence to the end.

Yes, and no. Most use an enhanced suffix array.

"hello" consists of the substrings "hello", "ello", "llo", "lo", "po" and "" The suffix array is: 5 "" 1 "ello" 0 "hello" 3 "lo" 2 "llo" 4 "o"

"llo" < "lo"!

We can use the suffix array to look up the position of any sequence by, for example, doing a binary chop search on the suffixes. In practice this is not a sensible thing to do as each of the memory accesses to the suffixes may cost thousands of cycles.

Accessing main memory should be a few hundred (like 200) cycles; not thousands.

Calculating a full suffix array for the 3.2 billion letters of the human genome is an expensive process, but fortunately we don't need to compute a full suffix array as sequences of 16 or more letters are likely to be unique.

"Likely" is the important word here. See this paper for the distribution of match lengths. Also, here is a recent paper that uses your uniqueness idea to create an SA.

Far better is to calculate a hybrid indexed suffix array which gives you all the substrings starting with a fixed number of letters. For this we add an index of 2^(2N) entries for N letters the suffixes sorted by address in the array instead of value.

What is far better really depends on what one wants to do with the SA. Searching for substrings, MEMs, MUMs and computing matching statistics are completely different things! Some require the LCP array, or even RMQ (CLD).

Probably one of the best C++ libraries for SAs is SDSL-lite. Go, check it out.

Best, Fabian

P.S. You should have a look at http://choosealicense.com/

andy-thomason commented 9 years ago

Hi Fabian,

Thanks for the feedback. Feel free to send me a pull request for the README.

I don't have the time to check whether these ideas are unique, and I don't care much. More important is to deliver a reliable searching service that generates 100% of the available results. This code is far from optimal, and it will be improved once the user base is established.

I am very keen to establish additional use cases, however, and it would be good if you could help with that.

I have experimented with a large number of search techniques, such as batch inner joins, which are ideal for large numbers of searches but complex to implement and difficult to understand. This implementation is a compromise: it delivers adequate performance of around 25k human-genome searches per second per thread on an average machine for 100-character strings. We could probably push this to 50k with some work, but readability would suffer.

Main memory accesses will average around 200 cycles, especially if you pre-sort your search keys, but most CPUs will do a TLB refill for every new page accessed and this I have measured at around 3000 cycles. This is mitigated somewhat as the accesses are done in parallel using the 8 or so retirement slots available in a typical core's memory unit.

The main motive for doing this is the shocking state of the codebases, which would not pass any QA criteria in a commercial environment: they are obscure, poorly commented, full of unexplained acronyms and abbreviations, and generally unmaintainable. I am used to writing compilers for Sony, with 100k+ developers and 500 million end users. Any error is clearly an issue here :o)

Anyway, thanks for taking a look.

Andy.

andy-thomason commented 9 years ago

Thanks, Fabian.

A good set of pointers.

I am still reading.

kloetzl commented 9 years ago

Main memory accesses will average around 200 cycles, especially if you pre-sort your search keys, but most CPUs will do a TLB refill for every new page accessed and this I have measured at around 3000 cycles. This is mitigated somewhat as the accesses are done in parallel using the 8 or so retirement slots available in a typical core's memory unit.

I never used to care about the TLB, but 3000 cycles sounds scary. Do you have something I can read up on?

I don't have the time to check whether these ideas are unique, and I don't care much. More important is to deliver a reliable searching service that generates 100% of the available results. This code is far from optimal, and it will be improved once the user base is established.

Basically, the Readme is missing a sentence on the problem you are trying to solve with the suffix array, i.e. a lot of short lookups.

andy-thomason commented 9 years ago

Chip makers are rather loath to reveal actual timings, so it is best to measure yourself with pairs of rdtsc instructions. What we really need is an effective non-temporal integer load instruction to bypass the complex L2 cache machinery. Much will depend on the architecture, especially as benchmarks rarely focus on latency. On the PS4, we had 256-bit-wide GDDR5 memory @ ~5G. I would murder for this on a PC!

I'll add some text to the readme to clarify. I hope I don't sound ungrateful, this is great feedback, thanks.

kloetzl commented 9 years ago

non-temporal integer load instruction

Now you have lost me. ☺

I hope I don't sound ungrateful, this is great feedback, thanks.

Ditto, I am learning a lot here.

andy-thomason commented 9 years ago

Non temporal loads fetch the data without retaining it in cache. Conversely NT stores write through the cache directly to DRAM if possible so that the cache does not need to be flushed.

Random reads and writes with NT loads and stores can be a lot faster, but not always.

There are SSE intrinsics for both, and the ARM and PowerPC architectures support them as well.

No, I'm learning a lot too!

kloetzl commented 9 years ago

Non temporal loads fetch the data without retaining it in cache. Conversely NT stores write through the cache directly to DRAM if possible so that the cache does not need to be flushed. Random reads and writes with NT loads and stores can be a lot faster, but not always.

This sounds super-useful w.r.t. suffix arrays, as T[SA[i]+k] is basically a cache-killer for different i. I really should look into that.