Closed kloetzl closed 8 years ago
Hi Fabian,
Thanks for the feedback. Feel free to send me a pull request for the README.
I don't have the time to check whether these ideas are unique, and I don't care much. More important is to deliver a reliable search service that returns 100% of the available results. This code is far from optimal and will be improved once a user base is established.
I am very keen to establish additional use cases, however, and it would be great if you could help with that.
I have experimented with a large number of search techniques, such as batch inner joins, which are ideal for large numbers of searches but complex to implement and difficult to understand. This implementation is a compromise and delivers adequate performance of around 25k human genome searches per second per thread on an average machine for 100-character strings. We could probably push this to 50k with some work, but readability would suffer.
Main memory accesses will average around 200 cycles, especially if you pre-sort your search keys, but most CPUs will do a TLB refill for every new page accessed, which I have measured at around 3000 cycles. This is mitigated somewhat because the accesses are done in parallel using the 8 or so retirement slots available in a typical core's memory unit.
The main motive for doing this is the shocking state of existing codebases, which would not pass any QA criteria in a commercial environment; they are obscure, poorly commented, full of acronyms and abbreviations, and generally unmaintainable. I am used to writing compilers for Sony, with 100k+ developers and 500 million end users. Any error there is clearly an issue :o)
Anyway, thanks for taking a look.
Andy.
Thanks, Fabian.
A good set of pointers.
I am still reading,
> Main memory accesses will average around 200 cycles, especially if you pre-sort your search keys, but most CPUs will do a TLB refill for every new page accessed and this I have measured at around 3000 cycles. This is mitigated somewhat as the accesses are done in parallel using the 8 or so retirement slots available in a typical core's memory unit.
I never used to care about the TLB. But 3000 cycles sounds scary. Do you have a readup for me?
> I don't have the time to check to see if these ideas are unique and don't care much. More important is to deliver a reliable searching service that generates 100% of the available results. This code is far from optimal and it will be improved once the user base is established.
Basically, the README is missing a sentence on the problem you are trying to solve with the suffix array, i.e. a lot of short lookups.
Chip makers are rather loath to reveal actual timings, so it is best to measure yourself with pairs of rdtscs. What we really need is an effective non-temporal integer load instruction to bypass the complex L2 cache machinery. Much will depend on the architecture, especially as benchmarks rarely focus on latency. On the PS4 we had 256-bit-wide GDDR5 memory @ ~5G. I would murder for this on a PC!
I'll add some text to the README to clarify. I hope I don't sound ungrateful; this is great feedback, thanks.
> non-temporal integer load instruction
Now you have lost me. ☺
> I hope I don't sound ungrateful, this is great feedback, thanks.
Ditto, I am learning a lot here.
Non-temporal loads fetch data without retaining it in the cache. Conversely, NT stores write through the cache directly to DRAM where possible, so the cache does not need to be flushed.
Random reads and writes with NT loads and stores can be a lot faster, but not always.
There are SSE intrinsics for both, and the ARM and PowerPC architectures support them as well.
No, I'm learning a lot too!
> Non temporal loads fetch the data without retaining it in cache. Conversely NT stores write through the cache directly to DRAM if possible so that the cache does not need to be flushed. Random reads and writes with NT loads and stores can be a lot faster, but not always.
This sounds super-useful w.r.t. suffix arrays, as T[SA[i]+k] is basically a cache-killer for different i. I really should look into that.
The section in the README on suffix arrays is flawed.
Yes, and no. Most use an enhanced suffix array.
"llo" < "lo"!
Accessing main memory should take a few hundred (around 200) cycles, not thousands.
"Likely" is the important word here. See this paper for the distribution of match lengths. Also, here is a recent paper using your uniqueness idea to create an SA.
What is far better really depends on what one wants to do with the SA. Searching for substrings, MEMs, MUMs and computing matching statistics are completely different things! Some require the LCP array, or even RMQ (CLD).
Probably one of the best C++ libraries for SAs is SDSL-lite. Go check it out.
Best, Fabian
P.S. You should pick a license: http://choosealicense.com/