Koeng101 / dnadesign

A Go package for designing DNA.

Better Megamash #52

Closed CamelCaseCam closed 5 months ago

CamelCaseCam commented 8 months ago

Describe the desired feature/enhancement

Some suggestions for how to improve the megamash algorithm

Describe the solution you'd like

Just opening a thread here to brainstorm ideas and fixes. I'll keep this message short as an intro/title card.

CamelCaseCam commented 8 months ago

My suggestions

Ditch base64

In lib/align/megamash/megamash.go, don't base64-encode the DNA strings. You already have them in byte form, so base64 encoding makes them a third longer and costs extra compute to encode. Go doesn't use null-terminated strings, so you should be able to convert the bytes to a string directly (I think)
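
For example, a minimal sketch (hypothetical packed bytes, not the package's actual layout):

    packed := []byte{0x1B, 0x2C} // hypothetical 2-bit-packed DNA bytes
    // Go strings are immutable byte sequences with an explicit length,
    // so any bytes (including zeros) form a valid string and map key.
    kmers := map[string]bool{}
    kmers[string(packed)] = true // no base64 round trip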

Stop redoing work

You're redoing a lot of work between making a compressed megamash map and using it. It looks to me like you're decompressing and then recompressing the DNA. Instead, why not pull the length out of the bytes and reuse the compressed sequence? I feel like this would speed things up a bunch
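
Hypothetically, something shaped like this (illustrative names, not the package's actual API):

    // CompressedDNA keeps the base count next to the packed bytes so the
    // compressed form can be reused as-is instead of being decompressed
    // and recompressed between building the map and querying it.
    type CompressedDNA struct {
        Length int    // number of bases; the last byte may be partial
        Packed []byte // 2 bits per base
    }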

Use your own hashing algorithm

You're currently using the Go map hashing algorithm. Instead, why not use your own? Using your own hash algorithm would let you control it to optimize for performance, and there might be algorithms that would let you optimize megamash in other ways, e.g. reusing part of the encoding for each k-mer if you were using XOR

In fact, I'd really recommend just XORing the k-mers. Your compressed representation is dense, and I bet XOR would work about as well as Go's hash implementation, but Go's implementation is way slower because it's general. The other benefit to XOR is that it's commutative, so if you have the hash for the sequence x..x+n, you can generate the hash for x+1..x+n+1 with only two operations, which would be a huge speedup for larger k-mers
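
As a sketch of the rolling part (hashing raw bytes here for simplicity; a real version would presumably hash the packed form, and since XOR is order-insensitive, permuted k-mers would collide unless position is mixed in somehow):

    // xorHash XORs a window's bytes together; rolling the window one
    // base then costs exactly two XORs.
    func xorHash(window []byte) byte {
        var h byte
        for _, b := range window {
            h ^= b
        }
        return h
    }

    // Sliding a k-wide window across seq:
    //   h := xorHash(seq[:k])
    //   for i := k; i < len(seq); i++ {
    //       h = h ^ seq[i-k] ^ seq[i] // drop outgoing base, add incoming
    //   }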

It's late and I'm tired, so if I misunderstood the algorithm I'll take another look at it in the morning!

Koeng101 commented 8 months ago

You're redoing a lot of work between making a compressed megamash map and using it. It looks to me like you're decompressing and then recompressing the DNA. Instead, why not pull the length out of the bytes and reuse the compressed sequence? I feel like this would speed things up a bunch

You are completely right, I changed this in the latest update in the sequencing branch. Most of the allocations were in the compression bit, so I removed that and base64.

Here is how I personally do the benchmarking:

go test -bench=. -cpuprofile=cpu.out
go tool pprof cpu.out
(pprof) web

It seems like most of the time is actually taken doing reverse complementation, then maps. If there is a better way to do reverse complementation, that would be wonderful, because we use it so much in the code base. I haven't touched it in a long while.

[pprof call graph screenshot]

I think another opportunity is making a map[string]bool with only true values rather than a []map[string]bool. Then, it's just going through the hash map.

In fact, I'd really recommend just XORing the k-mers

If it is a simple solution, I'm down! I just want the code base to be as readable as possible, so idiomatic until we need otherwise (and this might be one place we need otherwise, if we get significant speedups)

Koeng101 commented 8 months ago

You know, maybe just inserting both k-mers is going to be better. Then you don't have to reverse complement at all during scoring. The vast majority of the compute will be on matching, not building.
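
A sketch of the build step under that approach (illustrative; reverseComplement is a stand-in for whatever helper the package uses):

    // addKmers inserts each k-mer and its reverse complement at build
    // time, so scoring never has to reverse complement the reads.
    func addKmers(kmers map[string]bool, seq string, k int) {
        for i := 0; i+k <= len(seq); i++ {
            kmer := seq[i : i+k]
            kmers[kmer] = true
            kmers[reverseComplement(kmer)] = true // stand-in helper
        }
    }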

CamelCaseCam commented 8 months ago

I'd insert both k-mers, but I've also thought of a way to optimize reverse complementation in general. I've got two optimizations that break from "nice" code to different degrees.

Use mutable strings

Go doesn't have mutable strings, which forces you to allocate a byte array and then fill that array with the reverse complement. You could make another function that takes in a mutable string (from a library) and modifies it in-place, avoiding the allocation. This would require you to switch to mutable strings everywhere, though, so it probably wouldn't be worth it
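
In Go the closest idiom is mutating a []byte in place; a minimal sketch, assuming only canonical bases:

    // complement maps each base to its complement (A<->T, C<->G).
    var complement = [256]byte{'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

    // reverseComplementInPlace overwrites buf with its reverse
    // complement, reusing the caller's buffer instead of allocating.
    func reverseComplementInPlace(buf []byte) {
        for i, j := 0, len(buf)-1; i <= j; i, j = i+1, j-1 {
            buf[i], buf[j] = complement[buf[j]], complement[buf[i]]
        }
    }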

Better complementation algorithm

You're using the characters as indices into a table. This is great, but you can do better if you limit the function to only considering canonical bases. Consider the string in 4-base segments. First, make four tables that map the characters to two-bit IDs, shifted by their position in the segment (the shift is to avoid manual shifting). OR these IDs together to get an 8-bit index into a complement table for the 4-base segment, and then write to your output array. This could be further optimized by using unsafe code to take the array output as a 32-bit integer and write the integer to the bytes in a single instruction.

If you go with the first part of my suggestion, you're trading (on average, probably) a CMP and branch instruction at each base for 0.75 of an OR instruction, 0.25 array lookups, and 0.25 CMP and branch. This will probably come out better, and the compiler may even vectorize it or implement the 32-bit thing.

Writing it as a single integer is obviously better and totally something I would do in C, but I'd understand if you don't want to do that. Also, you'll have to do a little bit of extra work to handle base numbers that aren't multiples of 4, but I suspect you could get at least a 2x speedup on complementation specifically just with the first part.
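
If it helps, here's a rough Go sketch of the first part (canonical bases only; names and layout are illustrative guesses, not tuned code):

    // enc[p][c] is the two-bit ID of base c, pre-shifted for position p
    // within a 4-base segment (p = 0 occupies the high bits).
    var enc [4][256]byte

    // comp[idx] holds the complemented characters of the 4-base segment
    // whose packed two-bit IDs equal idx.
    var comp [256][4]byte

    func init() {
        bases := [4]byte{'A', 'C', 'G', 'T'} // two-bit IDs 0..3
        for p := 0; p < 4; p++ {
            shift := uint(6 - 2*p)
            for id, base := range bases {
                enc[p][base] = byte(id) << shift
            }
            for idx := 0; idx < 256; idx++ {
                id := (idx >> shift) & 3
                comp[idx][p] = bases[3-id] // two-bit complement is 3 - id
            }
        }
    }

    // complement4 complements dna four bases at a time; the tail falls
    // back to per-base lookups. Reversal would still be a separate step.
    func complement4(dna []byte) []byte {
        out := make([]byte, len(dna))
        i := 0
        for ; i+4 <= len(dna); i += 4 {
            idx := enc[0][dna[i]] | enc[1][dna[i+1]] | enc[2][dna[i+2]] | enc[3][dna[i+3]]
            copy(out[i:i+4], comp[idx][:])
        }
        for ; i < len(dna); i++ {
            out[i] = comp[enc[0][dna[i]]][0]
        }
        return out
    }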

Koeng101 commented 8 months ago

The algorithm is intended to only run with the 4 bases, so that is alright. The problem I see with the conversion into two-bit IDs is that that's exactly what the compressed DNA function did, but it took a while to run.

It is of utmost importance to me that the code stays very readable in the long term, so I think I'm pretty fine with current performance with a few tweaks (like a single map, k-mers in both directions, etc.).

Could also use a specialized map algorithm like swissmap, but I'm honestly not sure if I would want to integrate that due to it being a dependency. Quite happy with the very minimal dependencies we have right now.

CamelCaseCam commented 8 months ago

Ah - I just took another look at the compress function. First, the % operator is really slow; I'd replace it. Also, using a switch statement here isn't good because it leads to a branch if the compiler doesn't optimize it away, and I can't tell if Go uses a hash map or binary search to implement switch statements. Instead, make a map and use something like the following code:

    nucID := map[rune]byte{'A': 0, 'C': 1, 'G': 2, 'T': 3}
    var dnaBytes []byte
    var currentByte byte
    subidx := 0
    for i, nucleotide := range dna {
        // subidx tracks the position within the current byte, avoiding %
        currentByte |= nucID[nucleotide] << (6 - subidx*2)
        subidx++
        if subidx == 4 || i == len(dna)-1 {
            dnaBytes = append(dnaBytes, currentByte)
            currentByte = 0
            subidx = 0
        }
    }

Also, and possibly more importantly, you're appending to dnaBytes but you haven't preallocated the memory you're going to use, which leads to extra allocations. Try using make and setting the capacity to len(dna)/4 (integer division) plus 1.
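
For instance:

    dnaBytes := make([]byte, 0, len(dna)/4+1) // one byte per 4 bases, plus a partial byte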

Koeng101 commented 8 months ago

Ah - I just took another look at the compress function. First, the % operator is really slow; I'd replace it. Also, using a switch statement here isn't good because it leads to a branch if the compiler doesn't optimize it away, and I can't tell if Go uses a hash map or binary search to implement switch statements. Instead, make a map and use something like the following code:

Ah this is all true! I was actually encountering other issues as well with compress, not necessarily related to speed, when working with blowq files (a new file format I had made to compress fastq files).

I removed compress dna from sequencing, so how about we continue this conversation at https://github.com/Koeng101/dnadesign/pull/51? That is probably where compress dna will live.

Koeng101 commented 5 months ago

I'm going to close this, as recent versions of megamash are working quite well in production.