Cyan4973 / mmc

Automatically exported from code.google.com/p/mmc
GNU Lesser General Public License v3.0
26 stars 12 forks source link

Simple optimization? #4

Open artelk opened 1 year ago

artelk commented 1 year ago
typedef struct {
    const BYTE* levelUp;
    const BYTE* nextTry;
} selectNextHop_t;

(apart from obvious using offets instead of pointers) you could cache the next (MINMATCH+1) byte of the sequence

typedef struct {
    const int32 levelUp: 28;
    const int32 nextTry: 28;
    int8 nextByte: 8;
} selectNextHop_t;

while promoting to the next level the nextByte should be updated to keep the next-next symbol (symbol at MINMATCH+LEVEL offset). While searching you will start from comparing with the nextByte from the structure (and that value will be different in most cases). And I believe in theory that can signifinactly reduce the amount of memory jumps into the buffer and improve performance 😃

Cyan4973 commented 1 year ago

This is a sound idea. It seems this would reduce the amount of memory accesses, so it could improve speed.

The bit twiddling hacks to store the necessary information into a packed structure tends to be a bit concerning, as there are ways this could end badly, both for portability and for performance. But I guess this is secondary topic that could be addressed in a second step.

artelk commented 1 year ago

Dear Yann, Could you please explain one thing. Capture How the link between the node A and the node C is implemented? Reading the code didn't help me to find the answer.. 😃 From what I understand the nextTry & MASK is the index of the next (previous in time) node at the lowest level (it would give you the link from A to B, not to C). How you were able to avoid adding a distinct explicit field like indexOfTheNextNodeOnTheSameLevel which value can be different from nextTry & MASK for the nodes on the levels above the lowest one? (upd: on the lowest level it can also be different to skip the nodes which were pushed to the upper levels)

Cyan4973 commented 1 year ago

It's been a while since I looked at this code.

If I understand correctly, this picture comes from the MMC homepage, and explains how a new chain, of guaranteed minimum match length N+1, is created.

Considering the previous steps, I don't think that the link from A to C existed beforehand. Instead, it is discovered, by scrubbing the previous chain (mml=n) Once a position of length N+1 is discovered, it's extracted from the mml=n chain, and becomes a new chained member to the chain mml=n+1. This is the operation that the red arrows symbolizes.

artelk commented 1 year ago

Yes, the picture is from that page 😄 I mean I thought the discovered link A->C is probably somehow stored for future use and so next time you will search for the same N+1 string you will be able to jump from A to C not visiting the B and other intermediate nodes on the lower level.

Cyan4973 commented 1 year ago

Once it's discovered, it's indeed saved, as a "good" link. But it first needs to be discovered. This is done by scrubbing the list of "unclassified" candidates.

artelk commented 1 year ago

My question is how this saving is done. I believe some field of the node A should keep that good link. But the nodes only have const BYTE* levelUp and const BYTE* nextTry.

The value A.nextTry % ChainTableSize is an index of B, not C. Maybe the nodes themselves are somehow moved inside the chain table? I mean the node C is copied to the position of the node B in the table and before that the node B is moved to some other place and so on.

Cyan4973 commented 1 year ago

The new chain starts with levelUp, which essentially states that, from now on, all candidates in the chain are guaranteed to be matches of at least N+1 bytes. Then, all remaining positions within the underlying N chain are scrubbed, and any match of length N+1 is brought into the N+1 chain, using the nextTry link. Doing so, such position leaves its previous chain, which was only guaranteeing N bytes (making it shorter). But note that, on reaching the N+1 chain, they are now considered "unsorted". Meaning the next time we reach this part of the chain, they will be tested for N+2 promotion. The promotion process ends on reaching the previous Gateway to N+1 (if it existed) or on reaching the end of the N chain.