estarriolvetch / solidity-bits

MIT License
33 stars 5 forks source link

Yul optimisation #2

Open sillytuna opened 2 years ago

sillytuna commented 2 years ago

Solidity does a pretty good job generally but some things help, for example:

        //uint8(LOOKUP_TABLE_256[(isolateLS1B256(bb) * DEBRUIJN_256) >> 248]);
        bytes memory table = LOOKUP_TABLE_256;
        assembly {
            bb := and(bb, sub(0, bb))
            bb := mul(bb, DEBRUIJN_256)
            bb := shr(248, bb)
            result := mload(add(table, add(bb, 1)))
            result := and(0xff, result)
        }
    }

Whereas this function was the same either way:

    function set(BitMap storage bitmap, uint256 index) internal {
        uint256 bucket = index >> 8;
        uint256 mask = MASK_INDEX_ZERO >> (index & 0xff);
        bitmap._data[bucket] |= mask;
        //assembly {
        //    let bucket := shr(8, index)
        //    let mask := shr(and(index, 0xff), 0x8000000000000000000000000000000000000000000000000000000000000000)
        //    mstore(0, bucket)
        //    mstore(0x20, bitmap.slot)
        //    let slot := keccak256(0, 0x40)
        //    let current := sload(slot)
        //    sstore(slot, or(current, mask))
        //}
    }

However, I've done a test version with Chiru's 4.2 ERC721A (combined with a lengthless array optimisation around storage for this and another struct) and it was easiest to simply port across the two functions used and merge them in rather than me go through and yul/test the solidity-bits lib.

Happy to help if you want the lib improved though.

sillytuna commented 2 years ago

Another function I converted:

        bytes memory table = LOOKUP_TABLE_256;
        assembly {
             let bucketIndex
             let bb
            //bucket = index >> 8;
            let bucket := shr(8, index)
            // index within the bucket
            //uint256 bucketIndex = (index & 0xff);
            bucketIndex := and(index, 0xff)

            for {
                // load a bitboard from the bitmap.
                //uint256 bb = bitmap._data[bucket];
                mstore(0, bucket)
                mstore(0x20, bitmap.slot)
                bb := sload(keccak256(0, 0x40))
                // offset the bitboard to scan from `bucketIndex`.
                //bb = bb >> (0xff ^ bucketIndex); // bb >> (255 - bucketIndex)
                bb := shr(xor(0xff, bucketIndex), bb)
                if iszero(bb) {
                    bucketIndex := 255
                }
            } iszero(bb) {
                mstore(0, bucket)
                bb := sload(keccak256(0, 0x40))
            } {
                if iszero(bucket) {
                    revert(0, 0)
                }
                bucket := sub(bucket, 1)
            }

            //bb.bitScanForward256()
            bb := and(bb, sub(0, bb))
            bb := mul(bb, DEBRUIJN_256)
            bb := shr(248, bb)
            bb := mload(add(table, add(bb, 1)))
            bb := and(0xff, bb)
            //result = (bucket << 8) | (bucketIndex - bb.bitScanForward256());
            result := or(shl(8, bucket), sub(bucketIndex, bb))
        }
    }
sillytuna commented 2 years ago

The 721A version and notes can be found in my comment here:

https://github.com/chiru-labs/ERC721A/issues/391#issuecomment-1199728087

estarriolvetch commented 2 years ago

@sillytuna This is interesting. Do you have a gas usage comparison (with solidity optimizer enabled)?

sillytuna commented 2 years ago

I was testing in situ. Marginal gains from the second two funcs but worth it on something like this, esp if could be called a lot.

Should be quick for you to put in your code and run a quick gas comparison tho.

estarriolvetch commented 2 years ago

My concern is it may affect the code readability. I also saw you lengthless array PR for ERC721A.

Maybe shifting from mapping to lengthless array for the bitMap datastructure will be a bigger improvement.

sillytuna commented 2 years ago

My view on that is it depends on use. Code like this can be needed in gas performant functions and will be write once use many. I'd keep the original solidity in hand to be read for sure and yul would need well testing and gas checks.

Your code is already quite hard to understand just because of the algorithm so I'm not sure yul really changes that. Can be well commented and include solidity comments.

estarriolvetch commented 2 years ago
        //uint8(LOOKUP_TABLE_256[(isolateLS1B256(bb) * DEBRUIJN_256) >> 248]);
        bytes memory table = LOOKUP_TABLE_256;
        assembly {
            bb := and(bb, sub(0, bb))
            bb := mul(bb, DEBRUIJN_256)
            bb := shr(248, bb)
            result := mload(add(table, add(bb, 1)))
            result := and(0xff, result)
        }
    }

I took a look and it seems the gas saving comes from two factors:

Unless there is a good reason, I don't think require(bb>0) should be removed. As for the merging the function, solidity optimizer now have this optimization stage and should take care that automatically based on the optimization parameter you specify. It's a deployment gas vs run time gas problem.

sillytuna commented 2 years ago

Makes sense. I forgot that I'd removed that require because of the specific use case essentially having it covered.

It's certainly worth checking what's worth being yul / inline and what isn't.