ethereum / eth-hash

The Ethereum hashing function, keccak256, sometimes (erroneously) called sha256 or sha3
MIT License
104 stars 64 forks source link

Consider dropping type checks in Keccak256 methods #16

Open gsalgado opened 6 years ago

gsalgado commented 6 years ago

The __call__() and new() methods there will check the type of their arguments using isinstance, and although that should be a cheap check, these are code paths where performance is paramount given the number of times they are called when we perform trie operations on a fully synced DB. For example, when doing a full sync around 10% of the time is being spent into isinstance calls. Even though they're probably not all originating from these checks, I think we should consider removing them as these two methods have type annotations that should make it clear enough what is wrong in case they crash when passed an argument of the wrong type

gsalgado commented 6 years ago

So, it turns out that during a full sync we're spending so much time on isinstance() calls because we're making an insane amount of them: 21926745 calls in a 3 minute run, but only 421334 (2%) of those are coming from these codepaths, so we might not gain anything by dropping these checks after all

pipermerriam commented 6 years ago

@gsalgado I'm betting most of those isinstance checks are happening in the validation code within the EVM thats checking things like this.:

gsalgado commented 6 years ago

The biggest offenders are actually rlp and trie: rlp/codec.py:72(encode_raw), trie/validation.py:11(validate_is_bytes), rlp/sedes/serializable.py:300(make_immutable). Just those 3 seem to account to almost half of all isinstance() calls

pipermerriam commented 6 years ago

Hrm, I could get onboard with dropping the check from trie. It's not clear to me how to drop the check from rlp but we do call make_immutable on every argument during object instantiation. It'd be nice to drop this but I like that when you pass a list into an rlp object it gets cast to a tuple.

davesque commented 6 years ago

I haven't looked at the specific code, but I wonder if the behavior of some of those checks could be duplicated by appropriately coded try/except blocks. According to the docs, try/except is very efficient when an exception is not raised: https://docs.python.org/3.6/faq/design.html#how-fast-are-exceptions.

pipermerriam commented 6 years ago

not the right solution but I did have a look at functools.singledispatch as a potential remedy for this but it's about 4x slower than a manual isinstance check (but way more elegant)

carver commented 6 years ago

rlp/codec.py:72(encode_raw) trie/validation.py:11(validate_is_bytes) rlp/sedes/serializable.py:300(make_immutable)

I think we can improve performance on all of these with assert statements. That way, default usages (and our tests) will work slower & safer, but we can disable asserts when running trinity client.

trie/validation.py:11(validate_is_bytes)

Trivial to move to asserts

rlp/sedes/serializable.py:300(make_immutable)

Not sure about this one. Can we simply disallow list as a valid input type, and use assert statements to reject lists?

rlp/codec.py:72(encode_raw)

This one is tougher, but here is one way to reduce the calls of isinstance by 2/3rds:

old code

    if isinstance(item, Atomic):
        if len(item) == 1 and item[0] < 128:
            return item
        payload = item
        prefix_offset = 128  # string
    elif not isinstance(item, str) and isinstance(item, collections.Sequence):
        payload = b''.join(encode_raw(x) for x in item)
        prefix_offset = 192  # list
    else:
        msg = 'Cannot encode object of type {0}'.format(type(item).__name__)
        raise EncodingError(msg, item)

new code

    if isinstance(item, Atomic):
        if len(item) == 1 and item[0] < 128:
            return item
        payload = item
        prefix_offset = 128  # string
    else:
        try:
            assert not isinstance(item, str)
        except AssertionError as exc:
            raise encoding_error(item) from exc

        try:
            payload = b''.join(encode_raw(x) for x in item)
        except TypeError as exc:
            raise encoding_error(item) from exc

        prefix_offset = 192  # list

def encoding_error(item)
        msg = 'Cannot encode object of type {0}'.format(type(item).__name__)
        return EncodingError(msg, item)
carver commented 6 years ago

I guess this should really be moved to py-evm/pyrlp/py-trie -- seems like we shouldn't remove the type checks in eth-hash until they are clearly a bottleneck.

pipermerriam commented 6 years ago

We can vendor the encode_raw implementation into py-trie. That's an easy optimization, allowing us to drop checks without any compromise to the actual rlp library. Bonus is that if we go ahead and vendor the decode as well we can drop the rlp dependency.

davesque commented 6 years ago

I've always felt weird about disabling asserts since some libraries might incorrectly use them as a general exception throwing mechanism. Thoughts on that?

carver commented 6 years ago

I hope that having libraries break when you turn off assertions would be very rare. I would certainly consider that broken. Is that something you've run into before? @davesque

davesque commented 6 years ago

@carver Actually, no thankfully. And I guess it's probably not a very big risk and we'd notice it pretty quickly if a library broke in optimized mode. But it does seems a bit undesirable that one should have to use -O to get decent performance. I've never run across a library or tool where that was the case. Maybe I've just never had occasion to use one that worked that way. Or is there a way to disable asserts on a per module basis? If so, that would probably be a decent way to go.

But maybe it's possible that some of that manual type checking could just be removed altogether since a wrong value type would lead pretty quickly to a fairly obvious error (that's why I suggested the try/except approach)? It just feels a bit like we might be trying to shoehorn the wrong concerns into a dynamically typed language.

pipermerriam commented 6 years ago

eurika! (maybe)

We should be able to get roughly the same level of confidence with type checking via mypy.

davesque commented 6 years ago

@pipermerriam Yes, I agree with that. Even if third party users of our libraries didn't use mypy, it would still give some guarantees about their internal correctness and the interactions between them.

pipermerriam commented 6 years ago

Well, currently we have type definitions for eth-keys defined like this: https://github.com/ethereum/py-evm/tree/master/stubs

We can move this stuff to a submodule and add type definitions for anything else which is stable, share the submodule across repositories and get some stronger type safety checks across library boundaries, so even consumers of py-trie/eth-hash/whatever should be able to still get solid type safety checks.

carver commented 6 years ago

We should be able to get roughly the same level of confidence with type checking via mypy.

Which comes at the cost of requiring that any module that uses the method to use mypy to get a reasonable developer experience. For example, if a module uses encode_raw() and sends a str in as the item (without running mypy checks), we end up in an infinite recursion.

Infinite recursion also happens if running with python -O in the above example, but at least that was "opt-out" of nicer error messages, rather than mypy, which requires "opt-in" to get nicer error messages.

pipermerriam commented 6 years ago

It still doesn't sit well with me to disable assertions for performance gains.