Open gsalgado opened 6 years ago
So, it turns out that during a full sync we're spending so much time on isinstance()
calls because we're making an insane amount of them: 21926745 calls in a 3 minute run, but only 421334 (2%) of those are coming from these codepaths, so we might not gain anything by dropping these checks after all
@gsalgado I'm betting most of those isinstance
checks are happening in the validation code within the EVM thats checking things like this.:
The biggest offenders are actually rlp and trie: rlp/codec.py:72(encode_raw), trie/validation.py:11(validate_is_bytes), rlp/sedes/serializable.py:300(make_immutable). Just those 3 seem to account to almost half of all isinstance()
calls
Hrm, I could get onboard with dropping the check from trie
. It's not clear to me how to drop the check from rlp
but we do call make_immutable
on every argument during object instantiation. It'd be nice to drop this but I like that when you pass a list into an rlp object it gets cast to a tuple.
I haven't looked at the specific code, but I wonder if the behavior of some of those checks could be duplicated by appropriately coded try/except blocks. According to the docs, try/except is very efficient when an exception is not raised: https://docs.python.org/3.6/faq/design.html#how-fast-are-exceptions.
not the right solution but I did have a look at functools.singledispatch
as a potential remedy for this but it's about 4x slower than a manual isinstance
check (but way more elegant)
rlp/codec.py:72(encode_raw) trie/validation.py:11(validate_is_bytes) rlp/sedes/serializable.py:300(make_immutable)
I think we can improve performance on all of these with assert
statements. That way, default usages (and our tests) will work slower & safer, but we can disable asserts when running trinity client.
trie/validation.py:11(validate_is_bytes)
Trivial to move to asserts
rlp/sedes/serializable.py:300(make_immutable)
Not sure about this one. Can we simply disallow list as a valid input type, and use assert
statements to reject lists?
rlp/codec.py:72(encode_raw)
This one is tougher, but here is one way to reduce the calls of isinstance
by 2/3rds:
old code
if isinstance(item, Atomic):
if len(item) == 1 and item[0] < 128:
return item
payload = item
prefix_offset = 128 # string
elif not isinstance(item, str) and isinstance(item, collections.Sequence):
payload = b''.join(encode_raw(x) for x in item)
prefix_offset = 192 # list
else:
msg = 'Cannot encode object of type {0}'.format(type(item).__name__)
raise EncodingError(msg, item)
new code
if isinstance(item, Atomic):
if len(item) == 1 and item[0] < 128:
return item
payload = item
prefix_offset = 128 # string
else:
try:
assert not isinstance(item, str)
except AssertionError as exc:
raise encoding_error(item) from exc
try:
payload = b''.join(encode_raw(x) for x in item)
except TypeError as exc:
raise encoding_error(item) from exc
prefix_offset = 192 # list
def encoding_error(item)
msg = 'Cannot encode object of type {0}'.format(type(item).__name__)
return EncodingError(msg, item)
I guess this should really be moved to py-evm/pyrlp/py-trie -- seems like we shouldn't remove the type checks in eth-hash until they are clearly a bottleneck.
We can vendor the encode_raw
implementation into py-trie. That's an easy optimization, allowing us to drop checks without any compromise to the actual rlp
library. Bonus is that if we go ahead and vendor the decode
as well we can drop the rlp
dependency.
I've always felt weird about disabling asserts since some libraries might incorrectly use them as a general exception throwing mechanism. Thoughts on that?
I hope that having libraries break when you turn off assertions would be very rare. I would certainly consider that broken. Is that something you've run into before? @davesque
@carver Actually, no thankfully. And I guess it's probably not a very big risk and we'd notice it pretty quickly if a library broke in optimized mode. But it does seems a bit undesirable that one should have to use -O
to get decent performance. I've never run across a library or tool where that was the case. Maybe I've just never had occasion to use one that worked that way. Or is there a way to disable asserts on a per module basis? If so, that would probably be a decent way to go.
But maybe it's possible that some of that manual type checking could just be removed altogether since a wrong value type would lead pretty quickly to a fairly obvious error (that's why I suggested the try/except approach)? It just feels a bit like we might be trying to shoehorn the wrong concerns into a dynamically typed language.
eurika! (maybe)
We should be able to get roughly the same level of confidence with type checking via mypy
.
@pipermerriam Yes, I agree with that. Even if third party users of our libraries didn't use mypy, it would still give some guarantees about their internal correctness and the interactions between them.
Well, currently we have type definitions for eth-keys
defined like this: https://github.com/ethereum/py-evm/tree/master/stubs
We can move this stuff to a submodule and add type definitions for anything else which is stable, share the submodule across repositories and get some stronger type safety checks across library boundaries, so even consumers of py-trie/eth-hash/whatever
should be able to still get solid type safety checks.
We should be able to get roughly the same level of confidence with type checking via mypy.
Which comes at the cost of requiring that any module that uses the method to use mypy to get a reasonable developer experience. For example, if a module uses encode_raw()
and sends a str
in as the item (without running mypy checks), we end up in an infinite recursion.
Infinite recursion also happens if running with python -O
in the above example, but at least that was "opt-out" of nicer error messages, rather than mypy, which requires "opt-in" to get nicer error messages.
It still doesn't sit well with me to disable assertions for performance gains.
The
__call__()
andnew()
methods there will check the type of their arguments usingisinstance
, and although that should be a cheap check, these are code paths where performance is paramount given the number of times they are called when we perform trie operations on a fully synced DB. For example, when doing a full sync around 10% of the time is being spent into isinstance calls. Even though they're probably not all originating from these checks, I think we should consider removing them as these two methods have type annotations that should make it clear enough what is wrong in case they crash when passed an argument of the wrong type