dtr-org / unit-e

A digital currency for a new era of decentralized trust
https://unit-e.io
MIT License

Optimize CBlockHeader methods in messages.py #825

Closed. castarco closed this 5 years ago.

castarco commented 5 years ago

This commit optimizes the calc_sha256 method of CBlockHeader by caching expensive computations and avoiding the storage of intermediate results.
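
For reference, the optimized method looks roughly like the sketch below; the "sha256 is None" cache guard assumes the usual messages.py caching pattern (the benchmark variant f1 further down drops that guard):

def calc_sha256(self):
    # Recompute only when no cached value is present.
    if self.sha256 is None:
        # Serialize the header in a single expression and hash it once,
        # reusing the digest for both fields.
        r_hash = hash256(
            struct.pack("<i", self.nVersion) +
            ser_uint256(self.hashPrevBlock) +
            ser_uint256(self.hashMerkleRoot) +
            ser_uint256(self.hash_witness_merkle_root) +
            struct.pack("<I", self.nTime) +
            struct.pack("<I", self.nBits) +
            struct.pack("<I", self.nNonce)
        )
        self.sha256 = uint256_from_str(r_hash)
        self.hash = encode(r_hash[::-1], 'hex_codec').decode('ascii')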

CPython does not optimize its bytecode output particularly well (this can be seen by using dis.dis to disassemble the functions), so just by avoiding consecutive "concatenate & assign" operations we can save up to 100 ns per call (measured on an 8th-gen Intel Core i9). Caching the hash computations saves us even more time, of course.
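
As a toy illustration of what dis.dis reveals (not code from the PR, just the general pattern), compare the bytecode CPython emits for the two styles:

import dis

def concat_assign(a, b, c):
    # Each "+=" statement compiles to a LOAD / INPLACE_ADD / STORE
    # round-trip through the local variable r.
    r = b""
    r += a
    r += b
    r += c
    return r

def concat_expr(a, b, c):
    # A single expression keeps intermediate results on the value stack,
    # with no stores in between.
    return a + b + c

dis.dis(concat_assign)
dis.dis(concat_expr)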

Hopefully this will help decrease build times a little; it won't hurt, in any case.

Signed-off-by: Andres Correa Casablanca andres@thirdhash.com

cornelius commented 5 years ago

Do you have numbers on what effect this has on overall build/run/test times?

Would this be something which should also be sent upstream?

castarco commented 5 years ago

> Do you have numbers on what effect this has on overall build/run/test times?

Not from a serious benchmark, but yes, I have numbers; I'll elaborate on them later, as I'm focused on other things right now. Besides the timings, I can also present the disassembled code.

> Would this be something which should also be sent upstream?

Yes, definitely.

cornelius commented 5 years ago

I agree that the code is nicer. The only caveat I see is that it might make syncing with upstream harder. That's why it would be ideal if the code were upstream as well. For that it will need to have a real performance impact; prettier bytecode probably won't be a strong enough reason for a change. That's why I asked for numbers. I think something qualitative to judge the impact would be enough.

castarco commented 5 years ago

> I agree that the code is nicer. The only caveat I see is that it might make syncing with upstream harder. That's why it would be ideal if the code were upstream as well. For that it will need to have a real performance impact; prettier bytecode probably won't be a strong enough reason for a change. That's why I asked for numbers. I think something qualitative to judge the impact would be enough.

OK, I'll split this PR in two tonight, to separate the hash-calculation caching from the other changes with less obvious advantages. I'll try to provide better data on why they are worthwhile, too.

I have other nice improvements in mind as well, also easy to back with data.

castarco commented 5 years ago

Rebased and reduced the scope of this PR.

castarco commented 5 years ago

The improvement will be minimal, but I see no reason to condition its integration on how big its impact is. There are reasons besides performance: cleaner code, and better code to serve as an example (people tend to copy code, replicating the bad along with the good). In fact, messages.py is clearly the result of a ton of copy & paste "work".

That said, I understand if we condition merging it on having the same patch applied to Bitcoin.

Below are the results of a micro-benchmark showing the times for three variants (f1 is the proposed one; I removed a conditional to make the change's effects visible in the micro-benchmark loop).

In [18]: def f1(self):
    ...:     r_hash = hash256(
    ...:         struct.pack("<i", self.nVersion) +
    ...:         ser_uint256(self.hashPrevBlock) +
    ...:         ser_uint256(self.hashMerkleRoot) +
    ...:         ser_uint256(self.hash_witness_merkle_root) +
    ...:         struct.pack("<I", self.nTime) +
    ...:         struct.pack("<I", self.nBits) +
    ...:         struct.pack("<I", self.nNonce)
    ...:     )
    ...:     self.sha256 = uint256_from_str(r_hash)
    ...:     self.hash = encode(r_hash[::-1], 'hex_codec').decode('ascii')

In [31]: def f2(self):
    ...:     r = (
    ...:         struct.pack("<i", self.nVersion) +
    ...:         ser_uint256(self.hashPrevBlock) +
    ...:         ser_uint256(self.hashMerkleRoot) +
    ...:         ser_uint256(self.hash_witness_merkle_root) +
    ...:         struct.pack("<I", self.nTime) +
    ...:         struct.pack("<I", self.nBits) +
    ...:         struct.pack("<I", self.nNonce)
    ...:     )
    ...:     self.sha256 = uint256_from_str(hash256(r))
    ...:     self.hash = encode(hash256(r)[::-1], 'hex_codec').decode('ascii')

In [32]: def f3(self):
    ...:     r = b''
    ...:     r += struct.pack("<i", self.nVersion)
    ...:     r += ser_uint256(self.hashPrevBlock)
    ...:     r += ser_uint256(self.hashMerkleRoot)
    ...:     r += ser_uint256(self.hash_witness_merkle_root)
    ...:     r += struct.pack("<I", self.nTime)
    ...:     r += struct.pack("<I", self.nBits)
    ...:     r += struct.pack("<I", self.nNonce)
    ...:     self.sha256 = uint256_from_str(hash256(r))
    ...:     self.hash = encode(hash256(r)[::-1], 'hex_codec').decode('ascii')

In [33]: %timeit f3(h)
100000 loops, best of 3: 16.8 µs per loop

In [34]: %timeit f2(h)
100000 loops, best of 3: 16.2 µs per loop

In [35]: %timeit f1(h)
100000 loops, best of 3: 14.5 µs per loop

The respective bytecode opcode counts are 130 for f1, 134 for f2, and 166 for f3.
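
Counts like these can be reproduced with the dis module; count_opcodes below is an illustrative helper, not part of the patch:

import dis

def count_opcodes(func):
    # dis.get_instructions yields one Instruction per bytecode opcode.
    return sum(1 for _ in dis.get_instructions(func))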

P.S.: I show some extra data here because I was asked for it, although I see every day how micro-optimizations get integrated without anyone asking for numbers (albeit hidden in unrelated PRs). The changes here are not esoteric at all, and they don't depend on obscure implementation details. It's quite easy to see why they decrease times.

We could discuss whether the impact is significant. I can tell you: no, it isn't. But as I said before, that's not my only reason to apply these changes.

P.S. 2: I didn't want to mix these changes with others of a completely different nature (like fixing tests). I could follow two approaches: make one big patch that generates a huge diff (then, for sure, the impact would be noticeable), or make sporadic, simple PRs that provide small, incremental improvements.

In the first case, it could be argued that I'm spending too much time on "aesthetic changes" while more important work sits in my backlog. That's why I preferred a simple patch.

In the second case, I'll never gather enough evidence to substantiate the performance improvements as required, unless I run exhaustive experiments that would take a ton of time, at which point the effort wouldn't compensate for the gains.

scravy commented 5 years ago

I do not see the amount of effort/time invested in this discussion justified by the value of the code in this pull request, which I do find debatable. I am a bit late to the discussion, and this branch has been squashed/rebased/force-pushed, so I am not sure which comment refers to what exactly, or which version of the code looks better in what way.

So after some digging I found this commit, which apparently was factored out into a separate pull request that is not there yet (btw, I used to get rapped on the knuckles for squashing/force-pushing, because it renders existing comments invalid and confuses latecomers to the party).

Anyhow, I find this: https://github.com/dtr-org/unit-e/commit/fc85ee6f52c2458a49e9e03261bfda53bf64661a

And I see this:

def serialize(self):
    r = b""
    r += self.locator.serialize()
    r += ser_uint256(self.hashstop)
    return r

being replaced with this:

def serialize(self):
    return (
        self.locator.serialize() +
        ser_uint256(self.hashstop)
    )

Personally, I find the former (original) version way better from a readability and maintainability point of view. The first version tells me: this code is dealing with a binary string here. The latter version tells me: anything can happen here, maybe I'm adding two numbers. The first version can be extended by adding just one line; the second requires me to also add a + to the previously last line. The first version allows me to set a breakpoint on the variable; the latter doesn't.

I couldn't care less about micro-optimizations geared towards a specific Python implementation.

Neither NACK nor ACK from my side. I tend to like the original version better and I do consider this whole discussion close to pointless. Sorry.

scravy commented 5 years ago

> P.S.: I show some extra data here because I was asked for it, although I see every day how micro-optimizations get integrated without anyone asking for numbers (albeit hidden in unrelated PRs). The changes here are not esoteric at all, and they don't depend on obscure implementation details. It's quite easy to see why they decrease times.

This sounds a bit sour. I do prefer my pull request discussions sweet and chipper :-P

> We could discuss whether the impact is significant. I can tell you: no, it isn't. But as I said before, that's not my only reason to apply these changes.

From my point of view, that is the only reason for these changes, since by a lot of other metrics I find the existing versions superior, as outlined in my previous comment.

castarco commented 5 years ago

> I do not see the amount of effort/time invested in this discussion justified by the value of the code in this pull request

I agree.

> I couldn't care less about micro-optimizations geared towards a specific Python implementation.

The changes here are not esoteric at all, and they don't depend on obscure implementation details.

> This sounds a bit sour. I do prefer my pull request discussions sweet and chipper :-P

https://github.com/dtr-org/unit-e/pull/721#issuecomment-469053880