filecoin-project / rust-fil-proofs

Proofs for Filecoin in Rust

Optimize Preprocessing #397

Closed porcuquine closed 5 years ago

porcuquine commented 5 years ago

Description

Preprocessing, i.e. io::fr32::write_padded, needs to be optimized so it runs as fast as possible. All time spent preprocessing must be included in the eventual cost budget of replication, so we should drive it as close as possible to the cost of simply writing the data bits to disk.
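For context, the padding in question is Fr32 padding: after every 254 data bits, two zero bits are inserted so that each 256-bit chunk fits in a BLS12-381 field element. A minimal sketch of the resulting size arithmetic (the function name `padded_bit_len` is mine, and the real `write_padded` tail handling may differ):

```rust
/// Sketch: padded bit length after inserting 2 zero bits per full
/// 254-bit block. Edge cases at the tail may differ in the real code.
fn padded_bit_len(unpadded_bits: u64) -> u64 {
    unpadded_bits + (unpadded_bits / 254) * 2
}

fn main() {
    // One Fr-sized block: 254 data bits pad out to 256 bits.
    assert_eq!(padded_bit_len(254), 256);
    // 127 bytes (1016 bits) of data pad out to 128 bytes (1024 bits).
    assert_eq!(padded_bit_len(1016), 1024);
    println!("ok");
}
```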

Background

Performance issues and benchmarks: #207.

Please run these benchmarks on your development machine and add a copy of the relevant 'before' and 'after' results to the PR so we can quickly see what kind of results are achieved.

Tracking and suggested starting points

See #160 (Preprocessing section).

Acceptance Criteria

Suggestion

Strongly consider implementing this in such a way that the existing functionality is retained, and write a test which ensures the optimized code produces output identical to the existing code for random data of all sizes. We believe the current implementation is correct, and there are many subtle ways this work can go wrong. Whether the current 'reference implementation' needs to be retained longer-term can be sorted out later, but working this way may help you write the code and make it easier to evaluate the work.

If you find your implementation and the current one differ in some way, it is also possible that the current implementation is flawed. In that case, before continuing you should:

If you find the problem but can't fix it, still file the bug first, and reference a PR which includes the test, so someone else can take it over and fix the bug.

porcuquine commented 5 years ago

Bonus: @schomatis, when you start work on this set the Pipeline to 'In Progress' and verify that the issue is moved to the right column on the board.

schomatis commented 5 years ago

Researching simple ways to profile Rust (suggestions welcome). Besides cargo bench and similar tools that measure the time the entire padding process takes, it would be nice to have a tool that could expose the most expensive calls (bottlenecks) in the code. Ideally I would want something simple like the pprof tool for Go, which with a few lines can produce (for example) a descriptive call graph with CPU times.

That is, the main optimization seems to be the bit processing part: avoiding BitVec and using plain bit shifting, since BitVec relies on the expensive Iterator interface for most of the heavy lifting. While that makes plenty of sense, it would be nice to have confirmation at the method-call level before actually starting to replace BitVec all over the place (or at least during that process). (Mostly a desire, not a hard requirement.)
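To make the contrast concrete, here is a rough sketch (illustrative names, not the actual fr32 code) of per-bit pushing versus completing a partial output byte with two shifts and an OR, assuming the LSB-first bit ordering the code uses:

```rust
/// Bit-by-bit: roughly what BitVec's `extend` does via `push`,
/// one call (and one bounds check) per bit.
fn append_bitwise(acc: &mut Vec<bool>, byte: u8) {
    for i in 0..8 {
        acc.push((byte >> i) & 1 == 1);
    }
}

/// Byte-at-a-time: `bit_offset` is how many bits of the last output
/// byte are already used (must be < 8, and `out` must be non-empty
/// when it is non-zero).
fn append_shifted(out: &mut Vec<u8>, byte: u8, bit_offset: u32) {
    if bit_offset == 0 {
        out.push(byte);
    } else {
        let last = out.last_mut().expect("partial byte must exist");
        *last |= byte << bit_offset;        // low bits finish the partial byte
        out.push(byte >> (8 - bit_offset)); // high bits start a new byte
    }
}

fn main() {
    let mut bits = Vec::new();
    append_bitwise(&mut bits, 0b1010_0000);
    assert_eq!(bits.len(), 8); // eight pushes for one byte

    let mut out = vec![0b0000_0001u8]; // one bit already written
    append_shifted(&mut out, 0xFF, 1); // two shifts, one OR
    assert_eq!(out, vec![0xFF, 0x01]);
    println!("ok");
}
```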

schomatis commented 5 years ago

In BitVec, when moving chunks of bits (extend) we process (push) one bit at a time,

https://github.com/myrrlyn/bitvec/blob/cc052ff5972045457967d5e20fd64ec5f2295bbf/src/vec.rs#L871-L873

which seems like an expensive operation,

https://github.com/myrrlyn/bitvec/blob/cc052ff5972045457967d5e20fd64ec5f2295bbf/src/vec.rs#L173-L189

https://github.com/myrrlyn/bitvec/blob/cc052ff5972045457967d5e20fd64ec5f2295bbf/src/vec.rs#L408-L427

porcuquine commented 5 years ago

Researching simple ways to profile Rust (suggestions welcome). Besides cargo bench and similar tools that measure the time the entire padding process takes, it would be nice to have a tool that could expose the most expensive calls (bottlenecks) in the code. Ideally I would want something simple like the pprof tool for Go, which with a few lines can produce (for example) a descriptive call graph with CPU times.

@dignifiedquire Can you point @schomatis in the right direction?

dignifiedquire commented 5 years ago

On macOS and Linux you can use https://github.com/dignifiedquire/rust-gperftools; example usage is in filecoin-proofs/examples/drgporep-vanilla.rs.

If you are on Linux you can also use valgrind or perf: https://gist.github.com/KodrAus/97c92c07a90b1fdd6853654357fd557a

schomatis commented 5 years ago

I ended up using cargo-profiler (I didn't manage to get the FlameGraph SVG output format working, and cpuprofiler seemed interesting but a bit more involved to set up), although it doesn't seem to be maintained, so I'm not sure I would recommend it. It was enough for the current exploratory task because I'm vaguely familiar with Callgrind (invoked with these args), and (when keeping the callgrind.out file) I could use kcachegrind for a simple visualization.

Profiling the unit tests (not the actual benchmarks which would take much more time) and extracting the test binary:

uname -a
# Linux 4.13.0-36-generic #40~16.04.1-Ubuntu SMP x86_64 GNU/Linux

# In `rust-proofs/sector-base/src/io/`
cargo test --target-dir fr32-test-dir
cd fr32-test-dir/debug
cargo profiler callgrind --bin ./sector_base-8984ec3eef40e49c

Tried to use rustfilt to demangle the callgrind.out file but it didn't work (not sure if I'm using it correctly),

rustfilt -i callgrind.out  -o callgrind-demangled.out
diff callgrind.out callgrind-demangled.out
# Nothing.

The output seems to confirm the bottlenecks in BitVec, particularly do_with_tail and push,

70,205,544 (6.2%) vec.rs:_..alloc..vec..Vec..T....as..core..ops..deref..Deref..::deref
-----------------------------------------------------------------------
70,173,072 (6.2%) ???:_..bitvec..vec..BitVec..E$C$..T....as..core..ops..deref..Deref..::deref
-----------------------------------------------------------------------
59,839,110 (5.3%) ptr.rs:core::ptr::_..impl....mut..T..::is_null
-----------------------------------------------------------------------
58,580,340 (5.2%) mod.rs:core::slice::from_raw_parts
-----------------------------------------------------------------------
51,954,071 (4.6%) slice.rs:_..bitvec..slice..BitSlice..E$C$..T....::len
-----------------------------------------------------------------------
41,342,080 (3.6%) ???:_..bitvec..vec..BitVec..E$C$..T....::do_with_tail       <-------
-----------------------------------------------------------------------
40,089,220 (3.5%) mod.rs:core::slice::_..impl....T....::len
-----------------------------------------------------------------------
35,817,597 (3.2%) raw_vec.rs:_..alloc..raw_vec..RawVec..T$C$..A....::ptr
-----------------------------------------------------------------------
33,394,156 (2.9%) ???:_..bitvec..vec..BitVec..E$C$..T....::push                 <-------
-----------------------------------------------------------------------
28,686,072 (2.5%) bits.rs:bitvec::bits::Bits::set
-----------------------------------------------------------------------
28,623,954 (2.5%) ???:_..bitvec..vec..BitVec..E$C$..T....as..core..ops..index..Index....usize$C$..u8......::index
-----------------------------------------------------------------------
21,373,936 (1.9%) bit.rs:_..u8..as..core..ops..bit..Shl..::shl
-----------------------------------------------------------------------
21,056,160 (1.9%) bits.rs:bitvec::bits::Bits::join
-----------------------------------------------------------------------
20,766,398 (1.8%) bits.rs:bitvec::bits::Bits::get
-----------------------------------------------------------------------

Seems enough to start coding a simple bit-shifting alternative to compare performance.


Note: I'm using a different test profile (taken from the Rust book) which is supposed to help when debugging. I haven't checked all these options, so they may have an impact on the benchmark (especially the reduced optimization level, opt-level). However, since I'm just doing very small PoC benchmarks (to get a rough idea of the code hot spots), it seems benign enough to keep throughout these exploratory tests; the actual truth of the optimization results will be judged by cargo bench --all preprocess --release.

# The testing profile, used for `cargo test`.
[profile.test]
opt-level = 0
debug = 2
rpath = false
lto = false
debug-assertions = true
codegen-units = 16
incremental = true
overflow-checks = true
porcuquine commented 5 years ago

As long as you're aware there are likely to be non-trivial differences between optimization levels, I don't see an issue. I'd definitely try to profile against --release before drawing conclusions.

Most importantly, I think we're in agreement that the first actual change is moving to bit-shifting, so there's less burden on the profiling (i.e. don't sweat it for now). It may become more important later when using it to sniff out further optimizations.

Also, I think I mentioned this elsewhere but perhaps ephemerally: I suspect we can do even better than shifting a byte at a time (rather than pushing a bit). If we process chunks of the largest unsigned integer we can (interpreted with the correct endianness) and bit-shift those, we reduce the operation count by another factor; i.e., if we operate on u16, that's half as many shifts as if we operate on u8. I think to maximize that and get very large integers (I believe u128 is the limit) we may need to opt into an experimental feature. Either way, I wanted to put that on your radar. If you design the shifting optimization to be generic over the size of the unsigned int, that will make it easier to observe the effect of changing it and let us defer the decision of what the best size actually is.
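One way to keep the chunk width swappable is to bound the shift step on the standard shift/OR operator traits. This is only an illustrative sketch (the function name and signature are mine, not from the code base):

```rust
use std::ops::{BitOr, Shl, Shr};

/// Advance the window by `k` bits, generic over the chunk width `bits`.
/// Big-endian view: shift out the `k` processed bits of `cur` and pull
/// in the top `k` bits of `next`. Assumes 0 < k < bits.
fn shift_join<T>(cur: T, next: T, k: u32, bits: u32) -> T
where
    T: Copy + Shl<u32, Output = T> + Shr<u32, Output = T> + BitOr<Output = T>,
{
    (cur << k) | (next >> (bits - k))
}

fn main() {
    // u8: [160] [160] -> 65 after taking one bit from the next byte.
    assert_eq!(shift_join(160u8, 160u8, 1, 8), 65);
    // u16: the same two bytes as one chunk, 41120 -> 16705 in one shift
    // (the trailing 1 arrives from a third, following chunk).
    assert_eq!(shift_join(41120u16, 0u16, 1, 16) | 1, 16705);
    println!("ok");
}
```

Comparing u8 against u64 or u128 then only means changing the type arguments at the call site, which is exactly what makes the width decision deferrable.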

dignifiedquire commented 5 years ago

Profiling the unit tests (not the actual benchmarks which would take much more time) and extracting the test binary:

I think profiling the unit tests is risky, as there is likely a lot of unnormalized noise in there.

Using opt-level = 0 is suspicious to me, as this means there are no optimizations, making your profile not really accurate in terms of realistic usage.

If you use gperftools the overhead is a lot lower than with valgrind, so you can easily run it with the benchmarks and get more normalized stats.

I can help you get it running if you run into any issues with it.

dignifiedquire commented 5 years ago

Looking at those extend and push operations, it seems as though they could be optimized. But I think it would be better to improve this perf in the library rather than adding a bunch of shifts into our code, if we can. This would have the benefit of speeding up perf in other places where we use this lib, and allow us to continue to use a nice abstraction.

schomatis commented 5 years ago

Most importantly, I think we're in agreement that the first actual change is moving to bit-shifting, so there's less burden on the profiling (i.e. don't sweat it for now).

Agreed, this was mostly a toy exercise to get me started in the profiling tools.

If we process chunks of the largest unsigned integer

Yes, my idea is to implement the basic bit manipulation algorithm seen in any amd64 architecture.

(interpreted with the correct endianness)

I did mean to ask: since I've seen we pack in a LittleEndian format at the bit level, but we work with inputs/outputs (guessing here) in a big-endian ordering, that sounds like an expensive conversion (I'll need to check what algorithms are available for that).

But I think it would be better to improve this perf in the library, rather than adding a bunch of shifts into our code if we can.

Agreed. Although I think any iterative algorithm would ultimately under-perform bit shifts at the register level (ignoring the LittleEndian issue in this argument), what I would ideally like to do is provide an extend variation that works over slices (instead of iterators). I think that would be the natural place for the bit-shifting logic we're talking about, since BitVec is just a wrapper around a Vec of Bits elements (in the BitVec terminology), which could be the u64 integers proposed before (assuming here, without any concrete knowledge, that Vec is not much more expensive than a &[u64] slice). That being said, I have no idea how contributing to an independently-developed crate would work, or what the RTT would be, and I'm not a fan of just forking either; this is something that will need to be discussed in a https://github.com/myrrlyn/bitvec issue.

(The previous paragraph is just a continuation of the proposed idea but it's non blocking, I can start the implementation in the fr32 module and later move it to BitVec if we decide on that without much cost I think.)
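A rough sketch of what such a slice-based extend could look like outside of BitVec (the function name and byte-level representation here are hypothetical, not the bitvec API):

```rust
/// Sketch of a slice-based `extend`: shift whole bytes into place
/// instead of pushing one bit at a time through an Iterator.
/// `used_bits` is how many bits of the last output byte are occupied
/// (must be < 8; `out` must be non-empty when it is non-zero). Since
/// each source byte contributes exactly 8 bits, the offset is unchanged.
fn extend_from_slice_bits(out: &mut Vec<u8>, used_bits: u32, src: &[u8]) {
    for &byte in src {
        if used_bits == 0 {
            out.push(byte);
        } else {
            let last = out.last_mut().expect("partial byte must exist");
            *last |= byte << used_bits;        // complete the partial byte
            out.push(byte >> (8 - used_bits)); // carry the high bits over
        }
    }
}

fn main() {
    let mut out = vec![0b0000_0001u8]; // one bit already written
    extend_from_slice_bits(&mut out, 1, &[0xFF, 0x00]);
    assert_eq!(out, vec![0xFF, 0x01, 0x00]);
    println!("ok");
}
```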

schomatis commented 5 years ago

I did mean to ask: since I've seen we pack in a LittleEndian format at the bit level, but we work with inputs/outputs (guessing here) in a big-endian ordering, that sounds like an expensive conversion (I'll need to check what algorithms are available for that).

While studying the BitVec implementation (to replicate it in https://github.com/filecoin-project/rust-proofs/pull/435/commits/b91eef8dca9598116e07a6cce2e7b5aae2b6b587) I realized there is no need to do any bit inversion. The semantics of this LittleEndian cursor are different from the typical endianness in applications or networking (where bytes may need to be swapped around); here it just means: iterate the bits in a byte from right (LSB) to left (MSB). Since BitVec reads and writes in that same order, the difference between LittleEndian and BigEndian is just interleaving the zero padding bits on the right instead of on the left of the output byte.
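A tiny sketch of the bit ordering being described, assuming nothing beyond the LSB-first convention (the helper name is mine):

```rust
/// Iterate a byte's bits the way bitvec's LittleEndian cursor does:
/// right (LSB) to left (MSB). No byte swapping is involved.
fn bits_lsb_first(byte: u8) -> [u8; 8] {
    let mut out = [0u8; 8];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = (byte >> i) & 1;
    }
    out
}

fn main() {
    // 0b0000_0101 read LSB-first yields 1, 0, 1, then zeros.
    assert_eq!(bits_lsb_first(0b0000_0101), [1, 0, 1, 0, 0, 0, 0, 0]);
    // With this convention, padding zeros simply occupy the
    // last-iterated (most significant) bits of the output byte:
    let padded = 0b0011_1111u8; // 6 data bits, then 2 zero pad bits
    assert_eq!(padded >> 6, 0); // top two bits are the padding
    println!("ok");
}
```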

porcuquine commented 5 years ago

Yes, sorry I didn't respond to your earlier comment. The endianness in BitVec is at the bit level. This won't matter for what you're doing, but to make sure we're synchronized in understanding: I think the choice of endianness also affects the direction (left vs right) of the bit-shift we will need to perform.

schomatis commented 5 years ago

I think choice of endianness also affects the direction (left vs right) of the bit-shift we will need to perform.

Yes,

https://github.com/filecoin-project/rust-proofs/blob/171e788daa5dc522f55b845f9951de4013de7ffd/sector-base/src/io/fr32.rs#L421-L444

porcuquine commented 5 years ago

As far as the present work is concerned, what I am suggesting is that we not use BitVec at all. Instead, we read multiple bytes at a time, remove the 'remainder bits' and add them to the previously-processed chunk, then bit-shift the entire multi-byte chunk as a single operation.

I believe in order for this to work, we will need to treat the chunks as big-endian unsigned integers and left-shift.

Here's a sketch of what I mean, using u16 for simplicity. Here is the raw data, represented in binary:

[1 0 1 0 0 0 0 0] [ 1 0 1 0 0 0 0 0] [1 …]

Let's say we want to process the first bit as part of the previous chunk, then shift everything else down. Our goal is to end up with the following two-byte chunk:

[0 1 0 0 0 0 0 1] [ 0 1 0 0 0 0 0 1]

For the sake of clarity, let's use chunks of u8, in which case endianness is irrelevant. Initially, we have:

[160] [160]

and we want to transform this into:

[65] [65]

In that case, we would extract the first bit of the first byte, then left-shift that byte by one bit. This yields 64. We extract the first bit of the next byte and OR that 1 into this result, giving 65.
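That u8 step can be written out as a quick sketch (the helper name is mine):

```rust
/// One step of the shift: drop the first (most significant) bit of
/// `cur` and pull in the first bit of `next`.
fn shift_in_next_bit(cur: u8, next: u8) -> u8 {
    // 160 << 1 discards the extracted top bit, yielding 64;
    // `next >> 7` is the next byte's first bit.
    (cur << 1) | (next >> 7)
}

fn main() {
    // [160] [160] -> [65] [...]
    assert_eq!(shift_in_next_bit(0b1010_0000, 0b1010_0000), 65);
    println!("ok");
}
```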

Now consider the big-endian case. Both bytes can be represented in a single chunk:

41120

As before, we extract the first bit to process, then left-shift by one bit and OR in the 1 from the next chunk to get 16705 =

[0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1]

If we instead try that whole operation with a little-endian interpretation of the bytes read from the raw data, then the bit-shifting won't align with our requirements. This is because the internal 'bit-endianness' (in the sense of BitVec) is always big-endian for integers (assuming a left-to-right convention, where 'left' is 'first'). Left-shifting by one bit always doubles the value of an integer (ignoring overflow), never the reverse.
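The big-endian u16 version of the same step, sketched in Rust (the helper name is mine):

```rust
/// Interpret two raw bytes as one big-endian u16 chunk, drop the
/// extracted top bit with a single shift, and OR in the bit arriving
/// from the following chunk.
fn join_chunk(raw: [u8; 2], incoming_bit: u16) -> u16 {
    let chunk = u16::from_be_bytes(raw); // big-endian interpretation
    (chunk << 1) | incoming_bit          // one shift moves all 16 bits
}

fn main() {
    // [1010_0000] [1010_0000] as a big-endian u16 is 41120.
    assert_eq!(u16::from_be_bytes([0b1010_0000, 0b1010_0000]), 41120);
    // One shift plus the incoming 1 gives 16705 = 0b0100_0001_0100_0001.
    assert_eq!(join_chunk([0b1010_0000, 0b1010_0000], 1), 16705);
    println!("ok");
}
```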

~Therefore, for our purposes, if we move away from use of BitVec, we probably need to use a big-endian representation. In a little-endian representation, the shift necessary to move bits 'left' increases the value of the underlying integer, but that same shift moves bits (across byte boundaries) in the direction of decreasing value. It's a little confusing, and this interpretation depends on wanting compatibility with our existing implementation. However, since we initially took this approach in order to clarify semantics and gain confidence in our algorithm, and also for the sake of consistency, that seems like the right decision.~

Does that make sense?

schomatis commented 5 years ago

Does that make sense?

I think that https://github.com/filecoin-project/rust-proofs/pull/435 (the PR containing the cited code) is pretty much aligned with what you're describing here, although it's a WIP; allow me some more time to complete and tidy it up, and I'll re-check it against this description later.

schomatis commented 5 years ago

For the sake of clarity, let's use chunks of u8, in which case endianness is irrelevant.

This seems to be the case, but only because in the example we have two bytes of equal value.

Now consider the big-endian case. Both bytes can be represented in a single chunk:

411200

How is this chunk interpreted? It seems to exceed the 65535 limit of a u16.

I do agree that using larger integers instead of u8 can work, but that would seem to depend on the underlying architecture, i.e., how the processor lays out the bytes to form, say, a u32. (This is the part I don't follow in the argument; I'll give it another read/thought.) It seems that we need a little-endian architecture to take advantage of that big shift: moving the input_pos byte (from the documentation) to the right of the input_pos + 1 byte, which has the bits that are currently being shifted to the left of the ones extracted from input_pos. Since little-endian covers the majority of the platforms (I'm assuming) we'll be running this on, we could do an optimized version that, if it detects that endianness, joins the u8 bytes into a u32/u64.

porcuquine commented 5 years ago

Sorry, typing error — should have been 41120. I edited the original to remove the extra zero.

After thinking about it more, I think you're right that we should match the underlying hardware and choose the direction of the bit shift to be consistent with that. We just need to take care (and write tests ensuring) that we return consistent results in all cases.

Sorry for confusion above and any ongoing. I need to stop and think this all the way through. Better yet, I'll just pay close attention as you do…

porcuquine commented 5 years ago

Okay, after yet a little more thought, I think I have this straight. Let me know if this sounds right:

We must be little-endian by the following logic:

I think this is a different way of stating your point about padding at the beginning or at the end. We want it at the end.

It's not yet clear to me that we cannot use the multi-byte integer optimization even if the underlying architecture is big-endian. As long as we both read and write using little-endian methods, we should be able to manipulate the in-memory integer identically, without worrying about how it is represented.

If I'm missing something, let me know. In any case, I think it's fine to ignore this optimization and start out with u8 if it's simpler.
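The read/write point above can be illustrated with the standard explicit-endianness conversions; with these, the host's byte order never leaks into the arithmetic (a sketch, with an illustrative helper name):

```rust
/// Read a 4-byte chunk with an explicit little-endian interpretation,
/// shift the whole chunk at once, and write it back the same way.
/// The in-register arithmetic is identical on any host architecture.
fn read_shift_write(raw: [u8; 4], k: u32) -> [u8; 4] {
    (u32::from_le_bytes(raw) << k).to_le_bytes()
}

fn main() {
    // The low byte carries the low bits regardless of host endianness.
    assert_eq!(read_shift_write([0x01, 0, 0, 0], 3), [0x08, 0, 0, 0]);
    println!("ok");
}
```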

schomatis commented 5 years ago

I think it's fine to ignore this optimization and start out with u8 if it's simpler.

Agreed, that was my rationale behind #435, lock down the single-byte optimization, with a concrete implementation I think it will be easier to discuss the multi-byte optimization.

It's not yet clear to me that we cannot use the multi-byte integer optimization if the underlying architecture is big-endian.

I'm beginning to suspect we may be thinking about different optimizations (or at least at different levels), especially since I'm new to Rust. The optimization I'm thinking of is at the processor level: instead of manipulating one byte at a time, we load, say, 4 bytes altogether with some LOAD_WORD instruction that takes them and organizes them in a single register, where we can apply just one SHIFT_WORD. In this scenario I think we can agree that only one of the little- or big-endian orderings will lay out the 4 independent bytes in a way that makes the shift work (we can ignore which one in this argument), and that will depend solely on the processor architecture (that is, in one case we'll be able to apply that optimization and in the other we won't).

This thinking assumes that the Rust code will ultimately match roughly those instructions, but since I haven't really studied transmute and related functions, there's a good chance I'm missing a bigger picture here.