Reduce heap memory usage

This PR makes several changes to reduce the peak heap memory usage of Lightning Stream (LS).

The biggest change is the switch from a generated gogo-protobuf parser to a custom parser. The new parser uses a lightweight, low-level abstraction of the protobuf wire format to avoid allocating a large []KV slice that holds all items. Every KV item in this slice used 64 bytes, on top of the reallocation overhead as items were appended, the copied key and value data as individual allocations (in 0.3.0), and GC overhead. Instead, we now parse the protobuf wire data directly as we loop over the items one by one. We also write snapshots directly in protobuf wire format, without an intermediate slice representation.
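To illustrate the idea of walking the wire format item by item, here is a minimal sketch in Go. It is not the parser added in this PR: the schema (a `Snapshot` message with repeated `KV` items in field 1, each with `key` in field 1 and `value` in field 2) and the names `nextField`, `eachKV` and `appendField` are assumptions made for the example, and it only handles length-delimited fields.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Hypothetical schema, for illustration only:
//   message KV       { bytes key = 1; bytes value = 2; }
//   message Snapshot { repeated KV items = 1; }

const wireBytes = 2 // protobuf wire type for length-delimited fields

var errMalformed = errors.New("malformed or unsupported protobuf field")

// appendField writes one length-delimited field directly in wire format.
func appendField(dst []byte, num uint64, payload []byte) []byte {
	dst = binary.AppendUvarint(dst, num<<3|wireBytes)
	dst = binary.AppendUvarint(dst, uint64(len(payload)))
	return append(dst, payload...)
}

// nextField reads one length-delimited field from buf. The returned payload
// is a subslice of buf, not a copy, so no per-item allocation takes place.
func nextField(buf []byte) (num uint64, payload, rest []byte, err error) {
	tag, n := binary.Uvarint(buf)
	if n <= 0 || tag&7 != wireBytes {
		return 0, nil, nil, errMalformed
	}
	buf = buf[n:]
	size, n := binary.Uvarint(buf)
	if n <= 0 || size > uint64(len(buf)-n) {
		return 0, nil, nil, errMalformed
	}
	buf = buf[n:]
	return tag >> 3, buf[:size], buf[size:], nil
}

// eachKV calls fn for every KV item without building a []KV slice first.
func eachKV(snapshot []byte, fn func(key, value []byte)) error {
	for len(snapshot) > 0 {
		num, item, rest, err := nextField(snapshot)
		if err != nil {
			return err
		}
		snapshot = rest
		if num != 1 {
			continue // skip fields other than the repeated items
		}
		var key, value []byte
		for len(item) > 0 {
			n, payload, r, err := nextField(item)
			if err != nil {
				return err
			}
			item = r
			switch n {
			case 1:
				key = payload
			case 2:
				value = payload
			}
		}
		fn(key, value)
	}
	return nil
}

func main() {
	var snapshot []byte
	for _, kv := range [][2]string{{"a", "1"}, {"b", "2"}} {
		item := appendField(nil, 1, []byte(kv[0]))
		item = appendField(item, 2, []byte(kv[1]))
		snapshot = appendField(snapshot, 1, item)
	}
	if err := eachKV(snapshot, func(key, value []byte) {
		fmt.Printf("%s=%s\n", key, value)
	}); err != nil {
		fmt.Println("error:", err)
	}
}
```

Because the callback receives subslices of the input buffer, the only large allocation is the buffer itself. A real parser would also need to handle the other wire types.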

We now limit the number of snapshots that can concurrently be held in memory using the new memory_downloaded_snapshots and memory_decompressed_snapshots configuration options. In the past this number could grow up to the number of unique instance names.
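One common way to enforce such a limit in Go is a counting semaphore built on a buffered channel. The sketch below shows that general technique only; the `limiter` type and `peakConcurrency` function are hypothetical names, and this is not necessarily how the two configuration options are implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// limiter is a counting semaphore: its capacity is the maximum number of
// snapshots that may be held in memory at the same time.
type limiter chan struct{}

func (l limiter) acquire() { l <- struct{}{} } // blocks while the limit is reached
func (l limiter) release() { <-l }

// peakConcurrency runs the given number of workers, each holding one
// "snapshot" while it owns a slot, and reports the highest number of
// snapshots that were held simultaneously.
func peakConcurrency(limit, workers int) int {
	sem := make(limiter, limit)
	var (
		mu         sync.Mutex
		wg         sync.WaitGroup
		held, peak int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem.acquire()
			defer sem.release()

			mu.Lock()
			held++
			if held > peak {
				peak = held
			}
			mu.Unlock()

			snapshot := make([]byte, 1<<20) // stand-in for a downloaded snapshot
			_ = snapshot

			mu.Lock()
			held--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// With a limit of 2, the peak never exceeds 2, however many workers run.
	fmt.Println("peak snapshots in memory:", peakConcurrency(2, 10))
}
```

With this pattern the memory held by snapshots is bounded by the limit times the largest snapshot size, independent of the number of instances.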

We now preallocate byte slices that are large enough to hold all the data that we will generate. This avoids the repeated reallocations and copies that occur when a slice has to grow, roughly doubling in size each time it runs out of capacity.
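The preallocation pattern can be sketched as a two-pass encoder: first compute the exact output size, then allocate once and append. The length-prefixed format and the names `encodeAll` and `uvarintLen` are assumptions for this example, not the code from this PR.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// uvarintLen returns the number of bytes needed to encode v as a uvarint.
func uvarintLen(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// encodeAll writes every item with a uvarint length prefix into a single
// slice whose capacity is computed up front, so append never reallocates.
func encodeAll(items [][]byte) []byte {
	total := 0
	for _, it := range items {
		total += uvarintLen(uint64(len(it))) + len(it)
	}
	out := make([]byte, 0, total) // one allocation of the exact size
	for _, it := range items {
		out = binary.AppendUvarint(out, uint64(len(it)))
		out = append(out, it...)
	}
	return out
}

func main() {
	out := encodeAll([][]byte{[]byte("hello"), []byte("world!")})
	fmt.Printf("len=%d cap=%d\n", len(out), cap(out))
}
```

The extra sizing pass is cheap compared with the copies it avoids, and it keeps the peak at one buffer instead of the old and new buffers that coexist during each reallocation.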

Finally, we immediately force a garbage collection cycle as soon as we have released a large block of memory.

In local testing with 1M test domains and 6M records and snapshots from 3 instances, this reduced the peak RSS from over 10 GB (in v0.3.0) to under 3 GB. With more instances, the savings should be even higher with the new limits in place.

To verify that we write and read the protobuf format correctly, this PR adds tests that perform roundtrips against the old gogo-protobuf code. The proto file will continue to be updated to reflect any new keys, and gogo can automatically generate code to fill fields with test data, as well as methods to compare all fields before and after a roundtrip.

Several other tests and benchmarks have been added.

PowerDNS / lightningstream

Reduce heap memory usage #31