filecoin-project / go-data-transfer

Data Transfer Shared Component for go-filecoin & go-lotus

investigate low performance and CPU hotspot transferring 4GiB over loopback #166


raulk commented 3 years ago

This is an initial observation from running the Whitenoise tests with a perfect network and no interruption rate, over the loopback interface. In other words, these are ideal conditions for a data transfer.

The total payload size transferred was 4GiB.

The transfer stack was go-data-transfer with graphsync, conducting a push from the sender to the receiver.

The hardware used was a MacBook Pro 16", 2.3 GHz 8-core Intel Core i9, 16 GB 2667 MHz DDR4.

Attached is the run result. IMPORTANT: this is really a gzip, but GitHub only allows uploading zips, so rename the extension from .zip to .gz before extracting!

c1ah45gi7qkqo2qcbicg.zip

Of special interest are the events files, which capture the subscription stream from go-data-transfer from both the sender and receiver. Very useful for debugging.

There are heap profiles and CPU profiles included too (using Testground's new profiling feature) for further digging.

So far, I found what appears to be a large hotspot on the receiving side. Addressing this hotspot could allow the receiver to ACK data sooner, and thus speed up the data flow.

The hotspot seems to be in the "CID lists" feature, which I'm not familiar with.

[CPU profile image: profile001]

NOTE: If you run the test, make sure you use the feat/tmp-dir branch of Testground, as Whitenoise depends on the new temporary directory feature. This is also noted in the Whitenoise README.

dirkmc commented 3 years ago

When go-data-transfer receives a block, it adds the CID to a list. If the data transfer is restarted, e.g. because the connection went down, the receiving side sends the sender a list of all the CIDs it has received, so that the sender can skip over those CIDs and resend only the missing blocks.
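For illustration, here's a minimal sketch of that bookkeeping (the types and names are hypothetical, not go-data-transfer's actual implementation):

```go
package cidlist

import (
	cid "github.com/ipfs/go-cid"
)

// receivedCIDs sketches the receiver-side bookkeeping described above.
type receivedCIDs struct {
	order []cid.Cid
	seen  map[cid.Cid]struct{}
}

func newReceivedCIDs() *receivedCIDs {
	return &receivedCIDs{seen: make(map[cid.Cid]struct{})}
}

// onBlock records the CID of each block as it arrives.
func (r *receivedCIDs) onBlock(c cid.Cid) {
	if _, ok := r.seen[c]; ok {
		return // duplicate block, already recorded
	}
	r.seen[c] = struct{}{}
	r.order = append(r.order, c)
}

// restartMessage is what the receiver sends after a restart. Note that it
// grows linearly with the number of blocks already transferred, which is
// the scaling concern raised below.
func (r *receivedCIDs) restartMessage() []cid.Cid {
	return r.order
}
```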

I'm not sure why this is showing up in the profile. Are you running multiple tests, with a new datastore each time? The CID list is kept on disk, so it may be that each time you do a test you're writing more and more CIDs to the list, and each time you start the run you're reading the CIDs from all previous tests from the file.

Thinking about CID lists, we may want to consider storing the list as a Bloom filter. Then if there's a restart during the transfer, we wouldn't need to send all the individual CIDs as a list; we could just send the Bloom filter.
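As an illustration of the idea (not a proposed patch), here's a sketch using the bits-and-blooms/bloom package:

```go
package main

import (
	"fmt"

	"github.com/bits-and-blooms/bloom/v3"
	cid "github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

func main() {
	// Size for ~1M CIDs at a 1% false-positive rate: the filter stays a
	// fixed ~1.2 MB however many CIDs are added, vs. ~36 bytes per CID
	// for a plain list.
	filter := bloom.NewWithEstimates(1_000_000, 0.01)

	// As each block arrives, add its CID to the filter instead of a list.
	c := mustCid([]byte("some block bytes"))
	filter.Add(c.Bytes())

	// On restart, the receiver sends the filter (it can be serialized with
	// WriteTo/ReadFrom) and the sender skips CIDs that test positive.
	fmt.Println("already received?", filter.Test(c.Bytes()))
}

func mustCid(data []byte) cid.Cid {
	h, err := mh.Sum(data, mh.SHA2_256, -1)
	if err != nil {
		panic(err)
	}
	return cid.NewCidV1(cid.Raw, h)
}
```

The tradeoff is that Bloom filters have false positives: the sender could wrongly skip a block the receiver never got, so the protocol would need a way to detect and re-request those blocks.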

raulk commented 3 years ago

This is just a single test, nothing fancy. Things should be configured the way we set them up in Lotus. Could you try running the basic composition in the whitenoise tests and see if you can reproduce?

raulk commented 3 years ago

Yeah, and sending all CIDs scales very poorly. We should never do that!

hannahhoward commented 3 years ago

Hi @raulk -- you're decoding the list of received CIDs every time you get a transfer event. For now, can you remove this line and run again? https://github.com/raulk/whitenoise/blob/master/testplan/main.go#L128. I think the hotspot here may not reflect our regular use. I agree the CID lists need attention.

hannahhoward commented 3 years ago

That's a file I/O decode and read on every event.
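In other words, the expensive work should move out of the hot event path. A hedged sketch of the cheap alternative: do O(1) bookkeeping inside the event subscriber and keep file I/O out of the hot path. The counter logic is illustrative, not the actual whitenoise code; the subscription API is go-data-transfer's.

```go
package eventcount

import (
	"sync/atomic"

	datatransfer "github.com/filecoin-project/go-data-transfer"
)

// trackReceived keeps an in-memory counter inside the event subscriber
// instead of decoding the on-disk received-CID list on every event.
func trackReceived(mgr datatransfer.Manager) (received *uint64, unsub func()) {
	var count uint64
	unsubscribe := mgr.SubscribeToEvents(func(ev datatransfer.Event, st datatransfer.ChannelState) {
		if ev.Code == datatransfer.DataReceived {
			atomic.AddUint64(&count, 1) // O(1) per event, no file I/O
		}
	})
	return &count, unsubscribe
}
```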

hannahhoward commented 3 years ago

btw, alternatives to storing CID lists could be as simple as "skip N blocks" if we're just trying to resume.
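A minimal sketch of that alternative, assuming the selector traversal is deterministic so both sides see blocks in the same order (the function is hypothetical):

```go
package resume

import (
	cid "github.com/ipfs/go-cid"
)

// blocksToResend sketches the "skip N blocks" idea: on restart the receiver
// reports only how many blocks it already has, and the sender resumes the
// deterministic traversal from that offset, making the restart message O(1)
// instead of O(blocks transferred).
func blocksToResend(traversalOrder []cid.Cid, alreadyReceived int) []cid.Cid {
	if alreadyReceived >= len(traversalOrder) {
		return nil // nothing left to send
	}
	return traversalOrder[alreadyReceived:]
}
```

This only works if received blocks are persisted strictly in traversal order; any gaps would break a count-based resume.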

dirkmc commented 3 years ago

https://github.com/raulk/whitenoise/pull/1 updates whitenoise to point at https://github.com/filecoin-project/go-data-transfer/pull/217 which eliminates the received CIDs bottleneck.

The transfer is still relatively slow on my machine: 1m41s to transfer 4GB, i.e. about 40 MB/s.

My machine specs:

The bottleneck is no longer coming from reading received CIDs. It seems likely it's from hashing.
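One quick way to check that hypothesis: benchmark raw SHA-256 throughput on the same machine, since verifying each block's CID (for the common sha2-256 CIDs) means hashing every byte transferred. A sketch:

```go
package hashbench

import (
	"crypto/sha256"
	"testing"
)

// BenchmarkSHA256 measures single-core SHA-256 throughput over 1 MiB
// buffers; `go test -bench=.` reports MB/s thanks to SetBytes. If this
// number is near the observed ~40 MB/s transfer rate, hashing is a
// plausible ceiling; if it is far higher, the bottleneck is elsewhere.
func BenchmarkSHA256(b *testing.B) {
	buf := make([]byte, 1<<20)
	b.SetBytes(int64(len(buf)))
	for i := 0; i < b.N; i++ {
		sha256.Sum256(buf)
	}
}
```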

Next steps:

CPU profile: [image: pprof001]

prof-rcvd-cids-fix.zip

aarshkshah1992 commented 3 years ago

@dirkmc I re-ran your branch on my machine and the hashing bottleneck doesn't really show up on the sender or the receiver.

However, I will re-run this with GS v0.8.0 tomorrow to see if it makes a meaningful difference.

gstest.zip

dirkmc commented 3 years ago

@aarshkshah1992 your suspicion seems to be correct: updating to graphsync v0.8.0 doesn't seem to make much difference. I still get a transfer speed of about 40 MB/s. The CPU profile also looks similar:

[CPU profile image: graphsync-v0.8]

badger-gs-v0.8.zip

dirkmc commented 3 years ago

Still with graphsync v0.8 but now with an in-memory datastore:

[CPU profile image: pprof-in-mem-gs-v0.8]

in-mem-gsv0.8.zip

dirkmc commented 3 years ago

Results for graphsync v0.7 with an in-memory datastore:

The transfer time is comparable to graphsync v0.8: 4GB takes 45s, about 95 MB/s. The CPU profile also looks similar:

[CPU profile image: gs-v0.7-cpu-pprof]

gs-0.7-inmem-prof.zip

hannahhoward commented 3 years ago

So folks, I'm just going to leave my results here.

Here are my three runs. They were run on a Mac Mini M1 with 16GB of RAM and an SSD. I closed most of my apps before running. I used Badger only.

Right off @raulk's master branch, running v0.8.0 (v0.8.1-rc), with batcher. Total time: 8.005984208s for 4GB, which comes in at about 500 MB/s. Full testground output: c34ihmimvdvm12r7rd2g.zip

Dirk's v0.7.0 branch, using Badger. Total time: 23.006475083s for 4GB, which comes in at about 173 MB/s. Full testground output:

c34ii6amvdvm12r7rd30.zip

My own branch in PR https://github.com/raulk/whitenoise/pull/4, with Badger. Total time: 8.004791959s, again with 4GB transferred, again at about 500 MB/s. It actually makes no difference to the total time, but it resolves the CPU bottleneck.

c34ig0qmvdvm12r7rd10.zip

So, both 0.8.1 (unmodified, master branch of this repo) and my branch with the SHA fix show a nearly threefold increase on my machine.

The big difference between versions appeared after I closed all the apps on my machine -- I suspect that even with Badger we are consuming a lot of memory and going to swap space. Running with memory stores on both client and provider saturates my available RAM, which makes things run much slower for me.

The only thing I can offer is: if you are running locally, close your other apps. That made a huge difference for me.

Happy to post heap profile images, though they are in the ZIP files.

dirkmc commented 3 years ago

To remove any noise from other processes running on a laptop, I ran the profiles on a cloud machine: an AWS c5d.12xlarge (48 vCPU / 96 GB RAM).

I ran

I've attached

hannahhoward commented 3 years ago

Some thoughts on the discrepancy between your Linux node and my node: