filecoin-project / go-data-transfer

Data Transfer Shared Component for go-filecoin & go-lotus

investigate low performance and CPU hotspot transferring 4GiB over loopback #166


raulk commented 3 years ago

This is an initial observation from running the Whitenoise tests with a perfect network and no interruption rate, over the loopback interface. In other words, these are ideal conditions for a data transfer.

The total payload size transferred was 4GiB.

The transfer stack was go-data-transfer with graphsync, conducting a push from the sender to the receiver.

The hardware used was a MacBook Pro 16", 2.3 GHz 8-core Intel Core i9, 16 GB 2667 MHz DDR4.

Attached is the run result. IMPORTANT: this is really a gzip, but GitHub only allows uploading zips, so rename the extension from .zip to .gz before extracting!

c1ah45gi7qkqo2qcbicg.zip

Of special interest are the events files, which capture the subscription stream from go-data-transfer from both the sender and receiver. Very useful for debugging.

There are heap profiles and CPU profiles included too (using Testground's new profiling feature) for further digging.

So far, I found what appears to be a large hotspot on the receiving side. Addressing this hotspot could allow the receiver to ACK data sooner, and thus speed up the data flow.

The hotspot seems to be in the "CID lists" feature, which I'm not familiar with.

[CPU profile image: profile001]

NOTE: If you run the test, make sure you use the feat/tmp-dir branch of Testground, as Whitenoise depends on the new temporary directory feature. This is also noted in the Whitenoise README.

dirkmc commented 3 years ago

When go-data-transfer receives a block, it adds the CID to a list. If the data transfer is restarted, e.g. because the connection went down, the receiving side sends the sender a list of all the CIDs it has received, so that the sender can skip over those CIDs and resend only the missing blocks.
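For illustration, here's a minimal sketch of that bookkeeping (the types and names are hypothetical, not go-data-transfer's actual implementation):

```go
package cidlist

import (
	cid "github.com/ipfs/go-cid"
)

// receivedCIDs sketches the receiver-side bookkeeping described above.
type receivedCIDs struct {
	order []cid.Cid
	seen  map[cid.Cid]struct{}
}

func newReceivedCIDs() *receivedCIDs {
	return &receivedCIDs{seen: make(map[cid.Cid]struct{})}
}

// onBlock records the CID of each block as it arrives.
func (r *receivedCIDs) onBlock(c cid.Cid) {
	if _, ok := r.seen[c]; ok {
		return // duplicate block, already recorded
	}
	r.seen[c] = struct{}{}
	r.order = append(r.order, c)
}

// restartMessage is what the receiver sends after a restart. Note that it
// grows linearly with the number of blocks already transferred, which is
// the scaling concern raised below.
func (r *receivedCIDs) restartMessage() []cid.Cid {
	return r.order
}
```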

I'm not sure why this is showing up in the profile. Are you running multiple tests, with a new datastore each time? The CID list is kept on disk, so it may be that each time you do a test you're writing more and more CIDs to the list, and each time you start the run you're reading the CIDs from all previous tests from the file.

Thinking about CID lists, we may want to consider storing the list as a Bloom filter. Then if there's a restart during the transfer, we wouldn't need to send all the individual CIDs as a list; we could just send the Bloom filter.
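As an illustration of the idea (not a proposed patch), here's a sketch using the bits-and-blooms/bloom package:

```go
package main

import (
	"fmt"

	"github.com/bits-and-blooms/bloom/v3"
	cid "github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

func main() {
	// Size for ~1M CIDs at a 1% false-positive rate: the filter stays a
	// fixed ~1.2 MB however many CIDs are added, vs. ~36 bytes per CID
	// for a plain list.
	filter := bloom.NewWithEstimates(1_000_000, 0.01)

	// As each block arrives, add its CID to the filter instead of a list.
	c := mustCid([]byte("some block bytes"))
	filter.Add(c.Bytes())

	// On restart, the receiver sends the filter (it can be serialized with
	// WriteTo/ReadFrom) and the sender skips CIDs that test positive.
	fmt.Println("already received?", filter.Test(c.Bytes()))
}

func mustCid(data []byte) cid.Cid {
	h, err := mh.Sum(data, mh.SHA2_256, -1)
	if err != nil {
		panic(err)
	}
	return cid.NewCidV1(cid.Raw, h)
}
```

The tradeoff is that Bloom filters have false positives: the sender could wrongly skip a block the receiver never got, so the protocol would need a way to detect and re-request those blocks.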

raulk commented 3 years ago

This is just a single test, nothing fancy. Things should be configured the way we set them up in Lotus. Could you try running the basic composition in the whitenoise tests and see if you can reproduce?

raulk commented 3 years ago

Yeah, and sending all CIDs scales very poorly. We should never do that!

hannahhoward commented 3 years ago

Hi @raulk -- you're decoding the list of received CIDs every time you get a transfer event. For now, can you remove this line and run again? https://github.com/raulk/whitenoise/blob/master/testplan/main.go#L128. I think the hotspot here may not reflect our regular use. I agree the CID lists need attention.

hannahhoward commented 3 years ago

That's a file I/O decode and read on every event.
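In other words, the expensive work should move out of the hot event path. A hedged sketch of the cheap alternative: do O(1) bookkeeping inside the event subscriber and keep file I/O out of the hot path. The counter logic is illustrative, not the actual whitenoise code; the subscription API is go-data-transfer's.

```go
package eventcount

import (
	"sync/atomic"

	datatransfer "github.com/filecoin-project/go-data-transfer"
)

// trackReceived keeps an in-memory counter inside the event subscriber
// instead of decoding the on-disk received-CID list on every event.
func trackReceived(mgr datatransfer.Manager) (received *uint64, unsub func()) {
	var count uint64
	unsubscribe := mgr.SubscribeToEvents(func(ev datatransfer.Event, st datatransfer.ChannelState) {
		if ev.Code == datatransfer.DataReceived {
			atomic.AddUint64(&count, 1) // O(1) per event, no file I/O
		}
	})
	return &count, unsubscribe
}
```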

hannahhoward commented 3 years ago

btw, alternatives to storing CID lists could be as simple as "skip N blocks" if we're just trying to resume.
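A minimal sketch of that alternative, assuming the selector traversal is deterministic so both sides see blocks in the same order (the function is hypothetical):

```go
package resume

import (
	cid "github.com/ipfs/go-cid"
)

// blocksToResend sketches the "skip N blocks" idea: on restart the receiver
// reports only how many blocks it already has, and the sender resumes the
// deterministic traversal from that offset, making the restart message O(1)
// instead of O(blocks transferred).
func blocksToResend(traversalOrder []cid.Cid, alreadyReceived int) []cid.Cid {
	if alreadyReceived >= len(traversalOrder) {
		return nil // nothing left to send
	}
	return traversalOrder[alreadyReceived:]
}
```

This only works if received blocks are persisted strictly in traversal order; any gaps would break a count-based resume.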

dirkmc commented 3 years ago

https://github.com/raulk/whitenoise/pull/1 updates whitenoise to point at https://github.com/filecoin-project/go-data-transfer/pull/217 which eliminates the received CIDs bottleneck.

The transfer is still relatively slow on my machine: 1m41s to transfer 4GB, i.e. about 40 MB/s.

My machine specs:

The bottleneck is no longer coming from reading received CIDs. It seems likely it's from hashing.
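One quick way to check that hypothesis: benchmark raw SHA-256 throughput on the same machine, since verifying each block's CID (for the common sha2-256 CIDs) means hashing every byte transferred. A sketch:

```go
package hashbench

import (
	"crypto/sha256"
	"testing"
)

// BenchmarkSHA256 measures single-core SHA-256 throughput over 1 MiB
// buffers; `go test -bench=.` reports MB/s thanks to SetBytes. If this
// number is near the observed ~40 MB/s transfer rate, hashing is a
// plausible ceiling; if it is far higher, the bottleneck is elsewhere.
func BenchmarkSHA256(b *testing.B) {
	buf := make([]byte, 1<<20)
	b.SetBytes(int64(len(buf)))
	for i := 0; i < b.N; i++ {
		sha256.Sum256(buf)
	}
}
```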

Next steps:

CPU profile: [image: pprof001]

prof-rcvd-cids-fix.zip

aarshkshah1992 commented 3 years ago

@dirkmc I re-ran your branch on my machine and the hashing bottleneck doesn't really show up on the sender or the receiver.

However, I will re-run this with GS v0.8.0 tomorrow to see if it makes a meaningful difference.

gstest.zip

dirkmc commented 3 years ago

@aarshkshah1992 your suspicion seems to be correct: updating to graphsync v0.8.0 doesn't seem to make much difference. I still get a transfer speed of about 40 MB/s. The CPU profile also looks similar:

[CPU profile image: graphsync-v0.8]

badger-gs-v0.8.zip

dirkmc commented 3 years ago

Still with graphsync v0.8 but now with an in-memory datastore:

[CPU profile image: pprof-in-mem-gs-v0.8]

in-mem-gsv0.8.zip

dirkmc commented 3 years ago

Results for graphsync v0.7 with an in-memory datastore:

The transfer time is comparable to graphsync v0.8: 4GB takes 45s, about 95 MB/s. The CPU profile also looks similar:

[CPU profile image: gs-v0.7-cpu-pprof]

gs-0.7-inmem-prof.zip

hannahhoward commented 3 years ago

So folks, I'm just going to leave my results here.

Here are my three runs. They were run on a Mac Mini M1 with 16GB of RAM and an SSD. I closed most of my apps before running. I used Badger only.

Right off @raulk's master branch, running v0.8.0 (v0.8.1-rc), with batcher. Total time: 8.005984208s for 4GB, which comes in at about 500 MB/s. Full testground output: c34ihmimvdvm12r7rd2g.zip

Dirk's v0.7.0 branch, using Badger. Total time: 23.006475083s for 4GB, which comes in at about 173 MB/s. Full testground output:

c34ii6amvdvm12r7rd30.zip

My own branch in PR https://github.com/raulk/whitenoise/pull/4, with Badger. Total time: 8.004791959s, again with 4GB transferred, again at about 500 MB/s. It actually makes no difference to the total time, but it resolves the CPU bottleneck.

c34ig0qmvdvm12r7rd10.zip

So, both 0.8.1 (unmodified, master branch of this repo) and my branch with the SHA fix show a nearly threefold increase on my machine.

The big difference between versions appeared after I closed all the apps on my machine -- I suspect that even with Badger we are consuming a lot of memory and going to swap space. Running with memory stores on both client and provider saturates my available RAM, which makes things run much slower for me.

The only thing I can offer is: if you are running locally, close your other apps. That made a huge difference for me.

Happy to post heap profile images, though they are in the ZIP files.

dirkmc commented 3 years ago

To remove any noise from other processes running on a laptop, I ran the profiles on a cloud machine: an AWS c5d.12xlarge (48 vCPU / 96 GB RAM).

I ran

I've attached

hannahhoward commented 3 years ago

Some thoughts on the discrepancy between your Linux node and my node: