blazegraph / database

Blazegraph High Performance Graph Database

Slow Performance, Low Resource Utilization #200

Open shadefc09 opened 3 years ago

shadefc09 commented 3 years ago

I have bigdata running on a high performance computer and am not seeing the resource usage I expect.

I'm running a program that loads RDF documents into the store. With just a few threads running, say 6, the store keeps up. However, when I increase to 30, 50, or 400 threads, the store begins to have trouble keeping up. At 400 threads there is a 30-second delay between the call to insert data into bigdata and the response.

During this time, the load average never exceeds 8 and RAM usage is around 30 or 40 GB. This is on a machine with 32 cores (4 x Intel Xeon Gold 6234) and 378 GB of RAM (DDR4, 2933 MHz). Disk I/O is 800 MB/sec (after encryption; before encryption it is ~3 GB/sec). I would expect to see a lot more of the resources consumed before performance takes a hit.

I have been tuning the branching factors and other items mentioned in the documentation. I have a couple of questions that I could not find answers to in the wiki page:

  1. The documentation mentions some items relating to RAM size and offers generic suggestions for machines with a large amount of RAM. For a machine with this much RAM, what would be appropriate values for the buffer capacity, write retention queue capacity, and write cache buffers?
  2. I have not found any reference to increasing CPU utilization. Is there a property that allows me to choose the number of threads used per operation (something like what is done in PostgreSQL) or overall?
  3. I checked connection limitation properties for Jetty and the default configuration is unlimited. Is there a network limitation configuration, either through BigData or Jetty, that I need to configure to increase the number of allowed simultaneous connections?

shadefc09 commented 3 years ago

Update to my last comment: the load average has been <3 for quite a while now, and that's with a well-tuned PostgreSQL db handling many connections from several projects running at the same time. BigData has accounted for <1 of the load average, from what I've seen (despite trying to load several hundred RDF documents at a time).

shadefc09 commented 3 years ago

I've checked the network utilization and it looks like at most only 150 Mb/s of 1 Gb/s is being used, so I don't think the network is the bottleneck.

ben-j-herbertz commented 3 years ago

What kind of data are you trying to load?

Why do you need 400 Threads?

What performance do you expect?

shadefc09 commented 3 years ago

I'm currently trying to load ~15,000,000 RDF documents generated from NLP processing. It only took ~6 days to process the documents, and I calculate from the current rate that it will take ~78 days to load them into BigData. In the long run, I need to be able to load the data as it is processed, so I need to get the loading rate up to 50-60 documents/sec.

When doing our NLP processing, we can use as many threads as we can get (the most we've done is ~250, but we could do up to 1,250 with our current resources). We have been using a different graph database for many years and wanted to switch to something faster and more scalable. If we need to branch out to multiple shards in the future, then we are ready to do so. But first we would like to see the resources currently available to BigData be used before adding more.

shadefc09 commented 3 years ago

We are loading RDF data. We have a highly parallelized application running on a high performance node that scales as we need it for processing, and we are loading that data into BigData. Each thread is an individual record being processed in isolation, so with 250 threads we have 250 documents being processed at once. We must then load those results into BigData. The reason we are using so many threads is that the end of each NLP thread loads BigData with the results, and we are trying to catch up because we migrated to blazegraph and need to process our entire backlog. What kind of performance should we expect? We did not anticipate a 30-second round trip...

From what I can tell, CPU, memory, disk I/O, and network I/O are all low, but we are still not getting documents in as fast as we need to. I would expect at least one resource to be maxed out before performance begins to take a hit.

shadefc09 commented 3 years ago

@thompsonbry - do you have any suggestions for configurations on such a machine to fully utilize the available resources to improve performance when loading data?

Would starting multiple shards on the same machine using the scale-out capability improve performance and resource consumption?

thompsonbry commented 3 years ago

Edward sometimes posts here. He has done a lot of analysis of bottlenecks in the RWStore in the past and can offer some insights. You might try searching on the list for his messages.

Bryan

shadefc09 commented 3 years ago

I searched the issues for "edward" and this is the only one that came up. Is that the list you're talking about?

I'm not sure who you're talking about to tag them here, can you tag them for me?

thompsonbry commented 3 years ago

Sorry. Edwin.

shadefc09 commented 3 years ago

I searched the issues for "edwin" and this is still the only one that came up.

thompsonbry commented 3 years ago

When you say this, I don’t see a link?

thompsonbry commented 3 years ago

I found this title

[blazegraph/database] Multiple Servers on one Wikidata instance (#172)

shadefc09 commented 3 years ago

sorry, by "this" I meant this issue we're currently in. Thanks for the link!

shadefc09 commented 3 years ago

@edwinchoi - I looked at the post above and see that you were able to get "15k mutations/sec". I'm currently loading <3 documents/sec, each with 100s-1000s of statements. This would be at most <3k individual mutations/sec.

What were your various configurations for your deployment (i.e. buffer capacity, write retention queue, heap size, max direct memory size, etc.)?

shadefc09 commented 3 years ago

I'm just trying to figure out how to configure BigData to utilize the available resources.

For example, the Performance Optimization page talks about BigdataSail.Options.BUFFER_CAPACITY. When I click on that link, it says "The best performance will probably come from small (10k - 20k) buffer capacity values combined with a queueCapacity of 5-20. Larger values will increase the GC burden and could require a larger heap, but the net throughput might also increase." But on the performance page the example property is com.bigdata.rdf.sail.bufferCapacity=100000. So would the "queueCapacity" be set by com.bigdata.rdf.sail.queueCapacity?
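
If I'm reading the javadoc right, both settings would go in the journal properties file. A hedged sketch of what I think that would look like, assuming the queueCapacity property name mirrors the bufferCapacity one shown on the performance page (values are only illustrative):

```
# smaller statement buffer, moderate number of buffers queued for incremental write
com.bigdata.rdf.sail.bufferCapacity=20000
com.bigdata.rdf.sail.queueCapacity=10
```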

This is just one example of my confusion with the configuration options. There is some information about adjusting parameters to increase disk and memory utilization and performance, but the recommendations don't appear to be aimed at resources as large as I have available, and the details of the exact properties can be confusing (like above). Also, I've seen people point to the javadocs, like the link above, which makes me think there are a lot more settings than what is described in the wiki, but I don't know how to use them.

Also, the only configurations I've seen in the wiki have to do with disk IO and memory. Are there any configurations to increase the CPU utilization?

GreenReaper commented 3 years ago

I noticed that's a 4-socket machine - perhaps -XX:+UseNUMA would help? Apologies if you've already considered/done this; I was just thinking that loads like this might involve large memory transfers, and it'd be better to have that access be local rather than remote.
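
A minimal sketch of what that would look like, added to whatever command or service script launches the Blazegraph JVM (jar path and heap size here are placeholders):

```
# hypothetical launch line; adjust jar path, heap, and GC settings to the deployment
# note: UseNUMA mainly takes effect with the Parallel collector (and G1 on newer JDKs)
java -server -Xmx16g -XX:+UseNUMA -jar blazegraph.jar
```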

There also seem to be some write performance optimizations mentioned in #114. I don't know if any apply to your situation.

(It'd be nice if the wikis were editable for people to put stuff in; however, I understand they got spammed like JIRA.)

edwinchoi commented 3 years ago

Blazegraph is highly latency-sensitive. What matters is not how many bytes you can write to disk, but how long it takes to write one byte (not literally). Writes are randomized, and POSIX I/O is slow.

The sail buffer capacity impacts how many statements you're able to buffer before the connection writes your changes to disk.

  1. Don't use Java's Reader implementations. They generate a lot of garbage. I believe Jackson has an implementation, as does Xerces, that can take an InputStream and read UTF-8 encoded data without generating an insane amount of garbage.
  2. Sesame's RDF parsers are extremely inefficient. I ended up rewriting them to use a shared StringBuilder where appropriate, then found that RDF4J does the same. Look at their code-base for inspiration.
  3. Don't expect it to be fast out of the box. You have to optimize the code for your specific use case. For example, there's the option of inlining URIs - this did not help us whatsoever because of the shape of our data.
  4. There is ultimately a single point of contention: RWStore. Writes are always handled by a single thread, so if your utilization is <1, my guess would be that your I/O is too slow. There is a way to disable fsync, but that comes with the dangers of non-durable writes.
  5. Use JMC to profile the application (see the sketch after this list). It won't give you any native frames, but monitor allocations. If you see a lot of byte[] or char[] allocations (I'm talking many orders of magnitude higher than anything else), see if you can get those down. JMC will also give you a picture of the efficiency (time spent executing vs. GCing).
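
On point 5, a minimal sketch of capturing a recording that JMC can open, assuming a JDK with Flight Recorder available (duration and filename are placeholders):

```
# start a 15-minute recording at JVM startup and dump it to a file for JMC
# on Oracle JDK 8 this additionally needs -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
java -XX:StartFlightRecording=duration=15m,filename=blazegraph-load.jfr \
     -jar blazegraph.jar
```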

Are you running off SSD over SATA, NVMe, or HDD? Run fio with purely random writes to see what the latency is; whatever that figure is will be proportional to the rate at which you're able to insert.
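
Something along these lines (a sketch; the file path, size, and block size are assumptions to adjust for the volume holding the journal):

```
# single-threaded 4k random writes with an fsync per write, to approximate journal write latency
fio --name=randwrite --filename=/data/fio-test --size=1g \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=1 --fsync=1
```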

The rate you're getting seems awfully slow... and the optimizations won't get you far. It seems like something else is going on. Personally, we do not use DataLoader - we hand-wrote our own loader that bypasses the term buffers in the sail connection. That would still not explain the orders-of-magnitude difference that you're observing. Edit: our custom loader also separates the parsing of the content from the mutation. I believe a change was made to DataLoader to do this as well, but little things like interning terms go a long way.

@GreenReaper that flag did not make much of a difference in our case. After running Intel's Memory Latency Checker, the cross-NUMA-node latency, while higher, was still within reason. It could explain minor variations in performance, but unless the xbar impl for the NUMA nodes was such that n0 had to go through n1 to read n2's memory, you have a single link's variance in performance. And again, CPU and memory were not what hurt performance the most... it was writes to disk.

If you're on Linux, run iostat -tmxy 3s while the load is happening.
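
For reference, a sketch of the invocation and the columns worth watching (assuming the sysstat iostat):

```
# extended per-device stats in MB, every 3 seconds, skipping the since-boot summary (-y)
iostat -tmxy 3
# watch w_await (average write latency, ms) and %util on the device backing the journal
```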

Edit: also check how your writes are organized. If you aren't clustering on S, P, or O... then you're entering the pathological case for every index. Increasing the writeRetentionQueue would help, but you'd have to swap out the impl to use an LRU cache instead of a ring buffer (required ordering is still guaranteed). What does the profile of your data look like? How many distinct S? How many distinct P? Do S and P co-occur? And how many distinct O? Are the O values links, mostly plain literals, or numbers?
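
(For reference, a hedged example of the knob being referred to, using the property name that appears in the stock sample RWStore.properties files; the value is only illustrative:)

```
# larger retention queue keeps more dirty B+Tree nodes in memory before they are evicted and written
com.bigdata.btree.writeRetentionQueue.capacity=8000
```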

I'm also assuming that you do not have justifications enabled. You should be running with RWStore and quads mode, with free text search disabled (if you need that capability, then disable it for now at least to isolate the cause).
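
A hedged sketch of the relevant journal properties, with names taken from the stock sample properties files shipped with Blazegraph (verify them against your own configuration):

```
# RWStore-backed journal in quads mode
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
com.bigdata.rdf.store.AbstractTripleStore.quads=true
# disable justifications/inference and free text search while isolating the bottleneck
com.bigdata.rdf.store.AbstractTripleStore.justify=false
com.bigdata.rdf.sail.truthMaintenance=false
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false
```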