blazegraph / database

Blazegraph High Performance Graph Database

Multiple Servers on one Wikidata instance #172

Open jecummin opened 4 years ago

jecummin commented 4 years ago

I've been working pretty extensively with local copies of Wikidata for the past year or so, trying to find ways to efficiently perform large-scale augmentations and deletions of the data, which lives in a single journal file, wikidata.jnl. I've tried parallelizing my jobs so that multiple requests hit the server simultaneously, but that doesn't improve overall runtime: the processes bottleneck while they wait for the server to work through the queued requests.
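
For concreteness, here's a minimal sketch of the fan-out I'm attempting (the endpoint URL and the update strings are illustrative placeholders, not my real jobs):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;

    public class ParallelUpdates {
        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            // Placeholder updates; the real jobs generate millions of these.
            List<String> updates = List.of(
                "INSERT DATA { <http://example.org/s1> <http://example.org/p> \"a\" }",
                "DELETE DATA { <http://example.org/s2> <http://example.org/p> \"b\" }");

            List<CompletableFuture<HttpResponse<String>>> inFlight = updates.stream()
                .map(u -> HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9999/blazegraph/sparql"))
                    .header("Content-Type", "application/sparql-update")
                    .POST(HttpRequest.BodyPublishers.ofString(u))
                    .build())
                .map(req -> client.sendAsync(req, HttpResponse.BodyHandlers.ofString()))
                .collect(Collectors.toList());

            // The requests go out concurrently, but the server serializes the
            // writes internally, so wall-clock time barely improves.
            inFlight.forEach(CompletableFuture::join);
        }
    }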

I tried to start an identical copy of the server on a different port, hoping that two servers could process insertion/deletion requests against the same database (wikidata.jnl). The second server starts up but raises java.lang.Error: Two allocators at same address during startup, and any requests to it fail with HTTP Error 503. Clearly, the two servers cannot both operate on the same database.

Are there any changes I could make to the configuration to allow this? The traceback passes through some web-app files; if it's the web application's configuration that disallows simultaneous servers, could I disable it for the second server? I have no need for the second server to serve the web application.

This might be similar to https://github.com/blazegraph/database/issues/171

edwinchoi commented 3 years ago

You cannot safely modify shared data concurrently without some form of synchronization. Within a single process it's easy: Java provides the synchronization primitives. Across multiple processes, there is no database system I know of that lets multiple processes share ownership of the same data.
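
To illustrate the cross-process case: the usual mechanism is advisory file locking, which lets a second process fail fast instead of corrupting shared state. A minimal sketch in plain Java (not Blazegraph code):

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class SingleOwner {
        public static void main(String[] args) throws IOException {
            FileChannel ch = FileChannel.open(Path.of("wikidata.jnl"),
                    StandardOpenOption.READ, StandardOpenOption.WRITE);
            // Exclusive, advisory lock: returns null if another process holds it.
            FileLock lock = ch.tryLock();
            if (lock == null) {
                System.err.println("journal already owned by another process");
                return;
            }
            try {
                // ... only one process at a time reaches this point ...
            } finally {
                lock.release();
                ch.close();
            }
        }
    }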

Blazegraph does not allow multiple writers on the same instance; by starting multiple processes you're circumventing its guards (your OS/file system likely doesn't support flock, otherwise the second process would have failed with a completely different exception). Depending on what consistency guarantees you require, and on how your data is organized (ideally multiple independent graphs), you could partition across data boundaries. I would steer clear of their "scale-out" mode: in my case its performance was several orders of magnitude worse.
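
Concretely, partitioning means fully independent instances, each with its own journal and its own port. A sketch, assuming the stock blazegraph.jar launcher (paths are illustrative, and I'm going from memory on the bigdata.propertyFile and jetty.port knobs):

    # instance-a/RWStore.properties
    com.bigdata.journal.AbstractJournal.file=/data/partition-a.jnl

    # instance-b/RWStore.properties
    com.bigdata.journal.AbstractJournal.file=/data/partition-b.jnl

    # Two servers, two journals, two ports -- no shared state:
    java -server -Xmx8g -Dbigdata.propertyFile=instance-a/RWStore.properties \
         -jar blazegraph.jar                      # default port 9999
    java -server -Xmx8g -Dbigdata.propertyFile=instance-b/RWStore.properties \
         -Djetty.port=9998 -jar blazegraph.jar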

Re: performance issues, my company has been using Blazegraph for several years now, and handling incremental changes and data growth without scaling vertically has been a recurring struggle. Our current box is 28 cores, 512 GB RAM, and a RAID10 SSD array for a knowledge base of roughly 3 billion quads. We have to throttle any sudden influx of data, since it can cause backlogs that take hours to clear; on average we see 15k mutations/sec (inserts and removes). If we ever had to rebuild from scratch after a catastrophic failure, the load would take several weeks. That did happen when we ran into #114: the lack of overflow detection allows the engine to overwrite live data, leading to unrecoverable corruption (root-block corruption shows up quickly; allocation-state corruption only shows up at query time).

I spent a considerable amount of time finding the bottlenecks and redesigning the internals, which got us a 3x improvement in update throughput. I can't share the actual code (the service is internal only, and I work in a heavily regulated industry), but I can offer some pointers:

  1. BTree.writeRetentionQueue uses HardReferenceQueue <: RingBuffer. I've only seen ring buffers used in producer/consumer scenarios, and this is definitely not one of them. Ignore the documentation: there is no LRU policy, only containment checks over a small window (the last nscan entries). Because of those weak containment checks, the queue quickly fills up with duplicate instances, resulting in high write amplification. A simple workaround is to back the queue with an IdentityHashMap for the existence check instead of the nscan probe (a toy version appears after this list); otherwise, you need to ensure that children are always written before their parents. First things first, though: add some basic stats on cache hits/misses and policy-based evictions (both dirty and clean nodes) and expose them in the metrics, excluding the clear made on commit. You'll see that for any index with sparse access patterns, policy-based evictions increase dramatically -- a root eviction writes all of its dirty children, which then remain in the queue under the existing policy. If the sub-optimal buffering strategy is the issue, the metrics will show it (ideally, the buffer write should only happen on commit).
  2. If a large portion of your data consists of terms that need to be stored in the term dictionary (LexiconRelation), consider a custom implementation that you can write to in a pipeline-like manner. Mapping terms to shortened identifiers has to happen before you can write the statement indexes, but it does not have to happen inside a transaction (it's append-only, after all). Take a look at Jena TDB2: a hash-based approach moves the variable-length data outside the index structure, so the index stays better balanced. If your code needs ordered scans (if I recall correctly, the range scan is only used for getting counts, which isn't hard to maintain externally), seriously consider whether you actually need Unicode collation. If you do, make sure to normalize the value before inserting into ID2TERM (collation removes control characters but does not normalize the value before inserting into TERM2ID, so LexiconRelation is not a proper bijection).
  3. Reduce allocation of byte arrays; reuse them from a pool where possible (a toy pool appears after this list). The plain front coding impl allocates a lot of them, which fills up the TLAB (a bump-pointer allocator) and causes more frequent minor GCs (unless you're on a pauseless collector, each one is still a stop-the-world event). Copying from eden to one of the survivor spaces, and toggling between them before promotion, practically guarantees that a later access misses L1 and likely L2; it also makes objects age faster and get promoted to the tenured generation sooner.
  4. Java does not unroll loops unless it can use superwords (SSE, AVX, etc.). With PFC, in my case, Hotspot never vectorized the comparisons, so prefix searches were always done byte by byte. If you're on a recent Java (newer than 8), you can use Arrays.mismatch (added in Java 9); in 11 it uses Hotspot intrinsics (I believe AVX is a requirement, though). A sketch follows this list.
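
To make pointer 1 concrete, here's a toy version of the duplicate-suppression idea. Everything below is my own naming; the only real Blazegraph identifiers in play are BTree.writeRetentionQueue, HardReferenceQueue, and nscan:

    import java.util.ArrayDeque;
    import java.util.IdentityHashMap;

    /**
     * Toy write-retention queue with O(1) duplicate suppression. Blazegraph's
     * HardReferenceQueue instead probes only the last nscan slots, so duplicate
     * node references slip through and inflate write amplification.
     */
    public class DedupRetentionQueue<T> {
        private final ArrayDeque<T> queue = new ArrayDeque<>();
        private final IdentityHashMap<T, Boolean> members = new IdentityHashMap<>();
        private final int capacity;

        public DedupRetentionQueue(int capacity) { this.capacity = capacity; }

        /** Returns the evicted element (caller writes it out), or null. */
        public T add(T node) {
            if (members.containsKey(node)) return null; // already buffered
            T evicted = null;
            if (queue.size() == capacity) {
                evicted = queue.removeFirst();
                members.remove(evicted);
            }
            queue.addLast(node);
            members.put(node, Boolean.TRUE);
            return evicted;
        }
    }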
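
For pointer 3, a toy per-thread buffer pool; the sizes and names are illustrative, not Blazegraph's:

    import java.util.ArrayDeque;

    /**
     * Toy per-thread byte[] pool. Reusing buffers keeps hot arrays in cache
     * and off the TLAB/GC treadmill; callers track their own logical lengths,
     * as with any pooled buffer.
     */
    public final class BufferPool {
        private static final int MAX_POOLED = 64;
        private static final ThreadLocal<ArrayDeque<byte[]>> POOL =
                ThreadLocal.withInitial(ArrayDeque::new);

        public static byte[] acquire(int size) {
            byte[] buf = POOL.get().pollFirst();
            // Reuse a pooled buffer when it's big enough; a too-small buffer
            // is simply dropped here (acceptable waste for a sketch).
            return (buf != null && buf.length >= size) ? buf : new byte[size];
        }

        public static void release(byte[] buf) {
            ArrayDeque<byte[]> pool = POOL.get();
            if (pool.size() < MAX_POOLED) pool.addFirst(buf);
        }
    }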
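
And for pointer 4, the range compare itself; Arrays.mismatch is real JDK API (Java 9+), the helper around it is mine:

    import java.util.Arrays;

    /**
     * Arrays.mismatch is intrinsified on recent Hotspot and compares with
     * SIMD-width loads, unlike a hand-written byte-at-a-time loop.
     */
    public class PrefixCompare {
        /** True if key starts with prefix. */
        static boolean hasPrefix(byte[] key, byte[] prefix) {
            if (key.length < prefix.length) return false;
            int m = Arrays.mismatch(key, 0, prefix.length, prefix, 0, prefix.length);
            return m == -1; // -1 means the examined ranges are identical
        }
    }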

Edited for clarity and mistakes.