JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org

Retry bulk request if all errors are configured retry error codes & Split large bulk requests #4489

Closed. criminosis closed this PR 3 months ago.

criminosis commented 4 months ago

Closes #4488

Reuses the retry logic from https://github.com/JanusGraph/janusgraph/pull/4409 so that it also applies to bulk requests when an inner bulk item request fails.
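Illustrative sketch of the retry condition the PR title describes (the types and names below are hypothetical stand-ins, not the classes touched by this PR): a bulk response is retried only when it has at least one failed item and every failed item reports a status code from the configured set of retriable error codes.

```java
// Illustrative only; hypothetical names, not the PR's actual code.
import java.util.List;
import java.util.Set;

public class BulkRetrySketch {

    // Minimal stand-in for one item result inside a bulk response.
    record BulkItemResult(String documentId, int status, boolean failed) {}

    // Hypothetical set of retriable codes, e.g. 429 (Too Many Requests).
    private static final Set<Integer> RETRY_ERROR_CODES = Set.of(429);

    // Retry only if there is at least one failure and every failure is retriable.
    static boolean shouldRetry(List<BulkItemResult> items) {
        List<BulkItemResult> failures = items.stream()
                .filter(BulkItemResult::failed)
                .toList();
        return !failures.isEmpty()
                && failures.stream().allMatch(f -> RETRY_ERROR_CODES.contains(f.status()));
    }

    public static void main(String[] args) {
        List<BulkItemResult> response = List.of(
                new BulkItemResult("doc-1", 200, false),
                new BulkItemResult("doc-2", 429, true));
        System.out.println("retry? " + shouldRetry(response)); // prints: retry? true
    }
}
```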


criminosis commented 4 months ago

Had an idea for an improvement this morning: only retry the bulk items that returned an error. This is predicated upon the Bulk API docs stating that the response's bulk item order is the same as the submission order:

The bulk API’s response contains the individual results of each operation in the request, returned in the order submitted. The success or failure of an individual operation does not affect other operations in the request.

Wasn't sure if that would be preferable though, so I left it as an unsquashed addition to the PR in case it'd be preferable to revert it. If it looks good, I'm happy to squash it down to true up the PR for merging.
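Rough illustration of the "retry only the failed items" idea (hypothetical types, not the PR's code): because the Bulk API returns item results in submission order, the result at index i can be paired with the submitted action at index i, and only the failed ones need to be resubmitted.

```java
// Illustrative sketch; hypothetical types, not the PR's actual classes.
import java.util.ArrayList;
import java.util.List;

public class PartialBulkRetrySketch {

    record BulkAction(String payload) {}
    record BulkItemResult(int status, boolean failed) {}

    // Pair results with submitted actions by index and keep only the failures.
    static List<BulkAction> actionsToRetry(List<BulkAction> submitted,
                                           List<BulkItemResult> results) {
        List<BulkAction> retries = new ArrayList<>();
        for (int i = 0; i < results.size(); i++) {
            if (results.get(i).failed()) {
                retries.add(submitted.get(i)); // same index == same submission order
            }
        }
        return retries;
    }
}
```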

criminosis commented 4 months ago

Added a test for bulk retry in 81b07ac68.

While I was working on that, though, I noticed the ES docs state there is a size limit for bulk requests:

There is no "correct" number of actions to perform in a single bulk request. Experiment with different settings to find the optimal size for your particular workload. Note that Elasticsearch limits the maximum size of a HTTP request to 100mb by default ...

I've bumped into this in my JG deployment: ES pushes back saying the payload is too large. From what I could tell, I didn't have a meaningful knob to turn, so I worked around the issue by splitting up calls more aggressively on my JG client side. Since I was already working in the bulk request logic, I implemented a simple request chunker.

If a bulk request has a payload that is too large for a single call, it'll be split into multiple bulk request calls. Like last time, I'm keeping things unsquashed for discussion in case there is not a desire for the change.
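A minimal sketch of what such a chunker can look like (illustrative only; it assumes the bulk payload has already been serialized into per-item strings and uses a made-up byte-limit parameter, not the PR's actual implementation):

```java
// Illustrative chunker sketch, not the PR's implementation.
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BulkChunkerSketch {

    // Split serialized bulk lines into chunks no larger than limitBytes each.
    static List<List<String>> chunk(List<String> bulkLines, long limitBytes) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String line : bulkLines) {
            long lineBytes = line.getBytes(StandardCharsets.UTF_8).length;
            if (lineBytes > limitBytes) {
                // A single item larger than the limit can never be sent.
                throw new IllegalArgumentException("Bulk item exceeds the chunk limit");
            }
            if (!current.isEmpty() && currentBytes + lineBytes > limitBytes) {
                chunks.add(current);          // flush the full chunk
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(line);
            currentBytes += lineBytes;
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }
}
```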

criminosis commented 3 months ago

@li-boxuan / @porunov this might interest you since you reviewed the last one? I accidentally left the job half finished; I didn't realize bulk request errors were nested at the item level.

criminosis commented 3 months ago

A very small BULK_CHUNK_SIZE_LIMIT_BYTES that would fail the request - demonstrating the "we've encountered an element we cannot send to Elasticsearch given the configured limit" scenario

@li-boxuan I don't think it's possible to write such a test because JanusGraph doesn't treat such a scenario as a failure in code or in documentation.

If the primary persistence into the storage backend succeeds but secondary persistence into the indexing backends or the logging system fail, the transaction is still considered to be successful because the storage backend is the authoritative source of the graph.

So the response from the traversal comes back as a success. I guess I could assert that a query that uses the mixed index stored in ES fails to return the vertex, because now that index is blind to the property...but that feels like a weird thing to test for, although it is congruent with the intended behavior.

I've bumped into this asymmetric situation myself: a query using an equality predicate (which uses the storage backend's composite index) returns the expected vertex, but because the write to the mixed index failed (while the transaction still "succeeded"), a query using something like a range predicate over the same value won't find the vertex.
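For illustration, a hedged sketch of what asserting that asymmetry could look like with a TinkerPop traversal (the property name and value are made up; this is not the PR's actual test):

```java
// Illustrative sketch; property name and value are hypothetical.
import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class IndexAsymmetrySketch {

    static void assertAsymmetry(GraphTraversalSource g) {
        // Equality predicate: served by the composite index in the storage
        // backend, so the vertex is found.
        boolean foundByEquality = g.V().has("someProperty", 42).hasNext();

        // Range predicate: needs the mixed (Elasticsearch) index, which never
        // received the write, so the vertex is not found.
        boolean foundByRange = g.V().has("someProperty", P.gte(42)).hasNext();

        if (!foundByEquality || foundByRange) {
            throw new AssertionError("Expected a composite-index hit and a mixed-index miss");
        }
    }
}
```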

A relatively small BULK_CHUNK_SIZE_LIMIT_BYTES that would make the request split into chunks. We probably have no way to enforce that in the test (we lack observability) but I guess we could at least use IDE debugger to verify that manually.

Took a stab at this in https://github.com/JanusGraph/janusgraph/pull/4489/commits/af88b6a0a47140cc250f55af29f36574829d9754, and marked it @Disabled so it doesn't contribute to the GH execution time given it has no assertions.

criminosis commented 3 months ago

@li-boxuan it sounded like there wasn't anything objectionable about the additions, so I went ahead and squashed & signed off on the commit so the PR is back in a merge-ready state pending further feedback.

li-boxuan commented 3 months ago

I guess I could assert that a query that uses the mixed index stored in ES fails to return the vertex because now that index is blind to the property...

A much more cumbersome approach is to read the transaction recovery log... but I won't set this as acceptance criteria.

I do think what you mentioned here seems promising. Doesn't seem weird to me :) we probably already have tests doing something similar.

criminosis commented 3 months ago

@li-boxuan I wrote writingAnItemLargerThanPermittedChunkLimitFails as requested to assert the expected data asymmetry between the composite and mixed indexes when an item exceeds the permitted bulk chunk limit.

An alternative idea I had: rather than aborting the bulk write the moment an oversized item is observed, we could still write the other elements that aren't in excess of the limit. Get in at least what can go in rather than have that data be lost too.

In other words, given a batch of mutations consisting of an exceptionally large mutation followed by some number of small ones, fail the first, but don't throw an exception until later.
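A small sketch of that follow-up idea (hypothetical names, not an actual implementation): set aside the oversized items, flush everything that fits, and only raise an error afterwards so the sendable data isn't lost along with the oversized item.

```java
// Illustrative sketch of the "defer the failure" idea; hypothetical names.
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class DeferredFailureSketch {

    static void writeWhatFits(List<String> bulkLines,
                              long limitBytes,
                              Consumer<List<String>> flush) {
        List<String> sendable = new ArrayList<>();
        List<String> oversized = new ArrayList<>();
        for (String line : bulkLines) {
            long bytes = line.getBytes(StandardCharsets.UTF_8).length;
            if (bytes > limitBytes) {
                oversized.add(line);   // remember it, but keep going
            } else {
                sendable.add(line);
            }
        }
        flush.accept(sendable);        // get in at least what can go in
        if (!oversized.isEmpty()) {
            // Fail only after the rest has been written.
            throw new IllegalStateException(
                    oversized.size() + " item(s) exceeded the bulk chunk limit");
        }
    }
}
```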

li-boxuan commented 3 months ago

In other words, given a batch of mutations consisting of an exceptionally large mutation followed by some number of small ones, fail the first, but don't throw an exception until later.

I like that idea! Do you wanna open a follow-up PR for that 😉? Maybe a config to control it ;)

criminosis commented 3 months ago

Yeah, that'd probably be best done as a followup PR.

criminosis commented 3 months ago

Happy to, @li-boxuan. JanusGraph has been a critical piece of what I'm working on, which is why I've been bumping into these sharp edges 😅 .

Looking forward to getting a nightly build into my environment and putting this change through its paces whenever @porunov has time to give their review.

criminosis commented 3 months ago

@porunov I'd be happy to do this in the follow-up you mentioned. There's already a tentative PR in mind that I was discussing with @li-boxuan above; I figured I'd get rolling on those after this is merged.

Since you mentioned them as follow-ups, is this PR valid for merging? I saw you tagged it for 1.1.0, so I wasn't sure if that meant it was going to hang out unmerged for a while since there's also a live 1.0.1 milestone.

porunov commented 3 months ago

@criminosis Just merged this PR into the ‘master’ branch. Thus, these changes will be available in the 1.1.0 release and in the following “commit” releases. However, it won’t be backported to the 1.0.1 release since it’s based on a feature which isn’t available in the 1.0.0 release. If you want to port these changes to the 1.0.1 release for any reason, feel free to open a PR targeting the ‘v1.0’ branch with these changes and your previous changes.

criminosis commented 3 months ago

Thanks @porunov! Having it in 1.1.0 and the next commit release is perfectly fine. I was asking just out of curiosity about the merging logistics.

Excited to get this into my environment soon as I've been bumping into what prompted this PR fairly often.

I'll get rolling on the follow-ups. 👍