Closed: ngrigoriev closed this issue 6 years ago.
Sorry, I think I jumped to a conclusion a bit too fast. I finally found that the clock drift on the hosts was significant enough to affect this logic :( Please mark this as a question and close, thanks.
@ngrigoriev I saw the analysis on the jg project and agree with the conclusions about drift. Closing.
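A follow-up note for anyone landing here with the same symptom: besides tightening NTP sync, it may be worth looking at the per-log `read-lag-time` setting, which (as far as I understand it) widens the slack the log reader allows for writes that appear "in the past". The option name, the `janusgraph` log identifier, the DynamoDB backend class name, and the value below are all assumptions to verify against the configuration reference for your JanusGraph version; this is only a sketch, not guidance from the maintainers.

```java
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class OpenWithMoreLogSlack {
    public static void main(String[] args) {
        // Assumed option names/values -- double-check them against the JanusGraph
        // configuration reference for your version before relying on this.
        JanusGraph graph = JanusGraphFactory.build()
                // DynamoDB backend as used in this issue (class name assumed)
                .set("storage.backend", "com.amazon.janusgraph.diskstorage.dynamodb.DynamoDBStoreManager")
                // assumed option: extra lag tolerated by the management-log reader
                .set("log.janusgraph.read-lag-time", "30000 ms")
                .open();
        graph.close();
    }
}
```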
I am consistently observing the following problem: half of my JG instances are missing the management messages. This breaks the schema updates and, in particular, the index updates.
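For context, the index update in question is essentially the usual register/enable plus await sequence; the sketch below is an approximation of my flow, not an exact copy, and `byNameComposite` is just a placeholder index name.

```java
import java.time.temporal.ChronoUnit;

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.schema.JanusGraphManagement;
import org.janusgraph.core.schema.SchemaAction;
import org.janusgraph.core.schema.SchemaStatus;
import org.janusgraph.graphdb.database.management.ManagementSystem;

public class EnableIndexSketch {

    // "byNameComposite" is a placeholder for the real index name.
    static void enableIndex(JanusGraph graph) throws InterruptedException {
        JanusGraphManagement mgmt = graph.openManagement();
        mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REGISTER_INDEX);
        mgmt.commit();

        // Waits for the index to become REGISTERED, which requires every open JG
        // instance to acknowledge the schema change via the management log.
        // With one instance never seeing those messages, this step never completes.
        ManagementSystem.awaitGraphIndexStatus(graph, "byNameComposite")
                .status(SchemaStatus.REGISTERED)
                .timeout(10, ChronoUnit.MINUTES)
                .call();

        mgmt = graph.openManagement();
        mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.ENABLE_INDEX);
        mgmt.commit();

        ManagementSystem.awaitGraphIndexStatus(graph, "byNameComposite")
                .status(SchemaStatus.ENABLED)
                .call();
    }
}
```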
I have two instances, each running in its own AZ (us-east-1a and us-east-1b). Both use the same JG 0.2.0 with the DynamoDB backend. I am trying to enable an index and the operation times out. I investigated the problem and found that one of the instances is not receiving the management messages. The other instance receives them and acknowledges the cache eviction message:
Good instance:
2018-03-26T20:25:55.261Z 20:25:55 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:25:49.755Z, 2018-03-26T20:25:54.760Z)
2018-03-26T20:25:55.261Z 20:25:55 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-148-102- 50-120, end: 0- 5-104- 86-148-178-145- 64, limit:1024)
2018-03-26T20:26:00.266Z 20:26:00 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:25:54.760Z, 2018-03-26T20:25:59.765Z)
2018-03-26T20:26:00.266Z 20:26:00 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-148-178-145- 64, end: 0- 5-104- 86-148-254-240- 8, limit:1024)
2018-03-26T20:26:05.271Z 20:26:05 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:25:59.765Z, 2018-03-26T20:26:04.770Z)
2018-03-26T20:26:05.271Z 20:26:05 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-148-254-240- 8, end: 0- 5-104- 86-149- 75- 78-208, limit:1024)
2018-03-26T20:26:05.276Z 20:26:05 DEBUG org.janusgraph.diskstorage.log.kcvs.KCVSLog - Parsed message Message@2018-03-26T20:26:00.483Z:0a0f8a8d1-ip-10-15-138-141-eu-central-1-compute-internal2=128-129-130-104-133- 1- 90-137, about to submit this message to the reader executor
2018-03-26T20:26:05.286Z 20:26:05 DEBUG org.janusgraph.diskstorage.log.kcvs.KCVSLog - Preparing to write MessageEnvelope[message=FutureMessage[Message@2018-03-26T20:26:05.281Z:ac1100031-418de3956bb01=129-160- 48- 97- 48-102- 56- 97- 56-100- 49- 45-105-112- 45- 49- 48- 45- 49- 53- 45- 49- 51- 56- 45- 49- 52- 49- 45-101-117- 45- 99-101-110-116-114- 97-108- 45- 49- 45- 99-111-109-112-117-116-101- 45-105-110-116-101-114-110- 97-108-178-129],key= 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223,entry= 0- 5-104- 86-149- 83- 26-232-160- 97- 99- 49- 49- 48- 48- 48- 51- 49- 45- 52- 49- 56-100-101- 51- 57- 53- 54- 98- 98- 48-177- 0- 0- 0- 0- 0- 0- 0- 1->129-160- 48- 97- 48-102- 56- 97- 56-100- 49- 45-105-112- 45- 49- 48- 45- 49- 53- 45- 49- 51- 56- 45- 49- 52- 49- 45-101-117- 45- 99-101-110-116-114- 97-108- 45- 49- 45- 99-111-109-112-117-116-101- 45-105-110-116-101-114-110- 97-108-178-129] to storage with column/timestamp 2018-03-26T20:26:05.281Z
2018-03-26T20:26:05.287Z 20:26:05 DEBUG org.janusgraph.diskstorage.log.kcvs.KCVSLog - Built mutation on key 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223 with 1 additions
2018-03-26T20:26:05.292Z 20:26:05 DEBUG org.janusgraph.diskstorage.log.kcvs.KCVSLog - Wrote 1 total envelopes with operation timestamp 2018-03-26T20:26:05.286Z
2018-03-26T20:26:05.292Z 20:26:05 DEBUG org.janusgraph.diskstorage.log.kcvs.KCVSLog - Wrote 1 messages to backend
2018-03-26T20:26:10.277Z 20:26:10 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:26:04.770Z, 2018-03-26T20:26:09.777Z)
2018-03-26T20:26:10.278Z 20:26:10 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-149- 75- 78-208, end: 0- 5-104- 86-149-151-181-104, limit:1024)
Bad instance:
2018-03-26T20:25:51.294Z 20:25:51 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:25:45.789Z, 2018-03-26T20:25:50.794Z)
2018-03-26T20:25:51.295Z 20:25:51 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-148- 41-174- 72, end: 0- 5-104- 86-148-118- 13- 16, limit:1024)
2018-03-26T20:25:56.300Z 20:25:56 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:25:50.794Z, 2018-03-26T20:25:55.799Z)
2018-03-26T20:25:56.300Z 20:25:56 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-148-118- 13- 16, end: 0- 5-104- 86-148-194-107-216, limit:1024)
2018-03-26T20:26:01.305Z 20:26:01 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:25:55.799Z, 2018-03-26T20:26:00.804Z)
2018-03-26T20:26:01.305Z 20:26:01 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-148-194-107-216, end: 0- 5-104- 86-149- 14-202-160, limit:1024)
2018-03-26T20:26:06.310Z 20:26:06 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:26:00.804Z, 2018-03-26T20:26:05.810Z)
2018-03-26T20:26:06.310Z 20:26:06 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-149- 14-202-160, end: 0- 5-104- 86-149- 91- 45- 80, limit:1024)
2018-03-26T20:26:11.315Z 20:26:11 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:26:05.810Z, 2018-03-26T20:26:10.815Z)
2018-03-26T20:26:11.315Z 20:26:11 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-149- 91- 45- 80, end: 0- 5-104- 86-149-167-140- 24, limit:1024)
2018-03-26T20:26:16.321Z 20:26:16 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:26:10.815Z, 2018-03-26T20:26:15.820Z)
2018-03-26T20:26:16.322Z 20:26:16 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-149-167-140- 24, end: 0- 5-104- 86-149-243-234-224, limit:1024)
2018-03-26T20:26:21.326Z 20:26:21 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - MessagePuller time window: [2018-03-26T20:26:15.820Z, 2018-03-26T20:26:20.826Z)
2018-03-26T20:26:21.327Z 20:26:21 TRACE org.janusgraph.diskstorage.log.kcvs.KCVSLog - Converted MessagePuller time window to KeySliceQuery(key: 0- 0- 0- 0- 0- 0- 0- 0- 0-232- 64-223, start: 0- 5-104- 86-149-243-234-224, end: 0- 5-104- 86-150- 64- 77-144, limit:1024)
We did some testing and found that the problem is consistent: regardless of what I do, the same instance always receives the management messages and the same one always misses them.
I am running JG instances via ECS (cluster of m4.large instances), with NTP enabled.
I have also tried creating an environment manually with 4 JG instances running across two AZs, and that worked fine - all 4 responded correctly.
I understand that this problem sits somewhere on the border between JG and the storage backend, but I hope I have a better chance of getting help here, closer to the source :)
I have started looking at the KCVSLog code, and so far I have the feeling that it may be very sensitive to clock accuracy. In my scenario there are 3 JG instances: two running in different AZs, plus one more that performs the schema update. All of them run an NTP daemon, but I have seen the clock drift on AWS VMs get quite large. I am trying not to jump to conclusions too quickly, but so far the only theory I have is that some of my instances skip these management messages because of the KCVSLog message-pulling logic. This theory is somewhat supported by the fact that the instance that "skips" the messages does so CONSISTENTLY, i.e. it never receives any management messages, while the "good" instance always receives them. A toy illustration of how such a clock offset could interact with the pull window follows below.
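To make the theory concrete, here is a toy model of the suspected interaction. This is not JanusGraph code: the 5 s read window matches the MessagePuller traces above, but the 12 s clock offset is a made-up number purely for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Toy model only -- NOT the actual JanusGraph implementation.
public class DriftSkipDemo {
    public static void main(String[] args) {
        Duration readInterval = Duration.ofSeconds(5);    // window width seen in the KCVSLog traces above
        Duration writerOffset = Duration.ofSeconds(-12);  // hypothetical: writer clock runs 12 s behind the reader

        Instant readerNow   = Instant.parse("2018-03-26T20:26:05.000Z");
        Instant windowStart = readerNow.minus(readInterval);   // reader currently scans [20:26:00, 20:26:05)
        Instant messageTs   = readerNow.plus(writerOffset);    // writer stamps the message 20:25:53

        // The windows that covered 20:25:53 were scanned before the write even reached
        // storage, and the current window starts after the message's timestamp, so the
        // reader never delivers this message on any poll.
        boolean inCurrentWindow = !messageTs.isBefore(windowStart) && messageTs.isBefore(readerNow);
        System.out.printf("message=%s window=[%s, %s) inCurrentWindow=%b%n",
                messageTs, windowStart, readerNow, inCurrentWindow);
    }
}
```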