@tschottdorf Assigning based on pattern-matching the ResolveIntent messages. Feel free to reassign.
Thanks for your report, @leomkkwan! Based on this line
storage/queue.go:656 [gc,n3,s15,r2086/6:/Table/61/1/{196359…-205744…}] context timed out during local key range processing after: context deadline exceeded
it seems that you're hitting some variant of the problem that is being addressed in (a follow-up to) https://github.com/cockroachdb/cockroach/pull/19093.
Would you be able to share with us the workload that was running on this cluster before it got into this state? I have been able to reproduce such badness in my testing, but only with internal load generators, never with actual user data.
Also, mind opening http[s]://<your admin UI>, finding the MVCC Intent Bytes/Count
row and seeing whether those numbers decrease after ~10min? If they do, then that's an indication that the clogging in your cluster is (potentially slowly) resolving itself.
I was trying to migrate a large table from mysql into CRDB. Basically it was a lot of multi-row insert statements with on conflict do nothing, like Insert into Table (ID,b,c,d) values ((1,2,3,4),(4,5,6,7)......x1000 ) ON CONFLICT(ID) DO NOTHING;
ID is the primary key.
The error messages are still showing up in the log file, even though I've stopped the import query for two days now, with different ranges. Every time I try to load the report for a range from the log, the "MVCC Intent Bytes/Count" is always 0.
The cluster only has spinning hard drives, and the kv.raft_log.synchronize setting has been set to false. However, when I was running the insert query, there was no power failure or any other problem. Not sure if that will make a difference.
Thanks for looking into this.
Leo
Sorry for not getting back to you earlier, @leomkkwan. The MVCC System Bytes metric
is pretty high (~15MB), and it looks like my initial guess was correct. Essentially, there are a lot of transaction records that must be removed, but handling them takes longer than it should.
While you're importing, do you see any errors in your client? With an import running, could you send me graphs from the /#/cluster/all/distributed UI page (while hovering over the kv transactions graph)?
And, to double check, you're inserting in batches of 1000 rows each, right?
Insert into Table (ID,b,c,d) values ((1,2,3,4),(4,5,6,7)......x1000 ) ON CONFLICT(ID) DO NOTHING;
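For reference (this is not from the thread itself): a minimal Go sketch of what a batched multi-row INSERT ... ON CONFLICT DO NOTHING loader like the one described above might look like, assuming the lib/pq driver and a hypothetical table t(id INT PRIMARY KEY, b INT, c INT, d INT); the connection string and table names are placeholders.

package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "github.com/lib/pq"
)

// insertBatch issues one multi-row INSERT ... ON CONFLICT (id) DO NOTHING for
// the given rows, mirroring the 1000-rows-per-statement pattern described above.
func insertBatch(db *sql.DB, rows [][4]int) error {
	var (
		placeholders []string
		args         []interface{}
	)
	for i, r := range rows {
		placeholders = append(placeholders,
			fmt.Sprintf("($%d,$%d,$%d,$%d)", i*4+1, i*4+2, i*4+3, i*4+4))
		args = append(args, r[0], r[1], r[2], r[3])
	}
	stmt := "INSERT INTO t (id, b, c, d) VALUES " + strings.Join(placeholders, ",") +
		" ON CONFLICT (id) DO NOTHING"
	_, err := db.Exec(stmt, args...)
	return err
}

func main() {
	// Placeholder connection string; adjust host, database, and TLS settings.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	batch := make([][4]int, 0, 1000)
	for i := 0; i < 1000; i++ {
		batch = append(batch, [4]int{i, i + 1, i + 2, i + 3})
	}
	if err := insertBatch(db, batch); err != nil {
		log.Fatal(err)
	}
}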
Don't want to hijack this thread, but I'm having what I believe is the same issue: 3-node cluster, all running v1.1.1.
E171024 21:42:05.398288 51 storage/queue.go:656 [gc,n4,s4,r163/7:/Table/138{-/1/-307…}] context timed out during local key range processing after: context deadline exceeded
although in my case it's only one of the 3 nodes that's at 100% CPU and throwing these errors, about one a minute; the other two seem to be unaffected.
I seem to have caused this issue a different way, by adding new nodes and decommissioning old ones. But my question is: does this eventually "catch up", or what do I do to fix this? It's causing execution latencies upwards of 15 seconds right now on the bad node.
@NathanRThomas not a bother at all, thanks for posting! It's not likely that adding and decommissioning nodes is the root cause. Would you be able to share your workload (i.e. your table schema and perhaps the way you're inserting data into it)?
Assuming that the problem is still occurring by the time you read this, would you be able to run the following and send along the output?
cockroach debug estimate-gc <path-to-your-data-directory>
Thanks for sending the data along, @NathanRThomas. I'm seeing the following for range 163 (the one from your log message above):
RangeID: 163 [/Table/138, /Table/138/1/-3072139):
storage.GCInfo{
Now: hlc.Timestamp{WallTime:1508940258506058366, Logical:0},
Policy: config.GCPolicy{TTLSeconds:86400},
NumKeysAffected: 1872,
IntentsConsidered: 0,
IntentTxns: 0,
TransactionSpanTotal: 4,
TransactionSpanGCAborted: 2,
TransactionSpanGCCommitted: 2,
TransactionSpanGCPending: 0,
TxnSpanGCThreshold: hlc.Timestamp{WallTime:1508936658506058366, Logical:0},
AbortSpanTotal: 0,
AbortSpanConsidered: 0,
AbortSpanGCNum: 0,
PushTxn: 0,
ResolveTotal: 86533,
ResolveSuccess: 86533,
Threshold: hlc.Timestamp{WallTime:1508853858506058366, Logical:0},
}
This indicates that the garbage collection considers four transaction records for removal, but racks up an impressive 86533 intents it has to remove first! Clearly we don't handle that case well, and we'll work on improving the code (actually we already have, but it hasn't landed yet: https://github.com/cockroachdb/cockroach/pull/18643 and https://github.com/cockroachdb/cockroach/pull/18661).
I wonder how this situation came to be - is this perhaps a table that you dropped or truncated? What's the output of select * from crdb_internal.tables where table_id = 138;?
ok, so... dumb question, what's an intent? is that an update? table 138 is our users table, which is the most used/updated table.
I was actually trying to be really good about not updating that table too much as i caused this issue 2 weeks ago https://github.com/cockroachdb/cockroach/issues/18199 when i was trying to back-populate a new column in the users table. Ended up doing a backup and creating a new cluster.
is this something that will eventually resolve? or what should i do right now to fix this?
An intent is what a transactional write writes. Intents are typically removed (if the corresponding transaction is aborted) or turned into values (if the txn commits). Due to the distributed nature of the system, sometimes intents are left behind and have to be cleaned up later (though this should be rare!) and it seems that in this case a lot of intents have accumulated. In fact, so many that our GC queue runs out of time collecting everything before it starts removing anything!
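As a purely illustrative aside (made-up names, not CockroachDB's MVCC code), the lifecycle just described boils down to something like this Go sketch:

package main

import "fmt"

type txnStatus int

const (
	pending txnStatus = iota
	committed
	aborted
)

// intent models a provisional value written by an in-flight transaction.
type intent struct {
	key, provisionalValue string
}

// resolve turns a committed intent into a regular value and drops an aborted
// one; a pending intent cannot be resolved yet, which is what piles up here.
func resolve(kv map[string]string, in intent, status txnStatus) {
	switch status {
	case committed:
		kv[in.key] = in.provisionalValue // intent becomes a visible value
	case aborted:
		// intent is removed; nothing becomes visible
	case pending:
		// leave the intent in place; it must be cleaned up later
	}
}

func main() {
	kv := map[string]string{}
	resolve(kv, intent{key: "a", provisionalValue: "1"}, committed)
	resolve(kv, intent{key: "b", provisionalValue: "2"}, aborted)
	fmt.Println(kv) // map[a:1]
}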
This isn't your fault, we just need to handle this situation better. First of all, I'm not sure how that many intents actually ended up being left over. Perhaps you can share the queries that usually mutate this table (obviously you might prefer not posting those publicly)?
In the meantime, I'll try to whip up a "quick fix" that you can run that hopefully allows some progress to be made.
If you're feeling lucky, try running a SELECT COUNT(*)
on this table (mostly because I'm curious whether it'll return) -- but be advised that this may make the cluster even more unhappy.
Actually, something else that may be helpful is if you looked at
http://yourui:yourport/debug/requests?fam=tracing&b=6&exp=1
and sent me that page. It should contain detailed information on the execution of the GC process. You should use the ip:port of the node that has 100% CPU at the time of checking.
If you're unable to access that page, use
SET CLUSTER SETTING server.remote_debugging.mode = 'any';
but be advised that this makes this kind of information accessible to anyone who can access the ui. You may want to reset it to its default value 'local' after.
+----------+
| count(*) |
+----------+
|   360090 |
+----------+
(1 row)
Time: 10.923034528s
ok, so I did the url for all of the nodes actually. Sorry, I'm not sure what exactly I'm looking for. That exact url doesn't really show anything. Clicking around, all of them only had "1 active", and it was "6560.800944 server shutdown" or about that many seconds, which relates to this morning when the nodes were failing. None of them had anything under the errors section.
The last schema change to that table was 7 days ago; I dropped a column and an index on that column. As far as transactions go, with the cluster failing this morning I had more than a few connectivity errors to the cluster, which I'm sure caused a lot of failed and abandoned transactions. Not sure if the high intent count was the cause or a symptom.
2 of the nodes have high CPU at the moment; uptime figures are below: 0.45 | 1.78 | 1.06 | 0.08
Only the node with 1.78 is throwing errors at the moment.
E171025 15:24:27.625048 139 storage/queue.go:656 [gc,n5,s5,r163/5:/Table/138{-/1/-307…}] context timed out during local key range processing after: context deadline exceeded
Apologies! You also need to run this:
set cluster setting trace.debug.enable = true;
which will cause these traces to be collected in the first place.
Interesting that the SELECT
passes and you still see the issue. I would've expected both to have the same kind of problem!
ok, so I ran that count(*) on the bad node, and it doesn't return. Earlier, when it did work, I had run it on a node that wasn't having issues.
i ran that cluster setting. looks like it's recording stuff now.
Ok, I'm fairly certain what's happening is that there is an old transaction record that contains references to a lot of intents. Unfortunately the way the cleanup of that is set up today is that it won't remove this record until it has cleaned up all of the intents referenced. Since there are so many referenced intents, the cleanup times out. As a result, the next time the record is inspected, all of the references are inspected again (even though some of them were already dealt with last time).
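Schematically, the behavior described above is roughly the following (a hedged Go sketch with invented names like txnRecord and resolveIntent, not the actual GC-queue code):

package main

import (
	"context"
	"fmt"
)

// Illustrative types only.
type intentRef struct{ key string }
type txnRecord struct{ intents []intentRef }

func resolveIntent(ctx context.Context, in intentRef) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // the GC deadline fires partway through the list
	default:
		return nil // pretend this intent was resolved
	}
}

func removeTxnRecord(ctx context.Context, rec txnRecord) error { return nil }

// gcTxnRecord only removes the transaction record after *every* referenced
// intent has been resolved; on a timeout, the progress made so far is not
// remembered, so the next GC pass walks the same (huge) list from the start.
func gcTxnRecord(ctx context.Context, rec txnRecord) error {
	for _, in := range rec.intents { // tens of thousands of entries in this case
		if err := resolveIntent(ctx, in); err != nil {
			return err
		}
	}
	return removeTxnRecord(ctx, rec)
}

func main() {
	// A zero timeout stands in for the GC queue's deadline expiring.
	ctx, cancel := context.WithTimeout(context.Background(), 0)
	defer cancel()
	rec := txnRecord{intents: make([]intentRef, 86533)}
	if err := gcTxnRecord(ctx, rec); err != nil {
		fmt.Println("GC pass gave up:", err) // "context deadline exceeded", as in the logs
	}
}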
I'll send you an updated binary that avoids this problem.
You want a linux-amd64 binary, right? Or can you build yourself (if you can, that's always preferable because then you don't have to trust my binary).
ya, a linux-amd64 build would be great, thanks. you seem like a good guy, i trust ya ;-)
Sent you a binary with https://github.com/cockroachdb/cockroach/pull/19538 on our 1.1 release branch (i.e. the branch that will be 1.1.2 in a few weeks).
omg! i think that fixed it. Can't thank you enough!
Glad to hear, I'll leave this open until #19538 becomes a real PR and merges.
Hi, I've taken down the v1.1 cluster, removed all data, and started a new v1.2-alpha.20171026 cluster with 4 machines.
Ran some tests with a lot of insert-on-duplicate-update statements, and the context deadline exceeded error pops up again, but with a different message.
[replica consistency checker,n1,s5,r8428/3:/Table/68/1/2{176744…-200186…}] could not CollectChecksum from replica (n2,s6):5: rpc error: code = Unknown desc = while waiting for compute checksum (ID = e18c365d-8e41-492e-b636-5a5c17a7c326): context deadline exceeded storage/replica_command.go:2337
All 4 machines are under high CPU load right now, and everything else seems to work.
Hi @leomkkwan, sorry -- didn't see this until now. This error is benign but indicates that the system is working hard. Is this a persistent problem?
Yes, it looks like it is a persistent problem. I've upgraded to v1.2-alpha.20171113; the CPU is still under load, even though the cluster has been idle for more than two weeks.
Here are some log entries. The "node already has a replica" message keeps repeating, and I can see it repeated for the same ranges.
In my cluster, each physical node has 5 stores, which are 5 separate HDD drives.
E171120 17:26:51.465572 29951229 storage/replica_command.go:2349 [replica consistency checker,n1,s4,r2607/4:/Table/58/1/91{35606…-44429…}] could not CollectChecksum from replica (n2,s10):6: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E171120 17:26:51.662339 1154 storage/queue.go:656 [replicate,n1,s5,r5905/13:/Table/59/1/1011{4416…-9240…}] [n1,s5,r5905/13:/Table/59/1/1011{4416…-9240…}]: unable to add replica (n1,s1):?; node already has a replica
E171120 17:26:51.670281 832 storage/queue.go:656 [replicate,n1,s4,r23/14:/Table/51/1/2889700{138…-493…}] [n1,s4,r23/14:/Table/51/1/2889700{138…-493…}]: unable to add replica (n4,s19):?; node already has a replica
E171120 17:26:51.674685 552 storage/queue.go:656 [replicate,n1,s2,r7110/4:/Table/61/3/"b{cca7f…-de2e6…}] [n1,s2,r7110/4:/Table/61/3/"b{cca7f…-de2e6…}]: unable to add replica (n2,s7):?; node already has a replica
E171120 17:26:51.725885 367 storage/queue.go:656 [replicate,n1,s1,r10627/3:/Table/73/1/87{35231…-60983…}] [n1,s1,r10627/3:/Table/73/1/87{35231…-60983…}]: unable to add replica (n2,s7):?; node already has a replica
E171120 17:26:51.858363 1154 storage/queue.go:656 [replicate,n1,s5,r2271/1:/Table/58/1/70{18765…-27590…}] [n1,s5,r2271/1:/Table/58/1/70{18765…-27590…}]: unable to add replica (n2,s7):?; node already has a replica
Updated: I also see an interesting log entry on only one node out of my three-node cluster, where RaftLogSelfBehindCount is slowly growing from 1576 to 1885 over around four and a half hours.
I171120 12:59:37.919663 2071 storage/store.go:4167 [n2,s7] temporarily disabling rebalances because RaftLogSelfBehindCount=1576
I171120 17:30:28.916415 2071 storage/store.go:4167 [n2,s7] temporarily disabling rebalances because RaftLogSelfBehindCount=1885
@leomkkwan sorry for having this drop off my radar. Are you still running this cluster?
The cluster still exists, but it has been upgraded to v2.0. The original context deadline exceeded error does not occur anymore.
Thanks @leomkkwan. Closing this out then.
I have a 4-node cluster running CockroachDB CCL v1.1.0-rc.1 (linux amd64, built 2017/10/05 15:33:25, go1.8.3). Each node is on its own physical machine running CRDB only.
All 4 machines are consuming 25% to 50% CPU, and this cluster has been idle for more than 24 hours. I've tried restarting the cluster, but it did not solve the problem.
I've checked the logs, and these errors keep coming in on all 4 machines.
I171012 14:43:09.975888 699 server/status/runtime.go:223 [n3] runtime stats: 4.0 GiB RSS, 598 goroutines, 414 MiB/112 MiB/585 MiB GO alloc/idle/total, 2.9 GiB/3.3 GiB CGO alloc/total, 299611.01cgo/sec, 6.37/0.15 %(u/s)time, 0.00 %gc (16x)
I171012 14:43:12.029415 273 storage/replica_proposal.go:453 [n3,s11,r4628/5:/Table/87/1/404{55634…-88319…}] new range lease repl=(n3,s11):5 start=1507819392.026966287,0 epo=97 pro=1507819392.026989217,0 following repl=(n1,s3):3 start=1507769912.947129817,0 epo=559 pro=1507770183.468698113,0
I171012 14:43:13.079773 337994 vendor/google.golang.org/grpc/grpclog/grpclog.go:75 grpc: Server.processUnaryRPC failed to write status stream error: code = DeadlineExceeded desc = "context deadline exceeded"
I171012 14:43:13.080532 338023 vendor/google.golang.org/grpc/grpclog/grpclog.go:75 grpc: Server.processUnaryRPC failed to write status stream error: code = DeadlineExceeded desc = "context deadline exceeded"
W171012 14:43:13.081687 337996 storage/replica.go:2696 [n3,s11,r2048/5:/Table/61/2/2{500125…-625237…}] context cancellation after 0.0s of attempting command ResolveIntent [/Table/61/2/25662301/106019811/0,/Min), ResolveIntent [/Table/61/2/25684711/106016581/0,/Min), ResolveIntent [/Table/61/2/25684711/106016591/0,/Min), ResolveIntent [/Table/61/2/25684711/106016601/0,/Min), ResolveIntent [/Table/61/2/25757361/106015861/0,/Min), ResolveIntent [/Table/61/2/25907091/106018541/0,/Min), ResolveIntent [/Table/61/2/26039291/106019751/0,/Min)
I171012 14:43:20.034854 699 server/status/runtime.go:223 [n3] runtime stats: 4.0 GiB RSS, 603 goroutines, 420 MiB/104 MiB/585 MiB GO alloc/idle/total, 3.0 GiB/3.4 GiB CGO alloc/total, 219366.29cgo/sec, 7.23/0.15 %(u/s)time, 0.00 %gc (15x)
W171012 14:43:26.986427 373144 storage/replica.go:2141 [n3,s13,r2087/8:/Table/61/2/5{866797…-988729…}] context deadline exceeded while in command queue: ResolveIntent [/Table/61/2/59051631/198666021/0,/Min), ResolveIntent [/Table/61/2/59054441/198668111/0,/Min), ResolveIntent [/Table/61/2/59056971/198669501/0,/Min), ResolveIntent [/Table/61/2/59057761/198665251/0,/Min), ResolveIntent [/Table/61/2/59058331/198666491/0,/Min), ResolveIntent [/Table/61/2/59063091/198668031/0,/Min), ResolveIntent [/Table/61/2/59070351/198669121/0,/Min), ResolveIntent [/Table/61/2/59070361/198668711/0,/Min), ResolveIntent [/Table/61/2/59070611/198669081/0,/Min), ResolveIntent [/Table/61/2/59071551/198665731/0,/Min), ResolveIntent [/Table/61/2/59075711/198669221/0,/Min), ResolveIntent [/Table/61/2/59076291/198669551/0,/Min), ResolveIntent [/Table/61/2/59081811/198667731/0,/Min), ResolveIntent [/Table/61/2/59089291/198665671/0,/Min), ResolveIntent [/Table/61/2/59089591/198667801/0,/Min), ResolveIntent [/Table/61/2/59091411/198667521/0,/Min), ResolveIntent [/Table/61/2/59096981/198665991/0,/Min), ResolveIntent [/Table/61/2/59098411/198665291/0,/Min), ResolveIntent [/Table/61/2/59100391/198665131/0,/Min), ResolveIntent [/Table/61/2/59100391/198665971/0,/Min), ... 75 skipped ..., ResolveIntent [/Table/61/2/59153051/198669251/0,/Min), ResolveIntent [/Table/61/2/59153531/198666351/0,/Min), ResolveIntent [/Table/61/2/59153621/198667121/0,/Min), ResolveIntent [/Table/61/2/59153711/198668851/0,/Min), ResolveIntent [/Table/61/2/59154091/198666971/0,/Min)
W171012 14:43:26.990270 373145 storage/replica.go:2141 [n3,s13,r2087/8:/Table/61/2/5{866797…-988729…}] context deadline exceeded while in command queue: ResolveIntent [/Table/61/2/59154401/198667871/0,/Min), ResolveIntent [/Table/61/2/59154441/198668381/0,/Min), ResolveIntent [/Table/61/2/59155031/198665841/0,/Min), ResolveIntent [/Table/61/2/59156621/198666691/0,/Min), ResolveIntent [/Table/61/2/59156961/198668141/0,/Min), ResolveIntent [/Table/61/2/59157101/198668271/0,/Min), ResolveIntent [/Table/61/2/59157641/198667421/0,/Min), ResolveIntent [/Table/61/2/59157811/198665171/0,/Min), ResolveIntent [/Table/61/2/59157811/198666331/0,/Min), ResolveIntent [/Table/61/2/59157811/198669451/0,/Min), ResolveIntent [/Table/61/2/59158481/198666011/0,/Min), ResolveIntent [/Table/61/2/59158501/198669931/0,/Min), ResolveIntent [/Table/61/2/59158591/198669431/0,/Min), ResolveIntent [/Table/61/2/59159981/198667541/0,/Min), ResolveIntent [/Table/61/2/59160571/198669131/0,/Min), ResolveIntent [/Table/61/2/59160711/198665561/0,/Min), ResolveIntent [/Table/61/2/59160721/198665341/0,/Min), ResolveIntent [/Table/61/2/59163421/198665211/0,/Min), ResolveIntent [/Table/61/2/59163971/198665491/0,/Min), ResolveIntent [/Table/61/2/59164371/198667791/0,/Min), ... 75 skipped ..., ResolveIntent [/Table/61/2/59185981/198667701/0,/Min), ResolveIntent [/Table/61/2/59186061/198667111/0,/Min), ResolveIntent [/Table/61/2/59186191/198665021/0,/Min), ResolveIntent [/Table/61/2/59186191/198665941/0,/Min), ResolveIntent [/Table/61/2/59186221/198665001/0,/Min)
W171012 14:43:26.990841 373181 storage/replica.go:2141 [n3,s13,r2087/8:/Table/61/2/5{866797…-988729…}] context deadline exceeded while in command queue: ResolveIntent [/Table/61/2/58676911/198665391/0,/Min), ResolveIntent [/Table/61/2/58676911/198669741/0,/Min), ResolveIntent [/Table/61/2/58752931/198666631/0,/Min), ResolveIntent [/Table/61/2/58890441/198669411/0,/Min), ResolveIntent [/Table/61/2/58894011/198668351/0,/Min), ResolveIntent [/Table/61/2/58897301/198668901/0,/Min), ResolveIntent [/Table/61/2/58942861/198667621/0,/Min), ResolveIntent [/Table/61/2/58946141/198665141/0,/Min), ResolveIntent [/Table/61/2/58963081/198667661/0,/Min), ResolveIntent [/Table/61/2/58973451/198665441/0,/Min), ResolveIntent [/Table/61/2/58973451/198669561/0,/Min), ResolveIntent [/Table/61/2/58991571/198665091/0,/Min), ResolveIntent [/Table/61/2/59020051/198667261/0,/Min), ResolveIntent [/Table/61/2/59032791/198665911/0,/Min), ResolveIntent [/Table/61/2/59037201/198666281/0,/Min), ResolveIntent [/Table/61/2/59040771/198669321/0,/Min), ResolveIntent [/Table/61/2/59043061/198669781/0,/Min)
W171012 14:43:27.018941 373146 storage/replica.go:2141 [n3,s13,r2087/8:/Table/61/2/5{866797…-988729…}] context deadline exceeded while in command queue: ResolveIntent [/Table/61/2/59186251/198669461/0,/Min), ResolveIntent [/Table/61/2/59186311/198666791/0,/Min), ResolveIntent [/Table/61/2/59186321/198666571/0,/Min), ResolveIntent [/Table/61/2/59186351/198667641/0,/Min), ResolveIntent [/Table/61/2/59186391/198665461/0,/Min), ResolveIntent [/Table/61/2/59186551/198669731/0,/Min), ResolveIntent [/Table/61/2/59186571/198667211/0,/Min), ResolveIntent [/Table/61/2/59186641/198668091/0,/Min), ResolveIntent [/Table/61/2/59186671/198666151/0,/Min), ResolveIntent [/Table/61/2/59186821/198667581/0,/Min), ResolveIntent [/Table/61/2/59186851/198669921/0,/Min), ResolveIntent [/Table/61/2/59186871/198668701/0,/Min), ResolveIntent [/Table/61/2/59186941/198669971/0,/Min), ResolveIntent [/Table/61/2/59187141/198668621/0,/Min), ResolveIntent [/Table/61/2/59187191/198668541/0,/Min), ResolveIntent [/Table/61/2/59187271/198669761/0,/Min), ResolveIntent [/Table/61/2/59187281/198669581/0,/Min), ResolveIntent [/Table/61/2/59187441/198667191/0,/Min), ResolveIntent [/Table/61/2/59187771/198668571/0,/Min), ResolveIntent [/Table/61/2/59187891/198665831/0,/Min), ... 75 skipped ..., ResolveIntent [/Table/61/2/59190001/198668871/0,/Min), ResolveIntent [/Table/61/2/59190011/198668981/0,/Min), ResolveIntent [/Table/61/2/59190061/198669031/0,/Min), ResolveIntent [/Table/61/2/59190071/198669051/0,/Min), ResolveIntent [/Table/61/2/59190081/198669061/0,/Min)
W171012 14:43:27.020201 373139 storage/intent_resolver.go:498 [gc,n3,s15,r2086/6:/Table/61/1/{196359…-205744…}] unable to resolve external intents: context deadline exceeded
W171012 14:43:27.020412 373140 storage/intent_resolver.go:498 [gc,n3,s15,r2086/6:/Table/61/1/{196359…-205744…}] unable to resolve external intents: context deadline exceeded
W171012 14:43:27.020514 373138 storage/intent_resolver.go:498 [gc,n3,s15,r2086/6:/Table/61/1/{196359…-205744…}] unable to resolve external intents: context deadline exceeded
E171012 14:43:27.113768 599 storage/queue.go:656 [gc,n3,s15,r2086/6:/Table/61/1/{196359…-205744…}] context timed out during local key range processing after: context deadline exceeded
I171012 14:43:29.976033 699 server/status/runtime.go:223 [n3] runtime stats: 4.0 GiB RSS, 573 goroutines, 181 MiB/268 MiB/585 MiB GO alloc/idle/total, 2.9 GiB/3.4 GiB CGO alloc/total, 43747.90cgo/sec, 6.91/0.15 %(u/s)time, 0.00 %gc (15x)
Besides the high CPU load, everything seems to be normal.