With further context I don't think it's related to the configuration issue in the linked SO article. Please ignore that.
I can't find any sign of a network issue inside the MongoDB cluster or between the application servers and the cluster.
The issue is causing hundreds of search crashes daily on FreeREG; this is a critical issue. 224604696 Website Problem (Pete Daniels) #1463
Hundreds of crashes are continuing on FreeREG on 14 Feb
@Vino-S @lemon-ukgen I remain convinced that this is a mongodb1 server issue. One way of confirming or rejecting this would be to have one of the other servers take over as primary. Currently mongodb1 will become primary in any election as it has a priority of 5. See https://drive.google.com/drive/u/1/folders/0BykgQKwJtk6faHhUWnRsWklNWWc?resourcekey=0-UVGgj7TtT9GHRo0KHiMVxA If this were reset to 3 and we asked mongodb1 to step down, then mongodb2, which is in a different data centre in Germany, would take over. I leave it to you to try if you think it appropriate.
I have reminded myself how priority works!
I can see @Vino-S is working on the MongoDB cluster and I won't take any action unless asked. But I think I now know how to :)
I extended mongo-shard-status to include the priority bits of rs.conf() at the end (output below, followed by a sketch of how those fields can be read):
[lemon@mongodb1 ~]$ /root/bin/mongo-shard-status
{
"shard": "cluster1",
"member": {
"name": "mongodb1.internal.freeukgen.org.uk:27017",
"role": "PRIMARY",
"syncsource": "",
"optime": "Timestamp(1644873046, 8)"
}
}
{
"shard": "cluster1",
"member": {
"name": "mongodb2.internal.freeukgen.org.uk:27017",
"role": "SECONDARY",
"syncsource": "mongodb1.internal.freeukgen.org.uk:27017",
"optime": "Timestamp(1644873046, 8)"
}
}
{
"shard": "cluster1",
"member": {
"name": "mongodb3.internal.freeukgen.org.uk:27017",
"role": "SECONDARY",
"syncsource": "mongodb2.internal.freeukgen.org.uk:27017",
"optime": "Timestamp(1644873046, 8)"
}
}
{
"shard": "cluster1",
"member": {
"name": "mongodb4.internal.freeukgen.org.uk:27017",
"role": "SECONDARY",
"syncsource": "mongodb2.internal.freeukgen.org.uk:27017",
"optime": "Timestamp(1644873046, 8)"
}
}
{
"shard": "cluster1",
"member": {
"name": "mongodb5.internal.freeukgen.org.uk:27017",
"role": "SECONDARY",
"syncsource": "mongodb2.internal.freeukgen.org.uk:27017",
"optime": "Timestamp(1644873046, 8)"
}
}
{
"host": "mongodb1.internal.freeukgen.org.uk:27017",
"priority": 5,
"votes": 1
}
{
"host": "mongodb2.internal.freeukgen.org.uk:27017",
"priority": 4,
"votes": 1
}
{
"host": "mongodb3.internal.freeukgen.org.uk:27017",
"priority": 1,
"votes": 1
}
{
"host": "mongodb4.internal.freeukgen.org.uk:27017",
"priority": 3,
"votes": 1
}
{
"host": "mongodb5.internal.freeukgen.org.uk:27017",
"priority": 2,
"votes": 1
}
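For the record, I have not reproduced the mongo-shard-status script itself; the following is only a rough Ruby sketch of how the priority/votes section above could be produced, using the replSetGetConfig admin command that backs rs.conf(). The host name is taken from the output above; everything else is illustrative.

require 'mongo'
require 'json'

# Connect directly to one member and fetch the replica set configuration
# (the same document rs.conf() returns in the mongo shell).
client = Mongo::Client.new(['mongodb1.internal.freeukgen.org.uk:27017'],
                           connect: :direct, database: 'admin')
config = client.database.command(replSetGetConfig: 1).documents.first['config']

# Print just the election-relevant bits for each member.
config['members'].each do |member|
  puts JSON.pretty_generate({ 'host'     => member['host'],
                              'priority' => member['priority'],
                              'votes'    => member['votes'] })
end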
@lemon-ukgen @Vino-S It is a simple change:
cfg = rs.conf()
cfg.members[0].priority = 3
rs.reconfig(cfg)
rs.stepDown()
Vino reports that the change made no difference confirming it is not a server malfunction but something we are now doing.
I have done an analysis of the onset of the issue and looked at some of our settings that have not been changed.
First failure of this occurred on howler with REG at Feb 07 05:57:25, few ; Last software update 2022-02-07T05:18:48+00:00 ; secondary_preferred 4,5,1,2,3
First failure of this occurred on drill with REG at Feb 07 09:21:41, minimal ; Last software update 2022-02-07T05:24:40 ; secondary_preferred 3,4,5,1,2
First failure of this occurred on saki with REG at Feb 07 10:45:44, very few ; Last software update 2022-02-07T05:10:07+00:00 ; primary_preferred 5,1,2,3,4
First failure of this occurred on colobus with REG at Feb 07 11:23:18, thousands ; Last software update 2022-02-07T05:31:12+00:00 ; primary_preferred 1,2,3,4,5
First failure of this occurred on brazza with REG at Feb 07 12:13:52, very few ; Last software update 2022-02-07T05:15:43+00:00 ; secondary_preferred 2,3,4,5,1
First failure of this occurred on colobus with CEN at Feb 06 22:45:17, very very few ; Last software update 2022-02-10T16:15:04+00:00 ; primary_preferred 1,2,3,4,5 ; do not know previous update time
No failure on drill with CEN ; Last software update 2022-02-05T23:04:46+00:00 ; secondary_preferred 3,4,5,1,2
No failure on brazza with CEN ; Last software update 2022-02-11T01:54:05+00:00 ; secondary_preferred 2,3,4,5,1
No failure on saki with CEN ; Last software update 2022-02-05T22:57:03+00:00 ; secondary_preferred 5,2,3,4,1
howler not reporting errors on CEN ; Last software update 2022-02-05T23:04:46+00:00 ; secondary_preferred 4,5,1,2,3
It seems that the issue started after the deployment of 2022-02-07T05:10:07+00:00. It affects primarily REG and occasionally CEN. It has not affected saki and drill on CEN, which are both running the earlier code base. It is not affecting brazza on CEN even though it has been updated; however, brazza is running secondary_preferred. The vast majority of the errors involve reading search results after writing the search results; these are caught and only seen via log analysis and by the end user, and they are not logged to errbit. Those being logged to errbit are an eclectic selection of reads from a variety of collections, e.g. freereg_records, but almost all occur during a write. Colobus is the administrator server and is therefore doing more writes to all collections, and REG is far more active with writes than CEN.
I have changed the read preference on colobus to secondary_preferred but still 1,2,3,4,5. Will evaluate
Continues to occur. Looking at the log:
FREEREG:SEARCH: Search 620c3820f4040b182f2b9586 had a problem Retryable write with txnNumber 42 is prohibited on session 7fd511bb-daa3-4ffb-8620-43b35e517e53 - 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= because a newer retryable write with txnNumber 1175 has already started on this session. (225) (on mongodb2.internal.freeukgen.org.uk, modern retry, attempt 1)
Looking at errbit errors, https://github.com/FreeUKGen/MyopicVicar/blob/master/app/controllers/freereg_contents_controller.rb#L33 is the action that seems to cause a number of these issues, but why? It has been in use that way for 3 years!! https://github.com/FreeUKGen/MyopicVicar/blob/master/app/controllers/open_controller.rb#L55 is another save of a set of user-submitted parameters. It is 4 years old! Perhaps it is associated with very different retryable writes for different collections being started while the first was still underway. Since everything appears to be done within the same session, multiple different collection writes would appear to be problematic. But why now?? It is now beyond me. I cannot see that we have changed anything that would start this, but the analysis says we did. @lemon-ukgen you didn't change passenger or anything else at that time perchance? @Vino-S perhaps you could try setting the retryable writes flag to false as suggested.
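I am not certain where the project would actually set this, but at the Ruby driver level the flag is the retry_writes client option; in a Mongoid app the same flag would normally live under the client options in config/mongoid.yml (retry_writes: false). A minimal sketch, with an illustrative database name only:

require 'mongo'

# Hedged sketch, not the project's actual configuration:
# retry_writes: false turns off the "modern retry" behaviour named in the error above.
client = Mongo::Client.new(
  ['mongodb1.internal.freeukgen.org.uk:27017'],  # host taken from the cluster output above
  database: 'freereg2_production',               # hypothetical database name
  retry_writes: false
)

As I understand it, with retryable writes disabled the driver no longer attaches txnNumbers to the session, so the txnNumber clash reported above cannot occur.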
@Captainkirkdawson I've made no changes to either Apache/Passenger or MongoDB servers this year.
Maybe worth checking out that "How to fix: Mongo::Error::OperationFailure" link above after all!
The "further context", for ignoring it, was the extra info that there had been a recent code change... but if we're satisfied that it's not down to a code change then I'd say its worth a look, if only to rule it out. Concurrency, overlapping transactions, forking servers... sounds expedient to re-connect each child individually.
@lemon-ukgen I was pretty sure there had been no changes to apache/passenger but had to check. WRT your suggestion, I do not understand where that code would go. All we have in config.ru is:

require ::File.expand_path('../config/environment', __FILE__)
run MyopicVicar::Application

Can you discuss with @Vino-S and let's give it a whirl.
I think it should go between the require and the run lines, but ... I'm no Rails developer!
Something like:
require ::File.expand_path('../config/environment', __FILE__)

if defined?(PhusionPassenger)
  # Each time Passenger starts a worker process, close any MongoDB client
  # connections inherited from the preloading parent and reconnect in this process.
  PhusionPassenger.on_event(:starting_worker_process) do |forked|
    Mongoid::Clients.clients.each do |name, client|
      client.close
      client.reconnect
    end
  end
end

run MyopicVicar::Application
Probably best to try this on test2 / test3 first.
Implemented Lemon's suggestion in live for FreeREG. It's been an hour since the Colobus implementation. There has been no error reported in FreeREG since the implementation. Should we proceed with the deployment of the changes in CEN too?
@Vino-S Sorry to be dumb, but you say you implemented Lemon's suggestion on FreeREG live for colobus. How? I see no commits to master and no change in config.ru on colobus. Ah, I see it in environment.rb. You just did colobus, saki, howler and drill, but not brazza. I see in the log that the FREEREG:SEARCH: Search had a problem Retryable write messages stopped after the last one at 2022-02-16 10:28:16 +0000, so it would appear that the change was effective for colobus, and similarly for saki and howler. brazza reported one at 2022-02-16 11:19:25 +0000. @Vino-S It seems to me that the change is effective; should it be committed to master and deployed to all servers for both REG and CEN? I should have listened, @lemon-ukgen, but I still do not understand why it only appeared after that last update. Back away Kirk!!
Thanks for the solution @lemon-ukgen
Deployed the changes to live for both CEN and REG.
@Captainkirkdawson: "I still do not understand why it only appeared after that last update"
I am not sure about CEN, but I ran tmp:cache:clear in the last FreeREG deployment. Maybe this was the reason.
@Vino-S thanks. Since I have no idea what tmp:cache:clear actually does or what its implications are, who knows. I have never done it myself, so possibly.
Done and fixed so closing
We appear to have a communications issue somewhere in the mongodb server cluster that is creating a significant number of errors on errbit. These crashes are impacting users. Unfortunately I have no way of determining where the problem lies or what the root cause is; clearly one of the cluster members is acting up. The errors occur only with the freereg app and started 20 hrs ago. Both apps have essentially the same configuration. @lemon has commented: are we sure it's a communication issue?
It could be configuration or bad concurrency etc.; e.g., an early hit for that error message suggests we check out the mongoid / passenger config: https://stackoverflow.com/questions/68806406/how-to-fix-mongoerroroperationfailure https://docs.mongodb.com/mongoid/current/tutorials/mongoid-configuration/#passenger