FreeUKGen / Systemwide

Repository for issues that affect all systems within the Free UK Gen portfolio

Retryable write with txnNumber 258 is prohibited on session #245

Closed: Captainkirkdawson closed this issue 2 years ago

Captainkirkdawson commented 2 years ago

We appear to have a communications issue somewhere in the MongoDB server cluster that is creating a significant number of errors on errbit. These crashes are impacting users. Unfortunately I have no way of determining where the root cause lies or what it is; clearly one of the cluster members is acting up. The errors occur only with the FreeREG app and started 20 hours ago. Both apps have essentially the same configuration. @lemon has commented: are we sure it's a communication issue?

It could be configuration or bad concurrency, etc.; e.g. an early hit for that error message suggests we check out the Mongoid / Passenger config: https://stackoverflow.com/questions/68806406/how-to-fix-mongoerroroperationfailure and https://docs.mongodb.com/mongoid/current/tutorials/mongoid-configuration/#passenger
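
In case it helps with that check, here is a rough way to see what the app's Mongoid client is actually running with, from a Rails console on one of the app servers (a sketch only; it assumes the default client):

client = Mongoid.default_client
puts client.options[:read].inspect          # read preference currently in effect
puts client.options[:retry_writes].inspect  # nil/true means retryable writes are enabled
puts client.cluster.servers.map { |s| s.address.to_s }.inspect  # servers the driver sees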

lemon-ukgen commented 2 years ago

With further context I don't think it's related to the configuration issue in the linked SO article. Please ignore that.

I can't find any sign of a network issue inside the MongoDB cluster or between the application servers and the cluster.

Captainkirkdawson commented 2 years ago

The issue is causing hundreds of search crashes daily on FreeREG; this is a critical issue. 224604696 Website Problem (Pete Daniels) #1463

Captainkirkdawson commented 2 years ago

Hundreds of crashes are continuing on FreeREG on 14 Feb

Captainkirkdawson commented 2 years ago

@Vino-S @lemon-ukgen I remain convinced that this is a mongodb1 server issue. One way of confirming or rejecting this would be to have one of the other servers take over as primary. Currently mongodb1 will win any election as it has a priority of 5 (see https://drive.google.com/drive/u/1/folders/0BykgQKwJtk6faHhUWnRsWklNWWc?resourcekey=0-UVGgj7TtT9GHRo0KHiMVxA). If this were reset to 3 and we asked mongodb1 to step down, then mongodb2, which is in a different data centre in Germany, would take over. I leave it to you to try if you think it appropriate.

lemon-ukgen commented 2 years ago

I have reminded myself how priority works!

I can see @Vino-S is working on the MongoDB cluster and I won't take any action unless asked. But I think I now know how to :)

I extended mongo-shard-status to include priority bits of rs.conf() at the end:

[lemon@mongodb1 ~]$ /root/bin/mongo-shard-status
{
  "shard": "cluster1",
  "member": {
    "name": "mongodb1.internal.freeukgen.org.uk:27017",
    "role": "PRIMARY",
    "syncsource": "",
    "optime": "Timestamp(1644873046, 8)"
  }
}
{
  "shard": "cluster1",
  "member": {
    "name": "mongodb2.internal.freeukgen.org.uk:27017",
    "role": "SECONDARY",
    "syncsource": "mongodb1.internal.freeukgen.org.uk:27017",
    "optime": "Timestamp(1644873046, 8)"
  }
}
{
  "shard": "cluster1",
  "member": {
    "name": "mongodb3.internal.freeukgen.org.uk:27017",
    "role": "SECONDARY",
    "syncsource": "mongodb2.internal.freeukgen.org.uk:27017",
    "optime": "Timestamp(1644873046, 8)"
  }
}
{
  "shard": "cluster1",
  "member": {
    "name": "mongodb4.internal.freeukgen.org.uk:27017",
    "role": "SECONDARY",
    "syncsource": "mongodb2.internal.freeukgen.org.uk:27017",
    "optime": "Timestamp(1644873046, 8)"
  }
}
{
  "shard": "cluster1",
  "member": {
    "name": "mongodb5.internal.freeukgen.org.uk:27017",
    "role": "SECONDARY",
    "syncsource": "mongodb2.internal.freeukgen.org.uk:27017",
    "optime": "Timestamp(1644873046, 8)"
  }
}
{
  "host": "mongodb1.internal.freeukgen.org.uk:27017",
  "priority": 5,
  "votes": 1
}
{
  "host": "mongodb2.internal.freeukgen.org.uk:27017",
  "priority": 4,
  "votes": 1
}
{
  "host": "mongodb3.internal.freeukgen.org.uk:27017",
  "priority": 1,
  "votes": 1
}
{
  "host": "mongodb4.internal.freeukgen.org.uk:27017",
  "priority": 3,
  "votes": 1
}
{
  "host": "mongodb5.internal.freeukgen.org.uk:27017",
  "priority": 2,
  "votes": 1
}
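
For reference, roughly the same role and priority information can be pulled through the Ruby driver rather than the shell script; this is a sketch only (the host is a placeholder, and it is not what mongo-shard-status does internally):

require "mongo"

client = Mongo::Client.new(["mongodb1.example.org:27017"], connect: :direct)
admin  = client.use("admin").database

# rs.status() equivalent: member names and their current roles
status = admin.command(replSetGetStatus: 1).documents.first
status["members"].each { |m| puts "#{m['name']} #{m['stateStr']}" }

# rs.conf() equivalent: the priority/votes bits shown above
conf = admin.command(replSetGetConfig: 1).documents.first
conf["config"]["members"].each do |m|
  puts "#{m['host']} priority=#{m['priority']} votes=#{m['votes']}"
end
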
Captainkirkdawson commented 2 years ago

@lemon-ukgen @Vino-S It is a simple change:

cfg = rs.conf()
cfg.members[0].priority = 3
rs.reconfig(cfg)
rs.stepDown()

Captainkirkdawson commented 2 years ago

Vino reports that the change made no difference, confirming it is not a server malfunction but something we are now doing.

Captainkirkdawson commented 2 years ago

I have done an analysis of the onset of the issue and looked at some of our settings that have not been changed.

REG:
- howler: first failure Feb 07 05:57:25 (few); last software update 2022-02-07T05:18:48+00:00; secondary_preferred 4,5,1,2,3
- drill: first failure Feb 07 09:21:41 (minimal); last software update 2022-02-07T05:24:40; secondary_preferred 3,4,5,1,2
- saki: first failure Feb 07 10:45:44 (very few); last software update 2022-02-07T05:10:07+00:00; primary_preferred 5,1,2,3,4
- colobus: first failure Feb 07 11:23:18 (thousands); last software update 2022-02-07T05:31:12+00:00; primary_preferred 1,2,3,4,5
- brazza: first failure Feb 07 12:13:52 (very few); last software update 2022-02-07T05:15:43+00:00; secondary_preferred 2,3,4,5,1

CEN:
- colobus: first failure Feb 06 22:45:17 (very very few); last software update 2022-02-10T16:15:04+00:00 (previous update time unknown); primary_preferred 1,2,3,4,5
- drill: no failures; last software update 2022-02-05T23:04:46+00:00; secondary_preferred 3,4,5,1,2
- brazza: no failures; last software update 2022-02-11T01:54:05+00:00; secondary_preferred 2,3,4,5,1
- saki: no failures; last software update 2022-02-05T22:57:03+00:00; secondary_preferred 5,2,3,4,1
- howler: not reporting errors; last software update 2022-02-05T23:04:46+00:00; secondary_preferred 4,5,1,2,3

Captainkirkdawson commented 2 years ago

It seems that the issue started after the deployment of 2022-02-07T05:10:07+00:00. It affects primarily REG and occasionally CEN. It has not affected saki and drill on CEN, which are both running the earlier code base. It is also not affecting brazza on CEN even though it was updated; however, brazza is running secondary_preferred.

The vast majority of the errors involve reading search results immediately after writing them; these are caught, are only seen via log analysis and by the end user, and are not logged to errbit. Those being logged to errbit are an eclectic selection of reads from a variety of collections (e.g. freereg_records), but almost all occur during a write. Colobus is the administrator server and is therefore doing more writes to all collections, and REG is far more active with writes than CEN.
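
For illustration only, this is the shape of that write-then-read pattern with a made-up Mongoid model and placeholder connection details (the real models live in MyopicVicar and are far more involved):

require "mongoid"

# Placeholder connection details, not our real cluster.
Mongoid.configure do |config|
  config.clients.default = { hosts: ["localhost:27017"], database: "example_db" }
end

class SavedSearch
  include Mongoid::Document
  field :query,   type: Hash
  field :records, type: Array, default: []
end

search = SavedSearch.create!(query: { surname: "Smith" })  # write the search
search.set(records: %w[rec1 rec2])                         # write the results
SavedSearch.find(search.id).records                        # read them straight back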

Captainkirkdawson commented 2 years ago

I have changed the read preference on colobus to secondary_preferred but still 1,2,3,4,5. Will evaluate
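
For the record, secondary_preferred is just a read-preference mode in the client options; a rough Ruby-side sketch of the equivalent of the mongoid.yml setting, with placeholder hosts and database rather than our real values:

require "mongoid"

Mongoid.configure do |config|
  config.clients.default = {
    hosts:    ["mongodb1.example.org:27017", "mongodb2.example.org:27017"],
    database: "example_db",
    options:  { read: { mode: :secondary_preferred } }  # reads may go to secondaries
  }
end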

Captainkirkdawson commented 2 years ago

Continues to occur. Looking at the log:

FREEREG:SEARCH: Search 620c3820f4040b182f2b9586 had a problem Retryable write with txnNumber 42 is prohibited on session 7fd511bb-daa3-4ffb-8620-43b35e517e53 - 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= because a newer retryable write with txnNumber 1175 has already started on this session. (225) (on mongodb2.internal.freeukgen.org.uk, modern retry, attempt 1)

Captainkirkdawson commented 2 years ago

Looking at the errbit errors, https://github.com/FreeUKGen/MyopicVicar/blob/master/app/controllers/freereg_contents_controller.rb#L33 is the action that seems to cause a number of these issues, but why? It has been in use that way for 3 years!! https://github.com/FreeUKGen/MyopicVicar/blob/master/app/controllers/open_controller.rb#L55 is another save of a set of user-submitted parameters. It is 4 years old! Perhaps it is associated with retryable writes to different collections being started while the first was still underway. Since it appears everything is being done within the same session, multiple writes to different collections would be problematic. But why now?? It is now beyond me. I cannot see that we have changed anything that would start this, but the analysis says we did. @lemon-ukgen you didn't change Passenger or anything else at that time, perchance? @Vino-S perhaps you could try setting the retryable writes flag to false, as suggested.
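
For reference, the retryable-writes behaviour is the driver's retry_writes client option; in Mongoid it would sit with the other client options in mongoid.yml, but the bare-driver sketch below shows the idea (placeholder host, database and collection, not our real config):

require "mongo"

# retry_writes: false turns off the driver's retryable-write machinery,
# so no txnNumber bookkeeping happens on the session.
client = Mongo::Client.new(
  ["mongodb1.example.org:27017"],
  database:     "example_db",
  retry_writes: false
)
client[:searches].insert_one(surname: "Smith")  # plain, non-retryable write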

lemon-ukgen commented 2 years ago

@Captainkirkdawson I've made no changes to either Apache/Passenger or MongoDB servers this year.

Maybe worth checking out that How to fix: Mongo::Error::OperationFailure link above after all!

The "further context", for ignoring it, was the extra info that there had been a recent code change... but if we're satisfied that it's not down to a code change then I'd say its worth a look, if only to rule it out. Concurrency, overlapping transactions, forking servers... sounds expedient to re-connect each child individually.

Captainkirkdawson commented 2 years ago

@lemon-ukgen I was pretty sure there had been no changes to Apache/Passenger but had to check. WRT your suggestion, I do not understand where that code would go. All we have in config.ru is:

require ::File.expand_path('../config/environment', __FILE__)
run MyopicVicar::Application

Can you discuss with @Vino-S and let's give it a whirl.

lemon-ukgen commented 2 years ago

I think it should go between the require and the run line, but ... I'm no Rails developer!

Something like:

require ::File.expand_path('../config/environment', __FILE__)

if defined?(PhusionPassenger)
  # Runs in every worker process Passenger starts; |forked| is true when the
  # worker was forked from a preloaded parent (smart spawning).
  PhusionPassenger.on_event(:starting_worker_process) do |forked|
    # Drop the connections inherited from the parent and reconnect, so each
    # worker gets its own sockets and sessions.
    Mongoid::Clients.clients.each do |name, client|
      client.close
      client.reconnect
    end
  end
end

run MyopicVicar::Application

Probably best to try this on test2 / test3 first.

Vino-S commented 2 years ago

Implemented Lemon's suggestion in live for FreeREG. It's been an hour since the Colobus implementation. There has been no error reported in FreeREG since the implementation. Should we proceed with the deployment of the changes in CEN too?

Captainkirkdawson commented 2 years ago

@Vino-S Sorry to be dumb, but you say you implemented Lemon's suggestion on FreeREG live for colobus. How? I see no commits to master and no change in config.ru on colobus. Ah, I see it in environment.rb. You just did colobus, saki, howler and drill, but not brazza.

I see in the log that the "FREEREG:SEARCH: Search had a problem Retryable write" messages stopped after the last one at 2022-02-16 10:28:16 +0000, so it would appear that the change was effective for colobus, and similarly for saki and howler. brazza reported one at 2022-02-16 11:19:25 +0000.

@Vino-S It seems to me that the change is effective; should it be committed to master and deployed to all servers for both REG and CEN? I should have listened, @lemon-ukgen, but I still do not understand why it only appeared after that last update. Back away Kirk!!!!

Vino-S commented 2 years ago

Thanks for the solution @lemon-ukgen. Deployed the changes to live for both CEN and REG. @Captainkirkdawson, re "I still do not understand why it only appeared after that last update": I am not sure about CEN, but I ran tmp:cache:clear in the last FreeREG deployment. Maybe this was the reason.

Captainkirkdawson commented 2 years ago

@Vino-S thanks. Since I have no idea what tmp:cache:clear actually does or what its implications are, who knows. I have never done it, so possibly.

DeniseColbert commented 2 years ago

Done and fixed so closing