Azure / azure-postgresql

Azure Database for PostgreSQL (fully managed service for PostgreSQL in Azure)

az postgres flexible-server upgrade InternalServerError #110

Open charalamm opened 1 year ago

charalamm commented 1 year ago

Hello,

I am trying to upgrade a postgres flexible-server from v11 to a newer version but I always get: (InternalServerError) An unexpected error occured while processing the request. Tracking ID: '')

More specifically:

fsismondi commented 1 year ago

Same issue here; support has been terribly bad at handling this case for us.

jbarascut commented 1 year ago

Hello @charalamm, I have the same issue with a flexible-server going from v11 to v14. Azure support isn't helping me.

fsismondi commented 1 year ago

@charalamm have you managed to get some info about this?

charalamm commented 1 year ago

@fsismondi I have talked with support and they said it is because of a bug on their end. They said they will fix it for the next release around mid November

jvenant-up commented 11 months ago

Hello, any update on this?

mrpotato3 commented 9 months ago

+1

wohnout commented 9 months ago

Having the same issue upgrading from v12 to v15. It takes ages to resolve.

MagnusJohansson commented 7 months ago

> @fsismondi I have talked with support and they said it is because of a bug on their end. They said they will fix it for the next release around mid November

Did they say which year?

It's now April 2024, I just tried to upgrade 14 to 16:

{
  "code": "InternalServerError",
  "message": "An unexpected error occured while processing the request. Tracking ID: '3b54f416-f0a2-40a3-83ad-e9aa736f08ed'"
}

wohnout commented 7 months ago

I finally managed to do the upgrade today, after 38 days with support... Incredible service.

fsismondi commented 7 months ago

We are in a situation where we need to upgrade 10+ servers to v14. Creating new ones and doing a backup/restore would mean a lot of man-hours.

Disabling extensions does not seem to be enough to make the procedure work.

Could somebody from @microsoftopensource, maybe @ramnov @rachel-msft @ambrahma, provide some info on what is happening here? Support has been useless regarding this issue.

koushikchitta commented 7 months ago

@charalamm , @fsismondi we have a known issue of major version upgrade failures due to timeouts when the server has a large number of databases and/or schemas. We are working on a fix with priority. Sorry for the inconvenience caused. Can you raise a support ticket for your servers and share it here? I will personally follow up to make sure they are addressed.
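
For reference, a quick way to check whether a server falls into that category; the connection details are placeholders:

# Count non-template databases on the server
psql "host=<server>.postgres.database.azure.com dbname=postgres user=<admin> sslmode=require" \
  -c "SELECT count(*) AS databases FROM pg_database WHERE NOT datistemplate;"

# Count schemas inside a given database
psql "host=<server>.postgres.database.azure.com dbname=<yourdb> user=<admin> sslmode=require" \
  -c "SELECT count(*) AS schemas FROM information_schema.schemata;"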

wohnout commented 7 months ago

Mine is 2402280050000780 and I would like to know what happened and whether it has been resolved in a way that it will not happen again.

fsismondi commented 7 months ago

Ours is 2312180050002992, though this ticket was closed. Support was, put plainly, useless. We are interested in a safe and reproducible procedure we can follow to start migrating all our servers.

MichelZ commented 7 months ago

Yeah, I'm currently doing manual migrations because of this, and it's a PITA

lieberlois commented 7 months ago

Same issue here 👍

mecostav commented 7 months ago

I don't have a support ticket, but I have 2 Postgres instances which I'm unable to upgrade to 16. Any ETA on the fix?

benoittgt commented 7 months ago

Apparently they are working on it at the moment. https://ruby.social/@clairegiordano@hachyderm.io/112254606338198662

sergiuser1 commented 7 months ago

Still not fixed after 8 months

lieberlois commented 7 months ago

After contact with Azure Premium Support we were told they will fix this at the end of April. It's a problem on Microsoft's side 👍

koushikchitta commented 7 months ago

This is a high-level error with different underlying issues. I would request others to raise support tickets as well so they can be addressed. Add your ASC ticket number here if you don't get traction and I will prioritize it.

tanadeau commented 6 months ago

I just tried to do an upgrade again and am still receiving the same error. Have there been any updates on the underlying issue?

charalamm commented 6 months ago

We contacted support and they did something to our databases that had this issue which allowed the upgrade to go through. So I guess it's solvable via support.

benoittgt commented 6 months ago

Just tested here and it worked.

StrangeWill commented 6 months ago

We just ran into this tonight trying to upgrade 13 -> 16 and 13 -> 14; now trying 13 -> 15, because we really need a fix for a bug that was resolved in Postgres v14.

JKolios commented 5 months ago

As a heads-up: I ran into this issue last week while trying to upgrade from Postgres 11 to 13. Azure support told me to, quote:

We request you to try the upgrade operation after following the below steps:
1. Please drop the extension “hypopg” in database level.
2. Then disable the “hypopg” extension in server. To disable follow the below:
Server parameter-->azure.extensions-->hypopg

After performing the above two steps, please try to upgrade the server and let us know if you face any issues.

And this worked fine. I was also able to re-install hypopg afterwards.
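
Roughly, those two steps map onto the commands below; the server, database, resource group and the remaining azure.extensions allow-list are placeholders to adjust for your setup:

# 1. Drop the extension in every database that has it
psql "host=<server>.postgres.database.azure.com dbname=<db> user=<admin> sslmode=require" \
  -c "DROP EXTENSION IF EXISTS hypopg;"

# 2. Remove hypopg from the azure.extensions allow-list (set the value to whatever should remain)
az postgres flexible-server parameter set \
  --resource-group <rg> --server-name <server> \
  --name azure.extensions --value "PG_STAT_STATEMENTS,PG_TRGM"

# 3. Retry the major version upgrade
az postgres flexible-server upgrade --resource-group <rg> --name <server> --version 13

Afterwards hypopg can be added back to azure.extensions and re-created with CREATE EXTENSION hypopg;.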

ahjaworski commented 3 months ago

We also encountered this problem. We use the pgrouting extension, which cannot always be upgraded. We had to do the following steps for the upgrade (sketched below):

  1. Drop the pgrouting extension on databases
  2. Perform the database upgrade in Azure Portal
  3. Re-add the extension to the databases
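
With placeholder names, that sequence looks roughly like this (the Portal upgrade in step 2 can equally be triggered from the CLI):

# 1. Drop pgrouting in each database that uses it (add CASCADE if dependent objects exist)
psql "host=<server>.postgres.database.azure.com dbname=<db> user=<admin> sslmode=require" \
  -c "DROP EXTENSION IF EXISTS pgrouting;"

# 2. Perform the major version upgrade
az postgres flexible-server upgrade --resource-group <rg> --name <server> --version 16

# 3. Re-create the extension in each database
psql "host=<server>.postgres.database.azure.com dbname=<db> user=<admin> sslmode=require" \
  -c "CREATE EXTENSION pgrouting;"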

Aerodynamite commented 3 months ago

Same issue here, even when testing with a brand new 12.19 PostgreSQL database without any extensions. Tried multiple times; upgrades to 13, 14, 15 and 16 all failed.

MichelZ commented 3 months ago

I have a new case open for 2 servers (one v14, one v15) which both refuse to upgrade. They are currently escalating to the product team. We have enabled the upgrade logs and get:

The source cluster was not shut down cleanly. Failure, exiting

There are no extensions installed

lieberlois commented 3 months ago

FYI: we successfully migrated with the Terraform Provider from PG 11 to PG 16 this weekend 😄

Aerodynamite commented 3 months ago

I just tried it using Terraform as well, without success. Below is the full Terraform code I used. I applied it first with create_mode "Default" and version "12", then made the changes indicated in the comments and applied again.

Terraform

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~>3.116.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~>3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

data "azurerm_resource_group" "rg" {
  name = "rg-wds-dev-weu-001"
}

resource "azurerm_postgresql_flexible_server" "psql_upgrade_test" {
  name                = "psql-wds-upgrade-test-terraform-001"
  resource_group_name = data.azurerm_resource_group.rg.name
  location            = "westeurope"

  backup_retention_days        = 7
  geo_redundant_backup_enabled = false
  create_mode                  = "Update" # "Default" on the first apply
  version                      = "16"     # "12" on the first apply
  storage_mb                   = 32768
  sku_name                     = "B_Standard_B1ms"
  zone                         = 2

  administrator_login    = "thisismyadmin"
  administrator_password = "super-secret-password"

  public_network_access_enabled = true
}

resource "azurerm_postgresql_flexible_server_database" "psqldb_testdatabase" {
  name      = "testdatabase"
  server_id = azurerm_postgresql_flexible_server.psql_upgrade_test.id
  collation = "en_US.utf8"
  charset   = "utf8"
}

resource "azurerm_postgresql_flexible_server_firewall_rule" "psqlfr_azure_services" {
  name             = "Allow-public-azure-service-access"
  server_id        = azurerm_postgresql_flexible_server.psql_upgrade_test.id
  start_ip_address = "0.0.0.0"
  end_ip_address   = "0.0.0.0"
}

Output

╷
│ Error: updating Flexible Server (Subscription: "<redacted>"
│ Resource Group Name: "rg-wds-dev-weu-001"
│ Flexible Server Name: "psql-wds-upgrade-test-terraform-001"): polling after Update: polling failed: the Azure API returned the following error:
│ 
│ Status: "InternalServerError"
│ Code: ""
│ Message: "An unexpected error occured while processing the request. Tracking ID: 'd00608f4-b633-41a2-af69-557cd4ee258c'"
│ Activity Id: ""
│ 
│ ---
│ 
│ API Response:
│ 
│ ----[start]----
│ {"name":"f3a142f9-aacd-4eb2-a172-f84dd899a991","status":"Failed","startTime":"2024-08-27T09:00:24.43Z","error":{"code":"InternalServerError","message":"An unexpected error occured while processing the request. Tracking ID: 'd00608f4-b633-41a2-af69-557cd4ee258c'"}}
│ -----[end]-----
│ 
│ 
│   with azurerm_postgresql_flexible_server.psql_smartlab_api,
│   on main.tf line 23, in resource "azurerm_postgresql_flexible_server" "psql_upgrade_test":
│   23: resource "azurerm_postgresql_flexible_server" "psql_upgrade_test" {
│ 
╵

lieberlois commented 3 months ago

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~>3.0"
    }
  }
}

This is pretty old - we used "~> 3.116.0" - maybe that helps?

Aerodynamite commented 3 months ago

Just tried it with 3.116.0 as well but the issue is the same. I have reached out to Microsoft support as well and they are currently investigating the issue.

ttichy commented 2 months ago

Still getting this error while trying to upgrade PG flexible server via the portal

wohnout commented 2 months ago

We had an issue with the upgrade recently; doing a restart and running the upgrade the next day resolved it.

Aerodynamite commented 2 months ago

After a lot of communication with Microsoft Support, they were finally able to upgrade my instance. Here is the feedback I received from them:

-> Initially you experienced an MVU (Maintenance and Version Upgrade) failure because the pending_restart parameter was set to true; this means the server needed a restart before the upgrade could proceed.
-> An engineer restarted the container, allowing you to try the MVU again.
-> During the retry, the MVU failed again due to insufficient disk space in the /tmp directory. This directory didn't have enough space to handle the upgrade process.
-> Memory issue: the B1ms SKU (a specific server configuration) has less than 1 GB of memory available for the cluster, which can cause MVU failures if the memory is nearly full.
-> We have addressed this in an upcoming release which removes the dependency for MVU.
-> For the third MVU attempt, our engineer initiated the upgrade from the backend and the upgrade proceeded without issue.

This confirms what some users are reporting here: simply restarting their instance fixed the problem. Until the new release that removes the dependency on available memory has shipped, temporarily upscaling your instance to one with more disk space and/or more memory might fix the problem as well.
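
Based on that feedback, a restart-first retry (with an optional temporary scale-up) would look roughly like this; resource group, server name and SKU are placeholders:

# 1. Check for settings still waiting on a restart
psql "host=<server>.postgres.database.azure.com dbname=postgres user=<admin> sslmode=require" \
  -c "SELECT name FROM pg_settings WHERE pending_restart;"

# 2. Restart the server so nothing is left pending
az postgres flexible-server restart --resource-group <rg> --name <server>

# 3. Optionally scale to a SKU with more memory for the duration of the upgrade
az postgres flexible-server update --resource-group <rg> --name <server> \
  --tier GeneralPurpose --sku-name Standard_D2s_v3

# 4. Retry the major version upgrade
az postgres flexible-server upgrade --resource-group <rg> --name <server> --version 16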