dwyl / auth

πŸšͺ πŸ” UX-focussed Turnkey Authentication Solution for Web Apps/APIs (Documented, Tested & Maintained)
https://authdemo.fly.dev
GNU General Public License v2.0
136 stars 9 forks source link

Internal Server Error 😒 πŸ”₯ #325

Open nelsonic opened 1 year ago

nelsonic commented 1 year ago

Is anyone else seeing an Internal Server Error on https://auth.dwyl.com/ ??

IMG_4570

Todo

iteles commented 1 year ago

Yep. https://github.com/dwyl/mvp/issues/440

iteles commented 1 year ago

Suspect it has something to do with this email received from fly.io 16 hours ago:

image
LuchoTurtle commented 1 year ago

Also confirm this error on my end 😒

nelsonic commented 1 year ago

Lame. Thanks fly.io. The reward for being a loyal+paying customer... πŸ€¦β€β™‚οΈ

nelsonic commented 1 year ago

Thankfully we have a "Staging" instance that still works flawlessly: https://authdemo.fly.dev/ So I'm temporarily switching over to that on the MVP until we can recover authprod.

image

Just re-deploying now because the machine restart didn't work to update the environment variable. πŸ™ƒ

nelsonic commented 1 year ago

MVP now redirects to authdemo:

image

Works for me:

image

@iteles / @LuchoTurtle please confirm you're able to access your items using the authdemo login. Thanks.

nelsonic commented 1 year ago

Definitely still going to try and recover the authprod instance when I have a bit more time. (maybe only tomorrow)

iteles commented 1 year ago

Interesting. It's working for me on Firefox desktop!

On Chrome desktop, I'm still getting an internal server error straight away (on the start page, not even getting to the login portion) even after clearing cookies and cached data πŸ€·β€β™€οΈ

But fixed for now, no point troubleshooting this further until we're back on prod.

nelsonic commented 1 year ago

@iteles In Chrome, if you visit https://mvp.fly.dev/ what do you see?

iteles commented 1 year ago
image
nelsonic commented 1 year ago

Well that's mega frustrating! Thank you Google Chrome! πŸ€¦β€β™‚οΈ But it's useful feedback that someone using the App could encounter the same error. πŸ’­ πŸ™ƒ Glad you proactively tested it in Firefox. πŸ‘Œ Can you still see all your items?

nelsonic commented 1 year ago

Works on my phone:

IMG_4593

but doesn’t look great. Some tidy up to be done. ⏳

iteles commented 1 year ago

Works on Chrome on my phone but I can't see any of my items.

nelsonic commented 1 year ago

Ok. So that means your items are tied to the person.id from authprod. πŸ€¦β€β™‚οΈ This is very useful insight. Need to make the person.id a cid based on the authprovider -> userid πŸ’­

LuchoTurtle commented 1 year ago

Only works if I clear my cookies. I'm doing this on the web. Although it now works after clearing them, all of my items are gone as well.

nelsonic commented 1 year ago

Trying to re-deploy authprod ... 🀞

n@MBP auth % fly deploy --verbose
==> Verifying app config
Validating /auth/fly.toml
Platform: machines
βœ“ Configuration is valid
--> Verified app config
==> Building image
Waiting for remote builder fly-builder-solitary-rain-5830... 🌎
nelsonic commented 1 year ago

Useless! 😒

WARN Remote builder did not start in time. Check remote builder logs with `flyctl logs -a fly-builder-solitary-rain-5830`

WARN Failed to start remote builder heartbeat: remote builder app unavailable

WARN failed to create build in graphql: returned error 504 Gateway Timeout: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

Error: failed to fetch an image or build from source: error connecting to docker: server returned a non-200 status code: 503
nelsonic commented 1 year ago

Web UI is unresponsive: https://fly.io/apps/authprod πŸ€·β€β™‚οΈ 😒

image

Gotta get going to Lx now. Will resume trying to figure this out later tonight. 🀞

nelsonic commented 1 year ago

Cannot access any of Fly.io dashboard now ...

image

"Something went wrong" ... indeed. πŸ€¦β€β™‚οΈ

nelsonic commented 1 year ago

Right. For the first time in a very long time I'm away from @home (family) 😒 and have a couple of hours to focus. πŸ§‘β€πŸ’» 🎯 Going to attempt to recover the authprodb and see if I can get the instance running again. 🀞 Will document every step along the way (as always). πŸ“ Wish me luck! πŸ€

nelsonic commented 1 year ago

Ok. Step 1: Attempt to create a backup of authprodb in case any of this goes wrong. πŸ”™

nelsonic commented 1 year ago

Following this helpful article: https://fly.io/docs/postgres/managing/backup-and-restore/

fly volumes list -a authprodb

ID                      STATE   NAME    SIZE    REGION  ZONE    ENCRYPTED   ATTACHED VM CREATED AT
vol_g2yxp4mllj6v63qd    created pg_data 10GB    lhr     34c3    false                   2 years ago
vol_1g67340ggznrydxw    created pg_data 10GB    lhr     805e    false                   2 years ago

ATTACHED VM: blank. There's your problem ...

image

Note: "ECNRYPTED: false" ... Not encrypted by Fly.io. But all the sensitive data is encrypted using fields.

List the snapshots:

fly volumes snapshots list vol_g2yxp4mllj6v63qd
Snapshots
ID                  SIZE        CREATED AT
vs_QGPN6pQObpLQpUma 96479939    10 hours ago
vs_za8NQRN70Dmnat9K 96479939    1 day ago
vs_8zZLpNxD3VB9Qhom 95992552    2 days ago
vs_OANN6OZkLJLNfz7  95895907    3 days ago
vs_GpqzBBJgL7DmltGK 95019181    4 days ago
vs_bjjMlxqMzXjl7sB  94680799    5 days ago

Dunno why it's continuing to take "dumb" snapshots if the data isn't changing. πŸ€·β€β™‚οΈ

So that guide doesn't actually deliver on "backing up" the DB. So I searched for: https://www.google.com/search?q=fly.io+postgres+backup+to+localhost

And the first hit is: https://community.fly.io/t/backup-and-restore-postgresql-in-fly-io-step-by-step/11455 image

No formatting of the input/output and using Windows πŸ™„ ... But it's better than nothing. Going to follow the steps.

First thing I had to do was find out what the Postgres password is for the authprodb instance as I don't have in any password manager.

https://community.fly.io/t/lost-postgres-password/4192/2

fly ssh console
echo $DATABASE_URL
postgres://authprod:redacted@top2.nearest.of.authprodb.internal:5432/authprod

This also tells us that the internal TCP port used is 5432. πŸ‘

Attempt to proxy the internal TCP port to a local one:

flyctl proxy 15432:5432 -a authprodb

Sadly, we get:

Error: authprodb.internal: host was not found in DNS

Not good. How about connecting directly to the Postgres VM?

fly ssh console -a authprodb

Got:

Error: app authprodb has no started VMs.
It may be unhealthy or not have been deployed yet.
Try the following command to verify:

fly status

Tried:

fly status -a authprodb

got:

No machines are available on this app authprodb

But the Fly.io Web interface says it's "deployed": https://fly.io/apps/authprodb πŸ€·β€β™‚οΈ

image

But the instance doesn't have an IP Address so it's not addressable ... πŸ™ˆ πŸ™ƒ Going to try and temporarily allocate an IP V4 Address: https://fly.io/docs/flyctl/ips/#usage

flyctl ips allocate-v4 -a authprodb

Prompt:

? Looks like you're accessing a paid feature. Dedicated IPv4 addresses now cost $2/mo. Are you ok with this? (y/N)

Sure, why not? If $2/month resolves this (and I will immediately remove the IP address) let's go!

Appears to have worked:

VERSION IP              TYPE    REGION  CREATED AT
v4      137.66.62.95    public  global  19s ago

137.66.62.95 is now listed in the IP Addresses section:

image

Let's try connecting to it again. 🀞

fly ssh console -a authprodb

No started VMs:

Error: app authprodb has no started VMs.
It may be unhealthy or not have been deployed yet.
Try the following command to verify:

fly status

Great. So it allows me to attach a paid IP address to a VM that is "deployed" but not reachable. πŸ‘Œ (πŸ€¦β€β™‚οΈ)

Given that it's useless to us, let's release that IPV4 address before we forget:

flyctl ips release 137.66.62.95

Confirmed:

Released 137.66.62.95 from authprod

Going to try a different approach. Let's grab the latest snapshot and create a new Postgres instance from it! πŸ†• 🀞 That way if I mess up, it's just on a new VM that can easily be dropped.

Recall from above that the latest snapshot ID was vs_QGPN6pQObpLQpUma ⬆️

Run:

fly postgres create --snapshot-id vs_QGPN6pQObpLQpUma

But got:

Error:  "Failed to resolve source machine from snapshot."

😒

Try to fork the db:

fly postgres create --fork-from authprodb

Fails:

Error: No machines associated with fork-from target. See `fly pg create --help` for more information

Helpful. πŸ€¦β€β™‚οΈ

Reading: https://fly.io/docs/postgres/postgres-on-nomad/#upgrading-the-postgres-app

nelsonic commented 1 year ago

Tried:

flyctl image show -a authprodb

Output:

MACHINE ID  REGISTRY    REPOSITORY  TAG VERSION DIGEST  LABELS

πŸ¦— πŸ¦— πŸ¦—... (i.e: nothing...!)

Reading this very recent forum topic: https://community.fly.io/t/postgres-crashed-cannot-restart-restore-or-fork/16186

Sadly not a very helpful comment from the OP: πŸ™ƒ

image

Read: https://fly.io/docs/flyctl/postgres-restart/

Tried:

flyctl postgres restart -a authprodb

Got:

Error: no active leader found

Read: https://fly.io/docs/apps/restart/

Tried restarting:

fly apps restart authprodb

Got:

Error: postgres apps should use `fly pg restart` instead

Great! Thanks fly ... πŸ€¦β€β™‚οΈ

Trying to "deploy" a machine:

fly deploy -a authprodb --verbose

Got the following output + error:

-------
 βœ– release_command failed
-------
Error release_command failed running on machine 1857439c5996d8 with exit code 1.
Check its logs: here's the last 100 lines below, or run 'fly logs -i 1857439c5996d8':
  Pulling container image registry.fly.io/authprodb:deployment-01HEA0YDEKTNN69C15H8MBZ7XJ
  Successfully prepared image registry.fly.io/authprodb:deployment-01HEA0YDEKTNN69C15H8MBZ7XJ (4.634590051s)
  Configuring firecracker
  [    0.042013] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
  [    0.047890] PCI: Fatal: No config space access function found
   INFO Starting init (commit: 15238e9)...
   INFO Preparing to run: `/app/bin/migrate` as nobody
   INFO [fly api proxy] listening at /.fly/api
  2023/11/03 07:22:14 listening on [fdaa:0:42c6:a7b:1bf:a08f:f999:2]:22 (DNS: [fdaa::3]:53)
  ERROR! Config provider Config.Reader failed with:
  ** (RuntimeError) environment variable DATABASE_URL is missing.
  For example: ecto://USER:PASS@HOST/DATABASE
      (stdlib 3.16.1) erl_eval.erl:685: :erl_eval.do_apply/6
      (stdlib 3.16.1) erl_eval.erl:446: :erl_eval.expr/5
      (stdlib 3.16.1) erl_eval.erl:123: :erl_eval.exprs/5
      (elixir 1.12.3) src/elixir.erl:280: :elixir.recur_eval/3
      (elixir 1.12.3) src/elixir.erl:265: :elixir.eval_forms/3
      (elixir 1.12.3) lib/code.ex:656: Code.eval_string_with_error_handling/3
      (elixir 1.12.3) lib/config.ex:258: Config.__eval__!/3
      (elixir 1.12.3) lib/config/reader.ex:86: Config.Reader.read!/2
  {"init terminating in do_boot",{#{'__exception__'=>true,'__struct__'=>'Elixir.RuntimeError',message=><<101,110,118,105,114,111,110,109,101,110,116,32,118,97,114,105,97,98,108,101,32,68,65,84,65,66,65,83,69,95,85,82,76,32,105,115,32,109,105,115,115,105,110,103,46,10,70,111,114,32,101,120,97,109,112,108,101,58,32,101,99,116,111,58,47,47,85,83,69,82,58,80,65,83,83,64,72,79,83,84,47,68,65,84,65,66,65,83,69,10>>},[{erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,685}]},{erl_eval,expr,5,[{file,"erl_eval.erl"},{line,446}]},{erl_eval,exprs,5,[{file,"erl_eval.erl"},{line,123}]},{elixir,recur_eval,3,[{file,"src/elixir.erl"},{line,280}]},{elixir,eval_forms,3,[{file,"src/elixir.erl"},{line,265}]},{'Elixir.Code',eval_string_with_error_handling,3,[{file,"lib/code.ex"},{line,656}]},{'Elixir.Config','__eval__!',3,[{file,"lib/config.ex"},{line,258}]},{'Elixir.Config.Reader','read!',2,[{file,"lib/config/reader.ex"},{line,86}]}]}}
  init terminating in do_boot ({,[{erl_eval,do_apply,6,[{_},{_}]},{erl_eval,expr,5,[{_},{_}]},{erl_eval,exprs,5,[{_},{_}]},{elixir,recur_eval,3,[{_},{_}]},{elixir,eval_forms,3,[{_},{_}]},{Elixir.Code,ev
  Crash dump is being written to: erl_crash.dump...done
   INFO Main child exited normally with code: 1
   INFO Starting clean up.
   WARN hallpass exited, pid: 314, status: signal: 15 (SIGTERM)
  2023/11/03 07:22:15 listening on [fdaa:0:42c6:a7b:1bf:a08f:f999:2]:22 (DNS: [fdaa::3]:53)
  [    2.293675] reboot: Restarting system
  machine restart policy set to 'no', not restarting
-------
Error: release command failed - aborting deployment. error release_command machine 1857439c5996d8 exited with non-zero status of 1

Dunno why it's trying to run the release command ... obviously it's going to fail!!

Trying to fork from the volume:

fly postgres create --fork-from authprodb:vol_g2yxp4mllj6v63qd

Output:

Creating app...
Setting secrets on app authprodb2...
Provisioning 1 of 1 machines with image flyio/postgres-flex:15.3@sha256:5e5fc53decb051f69b0850f0f5d137c92343fcd1131ec413015e5260620947ac
Waiting for machine to start...
Machine 1857509c162168 is created
==> Monitoring health checks
  Waiting for 1857509c162168 to become healthy (started, 3/3)

Postgres cluster authprodb2 created
  Username:    postgres
  Password:    redacted
  Hostname:    authprodb2.internal
  Flycast:     fdaa:0:42c6:0:1::6
  Proxy port:  5432
  Postgres port:  5433
  Connection string: postgres://postgres:redacted@authprodb2.flycast:5432

Save your credentials in a secure place -- you won't be able to see them again!

Connect to postgres
Any app within the DWYL Auth organization can connect to this Postgres using the above connection string

Now that you've set up Postgres, here's what you need to understand: https://fly.io/docs/postgres/getting-started/what-you-should-know/

That appears to have worked:

Reading: https://fly.io/docs/postgres/getting-started/what-you-should-know/ Recommends using a different service if you want managed Postgres ... Rabbit hole ...

Back to getting this working!

Tried:

flyctl postgres attach authprodb2 -a authprod

Got:

Error: consumer app "authprod" already contains a secret named DATABASE_URL

Duh.

Let's quickly connect to the newly created authprodb2 and see if it has data:

fly postgres connect -a authprodb2

List the databases:

\l

                                                List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    | ICU Locale | Locale Provider |   Access privileges
-----------+----------+----------+------------+------------+------------+-----------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 repmgr    | repmgr   | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
(4 rows)

None of these look like what we need. πŸ™„

Going to need to take a step back ... πŸ’­

nelsonic commented 1 year ago

Connecting to the (working) demo version of auth Postgres instance:

fly postgres connect -a authdemo-db

We see that the databases are:

\l
                                    List of databases
   Name    |   Owner    | Encoding |  Collate   |   Ctype    |     Access privileges
-----------+------------+----------+------------+------------+---------------------------
 authdemo  | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 |
 postgres  | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 |
 template0 | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 | =c/flypgadmin            +
           |            |          |            |            | flypgadmin=CTc/flypgadmin
 template1 | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 | =c/flypgadmin            +
           |            |          |            |            | flypgadmin=CTc/flypgadmin

Instead of the repmgr there is an authdemo database. πŸ’­

As per the echo $DATABASE_URL above:

postgres://authprod:redacted@top2.nearest.of.authprodb.internal:5432/authprod

We're expecting a database called authprod to exist in the newly created Postgres instance. πŸ€·β€β™‚οΈ

nelsonic commented 1 year ago

This got me scratching my head ... πŸ€” If we look at the https://fly.io/apps/authdemo-db/volumes we see that the authdemo-db is using 182 MB the available volume:

image

Whereas the authprodb volumes https://fly.io/apps/authprodb/volumes are using 0mb ... image

Going to contact Fly.io support.

nelsonic commented 1 year ago

Just in case I was going crazy, I tried to deploy the authprod app one last time:

fly deploy --verbose

Output:

==> Verifying app config
Validating /auth/fly.toml
Platform: machines
βœ“ Configuration is valid
--> Verified app config
==> Building image
Remote builder fly-builder-solitary-rain-5830 ready
==> Building image with Docker
--> docker host: 20.10.12 linux x86_64
[+] Building 1.8s (30/30) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                                                                     1.0s
 => => transferring dockerfile: 32B                                                                                                                                                                                      1.0s
 => [internal] load .dockerignore                                                                                                                                                                                        1.0s
 => => transferring context: 34B                                                                                                                                                                                         1.0s
 => [internal] load metadata for docker.io/library/debian:bullseye-20210902-slim                                                                                                                                         0.3s
 => [internal] load metadata for docker.io/hexpm/elixir:1.12.3-erlang-24.1.5-debian-bullseye-20210902-slim                                                                                                               0.4s
 => [internal] load build context                                                                                                                                                                                        0.3s
 => => transferring context: 29.65kB                                                                                                                                                                                     0.3s
 => [stage-1 1/6] FROM docker.io/library/debian:bullseye-20210902-slim@sha256:e3ed4be20c22a1358020358331d177aa2860632f25b21681d79204ace20455a6                                                                           0.0s
 => [builder  1/18] FROM docker.io/hexpm/elixir:1.12.3-erlang-24.1.5-debian-bullseye-20210902-slim@sha256:a2447f82e9125abb7edb0708eb51da638d8f40e7fda4b3ed5151bee883d476be                                               0.0s
 => CACHED [stage-1 2/6] RUN apt-get update -y && apt-get install -y libstdc++6 openssl libncurses5 locales   && apt-get clean && rm -f /var/lib/apt/lists/*_*                                                           0.0s
 => CACHED [stage-1 3/6] RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen                                                                                                                                0.0s
 => CACHED [stage-1 4/6] WORKDIR /app                                                                                                                                                                                    0.0s
 => CACHED [stage-1 5/6] RUN chown nobody /app                                                                                                                                                                           0.0s
 => CACHED [builder  2/18] RUN apt-get update -y && apt-get install -y build-essential git     && apt-get clean && rm -f /var/lib/apt/lists/*_*                                                                          0.0s
 => CACHED [builder  3/18] WORKDIR /app                                                                                                                                                                                  0.0s
 => CACHED [builder  4/18] RUN mix local.hex --force &&     mix local.rebar --force                                                                                                                                      0.0s
 => CACHED [builder  5/18] COPY .env_sample ./                                                                                                                                                                           0.0s
 => CACHED [builder  6/18] COPY mix.exs mix.lock ./                                                                                                                                                                      0.0s
 => CACHED [builder  7/18] RUN mix deps.get --only prod                                                                                                                                                                  0.0s
 => CACHED [builder  8/18] RUN mkdir config                                                                                                                                                                              0.0s
 => CACHED [builder  9/18] COPY config/config.exs config/prod.exs config/                                                                                                                                                0.0s
 => CACHED [builder 10/18] RUN mix deps.compile                                                                                                                                                                          0.0s
 => CACHED [builder 11/18] COPY priv priv                                                                                                                                                                                0.0s
 => CACHED [builder 12/18] COPY assets assets                                                                                                                                                                            0.0s
 => CACHED [builder 13/18] RUN mix assets.deploy                                                                                                                                                                         0.0s
 => CACHED [builder 14/18] COPY lib lib                                                                                                                                                                                  0.0s
 => CACHED [builder 15/18] RUN mix compile                                                                                                                                                                               0.0s
 => CACHED [builder 16/18] COPY config/runtime.exs config/                                                                                                                                                               0.0s
 => CACHED [builder 17/18] COPY rel rel                                                                                                                                                                                  0.0s
 => CACHED [builder 18/18] RUN mix release                                                                                                                                                                               0.0s
 => CACHED [stage-1 6/6] COPY --from=builder --chown=nobody:root /app/_build/prod/rel/auth ./                                                                                                                            0.0s
 => exporting to image                                                                                                                                                                                                   0.0s
 => => exporting layers                                                                                                                                                                                                  0.0s
 => => writing image sha256:5ba72f99464041b84ed2e1491eb709c1518dff88852706229e8d82f0f3af2d4c                                                                                                                             0.0s
 => => naming to registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02                                                                                                                                          0.0s
--> Building image done
==> Pushing image to fly
The push refers to repository [registry.fly.io/authprod]
7b5d131176f6: Layer already exists
4d5c3ee4879b: Layer already exists
0166b8c07570: Layer already exists
4dfd7d2f8a33: Layer already exists
11a55e4db61e: Layer already exists
ba5a5fe43301: Layer already exists
deployment-01HEADJTX39GT10NA7TY49HG02: digest: sha256:f5d586ae7234d29f8ef485ce6c06e5ec469f6ae66d77886459780fc172337147 size: 1575
--> Pushing image done
image: registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02
image size: 120 MB

Watch your deployment at https://fly.io/apps/authprod/monitoring

Running authprod release_command: /app/bin/migrate

-------
 βœ– release_command failed
-------
Error release_command failed running on machine 48ed037a343368 with exit code 1.
Check its logs: here's the last 100 lines below, or run 'fly logs -i 48ed037a343368':
  Pulling container image registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02
  Successfully prepared image registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02 (5.834199334s)
  [    0.041401] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
  [    0.042773] PCI: Fatal: No config space access function found
   INFO Starting init (commit: 15238e9)...
   INFO Preparing to run: `/app/bin/migrate` as nobody
   INFO [fly api proxy] listening at /.fly/api
  2023/11/03 11:02:57 listening on [fdaa:0:42c6:a7b:1be:99c2:78d9:2]:22 (DNS: [fdaa::3]:53)
  11:02:58.710 [error] Postgrex.Protocol (#PID<0.136.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.authprodb.internal:5432): non-existing domain - :nxdomain
  11:02:58.710 [error] Postgrex.Protocol (#PID<0.135.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.authprodb.internal:5432): non-existing domain - :nxdomain
  11:02:59.943 [error] Postgrex.Protocol (#PID<0.136.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.authprodb.internal:5432): non-existing domain - :nxdomain
  11:03:00.660 [error] Could not create schema migrations table. This error usually happens due to the following:
    * The database does not exist
    * The "schema_migrations" table, which Ecto uses for managing
      migrations, was defined by another library
    * There is a deadlock while migrating (such as using concurrent
      indexes with a migration_lock)
  To fix the first issue, run "mix ecto.create".
  To address the second, you can run "mix ecto.drop" followed by
  "mix ecto.create". Alternatively you may configure Ecto to use
  another table and/or repository for managing migrations:
      config :auth, Auth.Repo,
        migration_source: "some_other_table_for_schema_migrations",
        migration_repo: AnotherRepoForSchemaMigrations
  The full error report is shown below.
  ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2975ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
    1. Ensuring your database is available and that you can connect to it
    2. Tracking down slow queries and making sure they are running fast enough
    3. Increasing the pool_size (although this increases resource consumption)
    4. Allowing requests to wait longer by increasing :queue_target and :queue_interval
  See DBConnection.start_link/2 for more information
      (ecto_sql 3.7.2) lib/ecto/adapters/sql.ex:760: Ecto.Adapters.SQL.raise_sql_call_error/1
      (elixir 1.12.3) lib/enum.ex:1582: Enum."-map/2-lists^map/1-0-"/2
      (ecto_sql 3.7.2) lib/ecto/adapters/sql.ex:852: Ecto.Adapters.SQL.execute_ddl/4
      (ecto_sql 3.7.2) lib/ecto/migrator.ex:678: Ecto.Migrator.verbose_schema_migration/3
      (ecto_sql 3.7.2) lib/ecto/migrator.ex:504: Ecto.Migrator.lock_for_migrations/4
      (ecto_sql 3.7.2) lib/ecto/migrator.ex:419: Ecto.Migrator.run/4
      (ecto_sql 3.7.2) lib/ecto/migrator.ex:146: Ecto.Migrator.with_repo/3
      (auth 1.6.6) lib/auth/release.ex:12: anonymous fn/2 in Auth.Release.migrate/0
   INFO Main child exited normally with code: 1
   WARN Reaped child process with pid: 359 and signal: SIGUSR1, core dumped? false
   INFO Starting clean up.
   WARN hallpass exited, pid: 306, status: signal: 15 (SIGTERM)
  2023/11/03 11:03:01 listening on [fdaa:0:42c6:a7b:1be:99c2:78d9:2]:22 (DNS: [fdaa::3]:53)
  [    5.255192] reboot: Restarting system
  machine restart policy set to 'no', not restarting
-------
Error: release command failed - aborting deployment. error release_command machine 48ed037a343368 exited with non-zero status of 1

fly deploy definitely doesn't work because it cannot run the mix release command because there is no Postgres instance available to migrate. πŸ™…

nelsonic commented 1 year ago

Posted Forum topic: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247

Snapshot: image

Full text of my forum topic in case it gets deleted:

Opening this question in case anyone `else` is in the same boat (having their `Nomad`-based `Postgres` instance killed ...) 

# Context

We deployed our [`auth`](https://github.com/dwyl/auth) (`Elixir/Phoenix`) app to Fly.io in 2022. :rocket: 
It was a good experience and worked flawlessly until they killed it. :neutral_face: 

# Fly.io App Disabled πŸ’€

We received the following email from Fly.io: 

![image|690x464](upload://xgMVmyDeZnevWHqqpE3bKQLJFHV.png)

## `Fulltext` of the email for Google indexing/findability πŸ“§

```md
Hi. We’re in the process of removing Nomad from Fly.io. This means any apps that haven’t been upgraded to Apps V2 will no longer function.

Your app authprodb is one of those apps. Here’s what this means for you:

1. authprodb is currently disabled.
  - That means the Nomad instances (virtual machines) 
     that ran your app have been shut down, but that the app’s configuration still exists.
2. Your app is technically now on the β€œmachines” platform, rather than β€œnomad.”
  - That means that any subsequent fly deploy on your app will attempt to launch Fly Machines.
  - In many cases, a deploy will immediately work. In some cases, there may be some additional configuration required to get your app running on Machines. 
    The community forum is a great place to get help troubleshooting your deployment if you get stuck.
3. Nothing has been deleted.
  - Your app’s Docker image still exists, and any Volumes that were attached to your app still exist.

Thanks for bearing with us throughout this whole migration saga. It’s not been easy, but the results should soon speak for themselves.

If you have any questions or comments, we’ve posted a similar announcement on the community forum.

Thanks, Your Friends at Fly.io

They key point of the email: they shut down our Postgres instance and according to point 3 "Nothing has been deleted".

"Nothing has been deleted" ... :thought_balloon:

So I dived into trying to recover the database. Tried re-deploying the authprod App that was attached to the authprodb Postgres instance:

fly deploy --verbose

It errors because the Elixir App cannot run the mix release command:

Error: release command failed - aborting deployment. error release_command machine 48ed037a343368 exited with non-zero status of 1

Full log output in: https://github.com/dwyl/auth/issues/325#issuecomment-1792265725

After a couple of hours searching and trying everything I could think of with zero progress. I'm opening this forum topic.

Question: Why is the Volume Used 0 MB?

Viewing the volumes for the authprodb, they have Used 0 MB i.e. no data https://fly.io/apps/authprodb/volumes image

image

Luckily, we have a working "staging" version of the auth App also hosted on Fly.io: https://authdemo.fly.dev/ and can easily compare the Volume: https://fly.io/apps/authdemo-db/volumes we see that the authdemo-db is using 182 MB the available volume:

image

This suggests that our data was in fact deleted when Fly.io disabled our App.

If anyone has seen this issue and can help us recover our Postgres data, please comment! :pray:

nelsonic commented 1 year ago

Replied to the support email:

image

Ball is in their court. 🎾 I get the sinking feeling that the data for authprodb is lost. πŸ’­ Which is going to be a huge headache. πŸ˜– But let's wait and see what they say. ⏳

AFK to help parents with something for the next couple of hours. 🀞

nelsonic commented 1 year ago

Meanwhile I've wasted my whole morning on this ... ⏳ πŸ”₯ 😒 image

nelsonic commented 1 year ago

Followed-up with support@fly.io by email. If the data for authprodb is in fact gone, we need to move on. πŸ’­

nelsonic commented 1 year ago

Got a couple of replies on the forum: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/5

image

Tried creating a fork from the second volume:

fly postgres create --fork-from authprodb:vol_1g67340ggznrydxw

Output:

? Select VM size: shared-cpu-1x - CPU Kind: Shared, vCPUs: 1 Memory: 256MB
Creating postgres cluster in organization dwyl-auth-546
Creating app...
Setting secrets on app authprodb3...
Provisioning 1 of 1 machines with image flyio/postgres-flex:15.3@sha256:44b698752cf113110f2fa72443d7fe452b48228aafbb0d93045ef1e3282360a6
Waiting for machine to start...
Machine e82d377a069728 is created
==> Monitoring health checks
  Waiting for e82d377a069728 to become healthy (started, 3/3)

Postgres cluster authprodb3 created
  Username:    postgres
  Password:    redacted
  Hostname:    authprodb3.internal
  Flycast:     fdaa:0:42c6:0:1::7
  Proxy port:  5432
  Postgres port:  5433
  Connection string: postgres://postgres:redacted@authprodb3.flycast:5432

Save your credentials in a secure place -- you won't be able to see them again!

Connect to postgres
Any app within the DWYL Auth organization can connect to this Postgres using the above connection string

Now that you've set up Postgres, here's what you need to understand: https://fly.io/docs/postgres/getting-started/what-you-should-know/

Connect:

fly postgres connect -a authprodb3
\l
                                                List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    | ICU Locale | Locale Provider |   Access privileges
-----------+----------+----------+------------+------------+------------+-----------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 repmgr    | repmgr   | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
(4 rows)

these-are-not-the-tables

I replied on the thread.

nelsonic commented 1 year ago

Got a reply on the forum (but still nothing in email ...) https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/8

image

Only saw it now ... didn't get a notification from the forum. We need a "Reminders" feature.

Going to try the command now:

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region CHOOSE --org your-org --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_yaddayaddayadda

Re-ran the command to check the snapshots list:

fly volume snapshots list vol_g2yxp4mllj6v63qd

Snapshots
ID                  SIZE        CREATED AT
vs_vYXP8wx6ezM53CR  96479939    11 hours ago
vs_vnAvN76g36Apwt2A 96479939    1 day ago
vs_azyMXoAexXOG9tA7 96479939    2 days ago
vs_JjO3bbl7nbV2oSX5 96479939    3 days ago
vs_VD9Zm3LRVqQ6lfz7 96479939    4 days ago
vs_gnkGR2jXV7yM3sqN 96479939    5 days ago

Using the latest snapshot: vs_vYXP8wx6ezM53CR

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_vYXP8wx6ezM53CR
Creating app...
Setting secrets on app new-db...
Restoring 1 of 1 machines with image registry-1.docker.io/flyio/postgres:13
Waiting for machine to start...

Gotta go be a Dad for a few hours. BRB. ⏳

nelsonic commented 1 year ago
fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_vYXP8wx6ezM53CR
Creating postgres cluster in organization dwyl-auth-546
Creating app...
Setting secrets on app new-db...
Restoring 1 of 1 machines with image registry-1.docker.io/flyio/postgres:13
Waiting for machine to start...
Machine e784434c420ed8 is created
==> Monitoring health checks
  Waiting for e784434c420ed8 to become healthy (started, 0/3)
Error: context deadline exceeded

great. going to have to run the command again ...

nelsonic commented 1 year ago

Posted a comment on the forum topic to this effect: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/9

Not much else I can do on this right now. Other than creating a new empty DB and starting again. πŸ’­

nelsonic commented 1 year ago

Given the error above, I had deleted new-db to avoid incurring unnecessary charges. So I had to re-run the command to re-create new-db:

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_jXlNXQLxgLDeXclV

But sadly, Fly.io is not fully operational right now: https://fly.io/dashboard/dwyl-auth/status

fly-status-docker-fail

So the response is:

Creating app...
Error: server returned a non-200 status code: 504

Will need to revisit this in the morning. ⏳

nelsonic commented 1 year ago

Resume: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/14

nelsonic commented 12 months ago

At home now and still getting error trying to revive the DB:

Creating postgres cluster in organization dwyl-auth-546
Creating app...
Setting secrets on app new-db...
Restoring 1 of 1 machines with image registry-1.docker.io/flyio/postgres:13
Waiting for machine to start...
Machine 784e679b453408 is created
==> Monitoring health checks
  Waiting for 784e679b453408 to become healthy (started, 0/3)
Error: context deadline exceeded
nelsonic commented 12 months ago

But when visiting the dashboard, we see that the new-db was created: https://fly.io/dashboard/dwyl-auth-546

image

Replied on the thread: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/16?u=dwyl_auth

nelsonic commented 10 months ago

On the advice of the Daniel https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/18 I created a new DB instance to attempt to recover the data ...

https://fly.io/dashboard/dwyl-auth-546

image

It didn't work and the thread was automatically closed without any resolution. 😒 But the instance is still running and costing us a small fortune!! So I'm killing it now:

image

The prod instance is still not recovered. 😒 Really wish I had someone to pair on this ...

image

nelsonic commented 3 months ago

Going to attempt to host the Postgres on DigitalOcean https://github.com/dwyl/learn-devops/issues/90 with the Auth App still running on Fly.io ref: https://github.com/dwyl/learn-devops/issues/93

Meanwhile our Auth App has a mountain of credits: https://fly.io/dashboard/dwyl-auth-546

image

And since the first $5/month is "free", we aren't consuming them. So I might bump the App up a notch once the DB transferred to DigitalOcean.

But just to say I'm tackling this now! ⏳ (or at least investing an hour into this and seeing how far I get ...)

nelsonic commented 3 months ago
fly auth login
Opening https://fly.io/app/auth/cli/6c617a6637656868646571 ...

Waiting for session... Done
successfully logged in as nelsonic
fly status
App
  Name     = authprod
  Owner    = dwyl-auth-546
  Hostname = authprod.fly.dev
  Image    = authprod:deployment-01GK3WCNQ8Q374

Machines
PROCESS ID              VERSION REGION  STATE   ROLE    CHECKS              LAST UPDATED  
app     6e82d924ad3742  58      lhr     started         1 total, 1 passing  2023-11-03T07:20:03Z    

All running. πŸ†—

nelsonic commented 3 months ago

Going to detach the (empty) DB: https://fly.io/docs/postgres/managing/attach-detach/#detach-a-fly-app

Example:

fly postgres detach <postgres app name> --app <app name>

Ours:

fly postgres detach  authprodb --app authprod

Got:

Error: no 6pn ips founds for authprodb app

Didin't expect it to be easy ... πŸ™„ But what is this error? πŸ€·β€β™‚οΈ

nelsonic commented 3 months ago

Exact error message for forum title/topic: https://community.fly.io/t/6pn-ips-are-not-static-re-fly-postgres-last-thurs/20022

Reading now ... ⏳

nelsonic commented 3 months ago

Turns out the authprod App is not connected to the authprodb: https://fly.io/docs/postgres/managing/attach-detach/#check-if-a-fly-app-is-attached-to-a-postgres-app

fly postgres users list --app authprodb
Error: no active leader found

Fly Postgres is Total Garbage. πŸ—‘οΈ 😒

Why aren't these status checks all automatic?! πŸ€¦β€β™‚οΈ Surely the Web Interface could inform us that the health of the Postgres is "no active leader". πŸ€·β€β™‚οΈ

nelsonic commented 3 months ago

Interrupted by children. πŸ§’ ❀️ So no further progress will be made on this today. Wish I could clone myself and be a good parent and still work on "side projects" ... πŸ’­

nelsonic commented 3 months ago
==> ecto
Compiling 56 files (.ex)
     warning: missing parentheses for expression following "do:" keyword. 
      Parentheses are required to solve ambiguity inside keywords.

     This error happens when you have function calls without parentheses inside keywords. For example:

         function(arg, one: nested_call a, b, c)
         function(arg, one: if expr, do: :this, else: :that)

     In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call". 
      Or if the keywords "do" and "else" apply to the function "function" or "if". 
      You can solve this by explicitly adding parentheses:

         function(arg, one: if(expr, do: :this, else: :that))
         function(arg, one: nested_call(a, b, c))

     Ambiguity found at:
     β”‚
 568 β”‚     do: raise ArgumentError, "cast/3 expects a list of atom keys, got: `#{inspect key}`"
     β”‚     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     β”‚
     └─ lib/ecto/changeset.ex:568

     warning: missing parentheses for expression following "do:" keyword. 
     Parentheses are required to solve ambiguity inside keywords.

     This error happens when you have function calls without parentheses inside keywords. For example:

         function(arg, one: nested_call a, b, c)
         function(arg, one: if expr, do: :this, else: :that)

     In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call". 
     Or if the keywords "do" and "else" apply to the function "function" or "if". 
     You can solve this by explicitly adding parentheses:

         function(arg, one: if(expr, do: :this, else: :that))
         function(arg, one: nested_call(a, b, c))

     Ambiguity found at:
     β”‚
 445 β”‚     do: raise ArgumentError,
     β”‚     ~~~~~~~~~~~~~~~~~~~~~~~~
     β”‚
     └─ lib/ecto/association.ex:445

      warning: missing parentheses for expression following "do:" keyword. 
      Parentheses are required to solve ambiguity inside keywords.

      This error happens when you have function calls without parentheses inside keywords. For example:

          function(arg, one: nested_call a, b, c)
          function(arg, one: if expr, do: :this, else: :that)

      In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call". 
      Or if the keywords "do" and "else" apply to the function "function" or "if". 
      You can solve this by explicitly adding parentheses:

          function(arg, one: if(expr, do: :this, else: :that))
          function(arg, one: nested_call(a, b, c))

      Ambiguity found at:
      β”‚
 2904 β”‚     raise ArgumentError, "cannot add constraint to changeset because it does not have a source, got: #{inspect data}"
      β”‚     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      β”‚
      └─ lib/ecto/changeset.ex:2904

      warning: missing parentheses for expression following "do:" keyword. Parentheses are required to solve ambiguity inside keywords.

      This error happens when you have function calls without parentheses inside keywords. For example:

          function(arg, one: nested_call a, b, c)
          function(arg, one: if expr, do: :this, else: :that)

      In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call". 
      Or if the keywords "do" and "else" apply to the function "function" or "if". 
      You can solve this by explicitly adding parentheses:

          function(arg, one: if(expr, do: :this, else: :that))
          function(arg, one: nested_call(a, b, c))

      Ambiguity found at:
      β”‚
 2906 β”‚     raise ArgumentError, "cannot add constraint because a changeset was not supplied, got: #{inspect item}"
      β”‚     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      β”‚
      └─ lib/ecto/changeset.ex:2906

     warning: atom ::: must be written between quotes, as in :"::", to avoid ambiguity
     β”‚
 260 β”‚       {:::, _, [left, _]} -> is_integer(left) or is_binary(left)
     β”‚        ~
     β”‚
     └─ lib/ecto/query/builder.ex:260:8

     warning: missing parentheses for expression following "do:" keyword. 
    Parentheses are required to solve ambiguity inside keywords.

     This error happens when you have function calls without parentheses inside keywords. For example:

         function(arg, one: nested_call a, b, c)
         function(arg, one: if expr, do: :this, else: :that)

     In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call". 
     Or if the keywords "do" and "else" apply to the function "function" or "if". 
     You can solve this by explicitly adding parentheses:

         function(arg, one: if(expr, do: :this, else: :that))
         function(arg, one: nested_call(a, b, c))

     Ambiguity found at:
     β”‚
 621 β”‚     do: raise ArgumentError, "a changeset with action #{inspect given} was given to #{inspect repo}.#{action}/2"
     β”‚     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     β”‚
     └─ lib/ecto/repo/schema.ex:621

== Compilation error in file lib/ecto/query.ex ==
** (Kernel.TypespecError) lib/ecto/query.ex:428: type dynamic/0 is a built-in type and it cannot be redefined
    (elixir 1.17.2) lib/kernel/typespec.ex:969: Kernel.Typespec.compile_error/2
    (stdlib 5.2.3) lists.erl:1599: :lists.foldl_1/3
    (elixir 1.17.2) lib/kernel/typespec.ex:226: Kernel.Typespec.translate_typespecs_for_module/2
could not compile dependency :ecto, "mix compile" failed. Errors may have been logged above. 
You can recompile this dependency with "mix deps.compile ecto --force", 
update it with "mix deps.update ecto" or clean it with "mix deps.clean ecto"

I obviously tried updating ecto but that didn't do anything. πŸ˜•