Open nelsonic opened 1 year ago
Suspect it has something to do with this email received from fly.io
16 hours ago:
Also confirm this error on my end π’
Lame. Thanks fly.io
. The reward for being a loyal+paying customer... π€¦ββοΈ
Thankfully we have a "Staging" instance that still works flawlessly: https://authdemo.fly.dev/
So I'm temporarily switching over to that on the MVP until we can recover authprod
.
Just re-deploying now because the machine restart didn't work to update the environment variable. π
MVP now redirects to authdemo:
Works for me:
@iteles / @LuchoTurtle please confirm you're able to access your items
using the authdemo
login.
Thanks.
Definitely still going to try and recover the authprod
instance when I have a bit more time. (maybe only tomorrow)
Interesting. It's working for me on Firefox desktop!
On Chrome desktop, I'm still getting an internal server error straight away (on the start page, not even getting to the login portion) even after clearing cookies and cached data π€·ββοΈ
But fixed for now, no point troubleshooting this further until we're back on prod
.
@iteles In Chrome, if you visit https://mvp.fly.dev/ what do you see?
Well that's mega frustrating! Thank you Google Chrome
! π€¦ββοΈ
But it's useful feedback that someone using the App could encounter the same error. π π
Glad you proactively tested it in Firefox
. π
Can you still see all your items
?
Works on my phone:
but doesnβt look great. Some tidy up to be done. β³
Works on Chrome on my phone but I can't see any of my items.
Ok. So that means your items
are tied to the person.id
from authprod
. π€¦ββοΈ
This is very useful insight. Need to make the person.id
a cid
based on the authprovider -> userid
π
Only works if I clear my cookies. I'm doing this on the web. Although it now works after clearing them, all of my items are gone as well.
Trying to re-deploy authprod
... π€
n@MBP auth % fly deploy --verbose
==> Verifying app config
Validating /auth/fly.toml
Platform: machines
β Configuration is valid
--> Verified app config
==> Building image
Waiting for remote builder fly-builder-solitary-rain-5830... π
Useless! π’
WARN Remote builder did not start in time. Check remote builder logs with `flyctl logs -a fly-builder-solitary-rain-5830`
WARN Failed to start remote builder heartbeat: remote builder app unavailable
WARN failed to create build in graphql: returned error 504 Gateway Timeout: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
Error: failed to fetch an image or build from source: error connecting to docker: server returned a non-200 status code: 503
Web UI is unresponsive: https://fly.io/apps/authprod π€·ββοΈ π’
Gotta get going to Lx
now. Will resume trying to figure this out later tonight. π€
Cannot access any of Fly.io
dashboard now ...
"Something went wrong" ... indeed. π€¦ββοΈ
Right. For the first time in a very long time I'm away from @home (family) π’ and have a couple of hours to focus. π§βπ» π―
Going to attempt to recover the authprodb
and see if I can get the instance running again. π€
Will document every step along the way (as always). π
Wish me luck! π
Ok. Step 1: Attempt to create a backup of authprodb
in case any of this goes wrong. π
Following this helpful article: https://fly.io/docs/postgres/managing/backup-and-restore/
fly volumes list -a authprodb
ID STATE NAME SIZE REGION ZONE ENCRYPTED ATTACHED VM CREATED AT
vol_g2yxp4mllj6v63qd created pg_data 10GB lhr 34c3 false 2 years ago
vol_1g67340ggznrydxw created pg_data 10GB lhr 805e false 2 years ago
ATTACHED VM: blank. There's your problem ...
Note: "ECNRYPTED: false" ... Not encrypted by Fly.io. But all the sensitive data is encrypted using
fields
.
List the snapshots:
fly volumes snapshots list vol_g2yxp4mllj6v63qd
Snapshots
ID SIZE CREATED AT
vs_QGPN6pQObpLQpUma 96479939 10 hours ago
vs_za8NQRN70Dmnat9K 96479939 1 day ago
vs_8zZLpNxD3VB9Qhom 95992552 2 days ago
vs_OANN6OZkLJLNfz7 95895907 3 days ago
vs_GpqzBBJgL7DmltGK 95019181 4 days ago
vs_bjjMlxqMzXjl7sB 94680799 5 days ago
Dunno why it's continuing to take "dumb" snapshots if the data isn't changing. π€·ββοΈ
So that guide doesn't actually deliver on "backing up" the DB. So I searched for: https://www.google.com/search?q=fly.io+postgres+backup+to+localhost
And the first hit is: https://community.fly.io/t/backup-and-restore-postgresql-in-fly-io-step-by-step/11455
No formatting of the input/output and using Windows π ... But it's better than nothing. Going to follow the steps.
First thing I had to do was find out what the Postgres
password is for the authprodb
instance as I don't have in any password manager.
https://community.fly.io/t/lost-postgres-password/4192/2
fly ssh console
echo $DATABASE_URL
postgres://authprod:redacted@top2.nearest.of.authprodb.internal:5432/authprod
This also tells us that the internal TCP port used is 5432
. π
Attempt to proxy the internal TCP port to a local one:
flyctl proxy 15432:5432 -a authprodb
Sadly, we get:
Error: authprodb.internal: host was not found in DNS
Not good. How about connecting directly to the Postgres VM?
fly ssh console -a authprodb
Got:
Error: app authprodb has no started VMs.
It may be unhealthy or not have been deployed yet.
Try the following command to verify:
fly status
Tried:
fly status -a authprodb
got:
No machines are available on this app authprodb
But the Fly.io Web interface says it's "deployed": https://fly.io/apps/authprodb π€·ββοΈ
But the instance doesn't have an IP Address so it's not addressable ... π π Going to try and temporarily allocate an IP V4 Address: https://fly.io/docs/flyctl/ips/#usage
flyctl ips allocate-v4 -a authprodb
Prompt:
? Looks like you're accessing a paid feature. Dedicated IPv4 addresses now cost $2/mo. Are you ok with this? (y/N)
Sure, why not? If $2/month resolves this (and I will immediately remove the IP address) let's go!
Appears to have worked:
VERSION IP TYPE REGION CREATED AT
v4 137.66.62.95 public global 19s ago
137.66.62.95
is now listed in the IP Addresses section:
Let's try connecting to it again. π€
fly ssh console -a authprodb
No started VMs:
Error: app authprodb has no started VMs.
It may be unhealthy or not have been deployed yet.
Try the following command to verify:
fly status
Great. So it allows me to attach a paid IP address to a VM that is "deployed" but not reachable. π (π€¦ββοΈ)
Given that it's useless to us, let's release that IPV4 address before we forget:
flyctl ips release 137.66.62.95
Confirmed:
Released 137.66.62.95 from authprod
Going to try a different approach. Let's grab the latest snapshot and create a new Postgres
instance from it! π π€
That way if I mess up, it's just on a new VM that can easily be dropped.
Recall from above that the latest snapshot ID was vs_QGPN6pQObpLQpUma
β¬οΈ
Run:
fly postgres create --snapshot-id vs_QGPN6pQObpLQpUma
But got:
Error: "Failed to resolve source machine from snapshot."
Try to fork the db:
fly postgres create --fork-from authprodb
Fails:
Error: No machines associated with fork-from target. See `fly pg create --help` for more information
Helpful. π€¦ββοΈ
Reading: https://fly.io/docs/postgres/postgres-on-nomad/#upgrading-the-postgres-app
Tried:
flyctl image show -a authprodb
Output:
MACHINE ID REGISTRY REPOSITORY TAG VERSION DIGEST LABELS
Reading this very recent forum topic: https://community.fly.io/t/postgres-crashed-cannot-restart-restore-or-fork/16186
Sadly not a very helpful comment from the OP: π
Read: https://fly.io/docs/flyctl/postgres-restart/
Tried:
flyctl postgres restart -a authprodb
Got:
Error: no active leader found
Read: https://fly.io/docs/apps/restart/
Tried restarting:
fly apps restart authprodb
Got:
Error: postgres apps should use `fly pg restart` instead
Great! Thanks fly
... π€¦ββοΈ
Trying to "deploy" a machine:
fly deploy -a authprodb --verbose
Got the following output + error:
-------
β release_command failed
-------
Error release_command failed running on machine 1857439c5996d8 with exit code 1.
Check its logs: here's the last 100 lines below, or run 'fly logs -i 1857439c5996d8':
Pulling container image registry.fly.io/authprodb:deployment-01HEA0YDEKTNN69C15H8MBZ7XJ
Successfully prepared image registry.fly.io/authprodb:deployment-01HEA0YDEKTNN69C15H8MBZ7XJ (4.634590051s)
Configuring firecracker
[ 0.042013] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
[ 0.047890] PCI: Fatal: No config space access function found
INFO Starting init (commit: 15238e9)...
INFO Preparing to run: `/app/bin/migrate` as nobody
INFO [fly api proxy] listening at /.fly/api
2023/11/03 07:22:14 listening on [fdaa:0:42c6:a7b:1bf:a08f:f999:2]:22 (DNS: [fdaa::3]:53)
ERROR! Config provider Config.Reader failed with:
** (RuntimeError) environment variable DATABASE_URL is missing.
For example: ecto://USER:PASS@HOST/DATABASE
(stdlib 3.16.1) erl_eval.erl:685: :erl_eval.do_apply/6
(stdlib 3.16.1) erl_eval.erl:446: :erl_eval.expr/5
(stdlib 3.16.1) erl_eval.erl:123: :erl_eval.exprs/5
(elixir 1.12.3) src/elixir.erl:280: :elixir.recur_eval/3
(elixir 1.12.3) src/elixir.erl:265: :elixir.eval_forms/3
(elixir 1.12.3) lib/code.ex:656: Code.eval_string_with_error_handling/3
(elixir 1.12.3) lib/config.ex:258: Config.__eval__!/3
(elixir 1.12.3) lib/config/reader.ex:86: Config.Reader.read!/2
{"init terminating in do_boot",{#{'__exception__'=>true,'__struct__'=>'Elixir.RuntimeError',message=><<101,110,118,105,114,111,110,109,101,110,116,32,118,97,114,105,97,98,108,101,32,68,65,84,65,66,65,83,69,95,85,82,76,32,105,115,32,109,105,115,115,105,110,103,46,10,70,111,114,32,101,120,97,109,112,108,101,58,32,101,99,116,111,58,47,47,85,83,69,82,58,80,65,83,83,64,72,79,83,84,47,68,65,84,65,66,65,83,69,10>>},[{erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,685}]},{erl_eval,expr,5,[{file,"erl_eval.erl"},{line,446}]},{erl_eval,exprs,5,[{file,"erl_eval.erl"},{line,123}]},{elixir,recur_eval,3,[{file,"src/elixir.erl"},{line,280}]},{elixir,eval_forms,3,[{file,"src/elixir.erl"},{line,265}]},{'Elixir.Code',eval_string_with_error_handling,3,[{file,"lib/code.ex"},{line,656}]},{'Elixir.Config','__eval__!',3,[{file,"lib/config.ex"},{line,258}]},{'Elixir.Config.Reader','read!',2,[{file,"lib/config/reader.ex"},{line,86}]}]}}
init terminating in do_boot ({,[{erl_eval,do_apply,6,[{_},{_}]},{erl_eval,expr,5,[{_},{_}]},{erl_eval,exprs,5,[{_},{_}]},{elixir,recur_eval,3,[{_},{_}]},{elixir,eval_forms,3,[{_},{_}]},{Elixir.Code,ev
Crash dump is being written to: erl_crash.dump...done
INFO Main child exited normally with code: 1
INFO Starting clean up.
WARN hallpass exited, pid: 314, status: signal: 15 (SIGTERM)
2023/11/03 07:22:15 listening on [fdaa:0:42c6:a7b:1bf:a08f:f999:2]:22 (DNS: [fdaa::3]:53)
[ 2.293675] reboot: Restarting system
machine restart policy set to 'no', not restarting
-------
Error: release command failed - aborting deployment. error release_command machine 1857439c5996d8 exited with non-zero status of 1
Dunno why it's trying to run the release command ... obviously it's going to fail!!
Trying to fork from the volume:
fly postgres create --fork-from authprodb:vol_g2yxp4mllj6v63qd
Output:
Creating app...
Setting secrets on app authprodb2...
Provisioning 1 of 1 machines with image flyio/postgres-flex:15.3@sha256:5e5fc53decb051f69b0850f0f5d137c92343fcd1131ec413015e5260620947ac
Waiting for machine to start...
Machine 1857509c162168 is created
==> Monitoring health checks
Waiting for 1857509c162168 to become healthy (started, 3/3)
Postgres cluster authprodb2 created
Username: postgres
Password: redacted
Hostname: authprodb2.internal
Flycast: fdaa:0:42c6:0:1::6
Proxy port: 5432
Postgres port: 5433
Connection string: postgres://postgres:redacted@authprodb2.flycast:5432
Save your credentials in a secure place -- you won't be able to see them again!
Connect to postgres
Any app within the DWYL Auth organization can connect to this Postgres using the above connection string
Now that you've set up Postgres, here's what you need to understand: https://fly.io/docs/postgres/getting-started/what-you-should-know/
That appears to have worked:
Reading: https://fly.io/docs/postgres/getting-started/what-you-should-know/ Recommends using a different service if you want managed Postgres ... Rabbit hole ...
Back to getting this working!
Tried:
flyctl postgres attach authprodb2 -a authprod
Got:
Error: consumer app "authprod" already contains a secret named DATABASE_URL
Duh.
Let's quickly connect to the newly created authprodb2
and see if it has data:
fly postgres connect -a authprodb2
List the databases:
\l
List of databases
Name | Owner | Encoding | Collate | Ctype | ICU Locale | Locale Provider | Access privileges
-----------+----------+----------+------------+------------+------------+-----------------+-----------------------
postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 | | libc |
repmgr | repmgr | UTF8 | en_US.utf8 | en_US.utf8 | | libc |
template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | | libc | =c/postgres +
| | | | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | | libc | =c/postgres +
| | | | | | | postgres=CTc/postgres
(4 rows)
None of these look like what we need. π
Going to need to take a step back ... π
Connecting to the (working) demo
version of auth
Postgres instance:
fly postgres connect -a authdemo-db
We see that the databases are:
\l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+------------+----------+------------+------------+---------------------------
authdemo | flypgadmin | UTF8 | en_US.utf8 | en_US.utf8 |
postgres | flypgadmin | UTF8 | en_US.utf8 | en_US.utf8 |
template0 | flypgadmin | UTF8 | en_US.utf8 | en_US.utf8 | =c/flypgadmin +
| | | | | flypgadmin=CTc/flypgadmin
template1 | flypgadmin | UTF8 | en_US.utf8 | en_US.utf8 | =c/flypgadmin +
| | | | | flypgadmin=CTc/flypgadmin
Instead of the repmgr
there is an authdemo
database. π
As per the echo $DATABASE_URL
above:
postgres://authprod:redacted@top2.nearest.of.authprodb.internal:5432/authprod
We're expecting a database called authprod
to exist in the newly created Postgres instance. π€·ββοΈ
This got me scratching my head ... π€
If we look at the https://fly.io/apps/authdemo-db/volumes we see that the authdemo-db
is using 182 MB
the available volume:
Whereas the authprodb
volumes https://fly.io/apps/authprodb/volumes are using 0mb
...
Going to contact Fly.io support.
Just in case I was going crazy, I tried to deploy the authprod
app one last time:
fly deploy --verbose
Output:
==> Verifying app config
Validating /auth/fly.toml
Platform: machines
β Configuration is valid
--> Verified app config
==> Building image
Remote builder fly-builder-solitary-rain-5830 ready
==> Building image with Docker
--> docker host: 20.10.12 linux x86_64
[+] Building 1.8s (30/30) FINISHED
=> [internal] load build definition from Dockerfile 1.0s
=> => transferring dockerfile: 32B 1.0s
=> [internal] load .dockerignore 1.0s
=> => transferring context: 34B 1.0s
=> [internal] load metadata for docker.io/library/debian:bullseye-20210902-slim 0.3s
=> [internal] load metadata for docker.io/hexpm/elixir:1.12.3-erlang-24.1.5-debian-bullseye-20210902-slim 0.4s
=> [internal] load build context 0.3s
=> => transferring context: 29.65kB 0.3s
=> [stage-1 1/6] FROM docker.io/library/debian:bullseye-20210902-slim@sha256:e3ed4be20c22a1358020358331d177aa2860632f25b21681d79204ace20455a6 0.0s
=> [builder 1/18] FROM docker.io/hexpm/elixir:1.12.3-erlang-24.1.5-debian-bullseye-20210902-slim@sha256:a2447f82e9125abb7edb0708eb51da638d8f40e7fda4b3ed5151bee883d476be 0.0s
=> CACHED [stage-1 2/6] RUN apt-get update -y && apt-get install -y libstdc++6 openssl libncurses5 locales && apt-get clean && rm -f /var/lib/apt/lists/*_* 0.0s
=> CACHED [stage-1 3/6] RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen 0.0s
=> CACHED [stage-1 4/6] WORKDIR /app 0.0s
=> CACHED [stage-1 5/6] RUN chown nobody /app 0.0s
=> CACHED [builder 2/18] RUN apt-get update -y && apt-get install -y build-essential git && apt-get clean && rm -f /var/lib/apt/lists/*_* 0.0s
=> CACHED [builder 3/18] WORKDIR /app 0.0s
=> CACHED [builder 4/18] RUN mix local.hex --force && mix local.rebar --force 0.0s
=> CACHED [builder 5/18] COPY .env_sample ./ 0.0s
=> CACHED [builder 6/18] COPY mix.exs mix.lock ./ 0.0s
=> CACHED [builder 7/18] RUN mix deps.get --only prod 0.0s
=> CACHED [builder 8/18] RUN mkdir config 0.0s
=> CACHED [builder 9/18] COPY config/config.exs config/prod.exs config/ 0.0s
=> CACHED [builder 10/18] RUN mix deps.compile 0.0s
=> CACHED [builder 11/18] COPY priv priv 0.0s
=> CACHED [builder 12/18] COPY assets assets 0.0s
=> CACHED [builder 13/18] RUN mix assets.deploy 0.0s
=> CACHED [builder 14/18] COPY lib lib 0.0s
=> CACHED [builder 15/18] RUN mix compile 0.0s
=> CACHED [builder 16/18] COPY config/runtime.exs config/ 0.0s
=> CACHED [builder 17/18] COPY rel rel 0.0s
=> CACHED [builder 18/18] RUN mix release 0.0s
=> CACHED [stage-1 6/6] COPY --from=builder --chown=nobody:root /app/_build/prod/rel/auth ./ 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:5ba72f99464041b84ed2e1491eb709c1518dff88852706229e8d82f0f3af2d4c 0.0s
=> => naming to registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02 0.0s
--> Building image done
==> Pushing image to fly
The push refers to repository [registry.fly.io/authprod]
7b5d131176f6: Layer already exists
4d5c3ee4879b: Layer already exists
0166b8c07570: Layer already exists
4dfd7d2f8a33: Layer already exists
11a55e4db61e: Layer already exists
ba5a5fe43301: Layer already exists
deployment-01HEADJTX39GT10NA7TY49HG02: digest: sha256:f5d586ae7234d29f8ef485ce6c06e5ec469f6ae66d77886459780fc172337147 size: 1575
--> Pushing image done
image: registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02
image size: 120 MB
Watch your deployment at https://fly.io/apps/authprod/monitoring
Running authprod release_command: /app/bin/migrate
-------
β release_command failed
-------
Error release_command failed running on machine 48ed037a343368 with exit code 1.
Check its logs: here's the last 100 lines below, or run 'fly logs -i 48ed037a343368':
Pulling container image registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02
Successfully prepared image registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02 (5.834199334s)
[ 0.041401] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
[ 0.042773] PCI: Fatal: No config space access function found
INFO Starting init (commit: 15238e9)...
INFO Preparing to run: `/app/bin/migrate` as nobody
INFO [fly api proxy] listening at /.fly/api
2023/11/03 11:02:57 listening on [fdaa:0:42c6:a7b:1be:99c2:78d9:2]:22 (DNS: [fdaa::3]:53)
11:02:58.710 [error] Postgrex.Protocol (#PID<0.136.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.authprodb.internal:5432): non-existing domain - :nxdomain
11:02:58.710 [error] Postgrex.Protocol (#PID<0.135.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.authprodb.internal:5432): non-existing domain - :nxdomain
11:02:59.943 [error] Postgrex.Protocol (#PID<0.136.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.authprodb.internal:5432): non-existing domain - :nxdomain
11:03:00.660 [error] Could not create schema migrations table. This error usually happens due to the following:
* The database does not exist
* The "schema_migrations" table, which Ecto uses for managing
migrations, was defined by another library
* There is a deadlock while migrating (such as using concurrent
indexes with a migration_lock)
To fix the first issue, run "mix ecto.create".
To address the second, you can run "mix ecto.drop" followed by
"mix ecto.create". Alternatively you may configure Ecto to use
another table and/or repository for managing migrations:
config :auth, Auth.Repo,
migration_source: "some_other_table_for_schema_migrations",
migration_repo: AnotherRepoForSchemaMigrations
The full error report is shown below.
** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2975ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
1. Ensuring your database is available and that you can connect to it
2. Tracking down slow queries and making sure they are running fast enough
3. Increasing the pool_size (although this increases resource consumption)
4. Allowing requests to wait longer by increasing :queue_target and :queue_interval
See DBConnection.start_link/2 for more information
(ecto_sql 3.7.2) lib/ecto/adapters/sql.ex:760: Ecto.Adapters.SQL.raise_sql_call_error/1
(elixir 1.12.3) lib/enum.ex:1582: Enum."-map/2-lists^map/1-0-"/2
(ecto_sql 3.7.2) lib/ecto/adapters/sql.ex:852: Ecto.Adapters.SQL.execute_ddl/4
(ecto_sql 3.7.2) lib/ecto/migrator.ex:678: Ecto.Migrator.verbose_schema_migration/3
(ecto_sql 3.7.2) lib/ecto/migrator.ex:504: Ecto.Migrator.lock_for_migrations/4
(ecto_sql 3.7.2) lib/ecto/migrator.ex:419: Ecto.Migrator.run/4
(ecto_sql 3.7.2) lib/ecto/migrator.ex:146: Ecto.Migrator.with_repo/3
(auth 1.6.6) lib/auth/release.ex:12: anonymous fn/2 in Auth.Release.migrate/0
INFO Main child exited normally with code: 1
WARN Reaped child process with pid: 359 and signal: SIGUSR1, core dumped? false
INFO Starting clean up.
WARN hallpass exited, pid: 306, status: signal: 15 (SIGTERM)
2023/11/03 11:03:01 listening on [fdaa:0:42c6:a7b:1be:99c2:78d9:2]:22 (DNS: [fdaa::3]:53)
[ 5.255192] reboot: Restarting system
machine restart policy set to 'no', not restarting
-------
Error: release command failed - aborting deployment. error release_command machine 48ed037a343368 exited with non-zero status of 1
fly deploy
definitely doesn't work because it cannot run the mix release
command because there is no Postgres
instance available to migrate. π
Posted Forum topic: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247
Snapshot:
Full text of my forum topic in case it gets deleted:
Opening this question in case anyone `else` is in the same boat (having their `Nomad`-based `Postgres` instance killed ...)
# Context
We deployed our [`auth`](https://github.com/dwyl/auth) (`Elixir/Phoenix`) app to Fly.io in 2022. :rocket:
It was a good experience and worked flawlessly until they killed it. :neutral_face:
# Fly.io App Disabled π
We received the following email from Fly.io:
![image|690x464](upload://xgMVmyDeZnevWHqqpE3bKQLJFHV.png)
## `Fulltext` of the email for Google indexing/findability π§
```md
Hi. Weβre in the process of removing Nomad from Fly.io. This means any apps that havenβt been upgraded to Apps V2 will no longer function.
Your app authprodb is one of those apps. Hereβs what this means for you:
1. authprodb is currently disabled.
- That means the Nomad instances (virtual machines)
that ran your app have been shut down, but that the appβs configuration still exists.
2. Your app is technically now on the βmachinesβ platform, rather than βnomad.β
- That means that any subsequent fly deploy on your app will attempt to launch Fly Machines.
- In many cases, a deploy will immediately work. In some cases, there may be some additional configuration required to get your app running on Machines.
The community forum is a great place to get help troubleshooting your deployment if you get stuck.
3. Nothing has been deleted.
- Your appβs Docker image still exists, and any Volumes that were attached to your app still exist.
Thanks for bearing with us throughout this whole migration saga. Itβs not been easy, but the results should soon speak for themselves.
If you have any questions or comments, weβve posted a similar announcement on the community forum.
Thanks, Your Friends at Fly.io
They key point of the email:
they shut down our Postgres
instance
and according to point 3 "Nothing has been deleted".
So I dived into trying to recover the database.
Tried re-deploying the authprod
App that was attached to the authprodb
Postgres instance:
fly deploy --verbose
It errors because the Elixir
App cannot run the mix release
command:
Error: release command failed - aborting deployment. error release_command machine 48ed037a343368 exited with non-zero status of 1
Full log output in: https://github.com/dwyl/auth/issues/325#issuecomment-1792265725
After a couple of hours searching and trying everything I could think of with zero progress. I'm opening this forum topic.
0 MB
?Viewing the volumes for the authprodb
, they have Used
0 MB
i.e. no data
https://fly.io/apps/authprodb/volumes
Luckily, we have a working "staging" version of the auth
App also hosted on Fly.io: https://authdemo.fly.dev/
and can easily compare the Volume: https://fly.io/apps/authdemo-db/volumes
we see that the authdemo-db
is using 182 MB
the available volume:
This suggests that our data was in fact deleted when Fly.io disabled our App.
If anyone has seen this issue and can help us recover our Postgres
data, please comment! :pray:
Replied to the support email:
Ball is in their court. πΎ
I get the sinking feeling that the data for authprodb
is lost. π
Which is going to be a huge headache. π
But let's wait and see what they say. β³
AFK to help parents with something for the next couple of hours. π€
Meanwhile I've wasted my whole morning on this ... β³ π₯ π’
Followed-up with support@fly.io
by email. If the data for authprodb
is in fact gone, we need to move on. π
Got a couple of replies on the forum: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/5
Tried creating a fork from the second volume:
fly postgres create --fork-from authprodb:vol_1g67340ggznrydxw
Output:
? Select VM size: shared-cpu-1x - CPU Kind: Shared, vCPUs: 1 Memory: 256MB
Creating postgres cluster in organization dwyl-auth-546
Creating app...
Setting secrets on app authprodb3...
Provisioning 1 of 1 machines with image flyio/postgres-flex:15.3@sha256:44b698752cf113110f2fa72443d7fe452b48228aafbb0d93045ef1e3282360a6
Waiting for machine to start...
Machine e82d377a069728 is created
==> Monitoring health checks
Waiting for e82d377a069728 to become healthy (started, 3/3)
Postgres cluster authprodb3 created
Username: postgres
Password: redacted
Hostname: authprodb3.internal
Flycast: fdaa:0:42c6:0:1::7
Proxy port: 5432
Postgres port: 5433
Connection string: postgres://postgres:redacted@authprodb3.flycast:5432
Save your credentials in a secure place -- you won't be able to see them again!
Connect to postgres
Any app within the DWYL Auth organization can connect to this Postgres using the above connection string
Now that you've set up Postgres, here's what you need to understand: https://fly.io/docs/postgres/getting-started/what-you-should-know/
Connect:
fly postgres connect -a authprodb3
\l
List of databases
Name | Owner | Encoding | Collate | Ctype | ICU Locale | Locale Provider | Access privileges
-----------+----------+----------+------------+------------+------------+-----------------+-----------------------
postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 | | libc |
repmgr | repmgr | UTF8 | en_US.utf8 | en_US.utf8 | | libc |
template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | | libc | =c/postgres +
| | | | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | | libc | =c/postgres +
| | | | | | | postgres=CTc/postgres
(4 rows)
I replied on the thread.
Got a reply on the forum (but still nothing in email ...) https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/8
Only saw it now ... didn't get a notification from the forum. We need a "Reminders" feature.
Going to try the command now:
fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region CHOOSE --org your-org --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_yaddayaddayadda
Re-ran the command to check the snapshots list
:
fly volume snapshots list vol_g2yxp4mllj6v63qd
Snapshots
ID SIZE CREATED AT
vs_vYXP8wx6ezM53CR 96479939 11 hours ago
vs_vnAvN76g36Apwt2A 96479939 1 day ago
vs_azyMXoAexXOG9tA7 96479939 2 days ago
vs_JjO3bbl7nbV2oSX5 96479939 3 days ago
vs_VD9Zm3LRVqQ6lfz7 96479939 4 days ago
vs_gnkGR2jXV7yM3sqN 96479939 5 days ago
Using the latest snapshot: vs_vYXP8wx6ezM53CR
fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_vYXP8wx6ezM53CR
Creating app...
Setting secrets on app new-db...
Restoring 1 of 1 machines with image registry-1.docker.io/flyio/postgres:13
Waiting for machine to start...
Gotta go be a Dad for a few hours. BRB. β³
fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_vYXP8wx6ezM53CR
Creating postgres cluster in organization dwyl-auth-546
Creating app...
Setting secrets on app new-db...
Restoring 1 of 1 machines with image registry-1.docker.io/flyio/postgres:13
Waiting for machine to start...
Machine e784434c420ed8 is created
==> Monitoring health checks
Waiting for e784434c420ed8 to become healthy (started, 0/3)
Error: context deadline exceeded
great. going to have to run the command again ...
Posted a comment on the forum topic to this effect: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/9
Not much else
I can do on this right now.
Other than creating a new empty DB and starting again. π
Given the error above, I had deleted new-db
to avoid incurring unnecessary charges.
So I had to re-run the command to re-create new-db
:
fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_jXlNXQLxgLDeXclV
But sadly, Fly.io is not fully operational right now: https://fly.io/dashboard/dwyl-auth/status
So the response is:
Creating app...
Error: server returned a non-200 status code: 504
Will need to revisit this in the morning. β³
At home now and still getting error trying to revive the DB:
Creating postgres cluster in organization dwyl-auth-546
Creating app...
Setting secrets on app new-db...
Restoring 1 of 1 machines with image registry-1.docker.io/flyio/postgres:13
Waiting for machine to start...
Machine 784e679b453408 is created
==> Monitoring health checks
Waiting for 784e679b453408 to become healthy (started, 0/3)
Error: context deadline exceeded
But when visiting the dashboard, we see that the new-db
was created:
https://fly.io/dashboard/dwyl-auth-546
Replied on the thread: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/16?u=dwyl_auth
On the advice of the Daniel https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/18 I created a new DB instance to attempt to recover the data ...
https://fly.io/dashboard/dwyl-auth-546
It didn't work and the thread was automatically closed without any resolution. π’ But the instance is still running and costing us a small fortune!! So I'm killing it now:
The prod
instance is still not recovered. π’
Really wish I had someone to pair on this ...
Going to attempt to host the Postgres
on DigitalOcean
https://github.com/dwyl/learn-devops/issues/90 with the Auth
App still running on Fly.io
ref: https://github.com/dwyl/learn-devops/issues/93
Meanwhile our Auth App has a mountain of credits: https://fly.io/dashboard/dwyl-auth-546
And since the first $5/month
is "free", we aren't consuming them.
So I might bump the App up a notch once the DB transferred to DigitalOcean
.
But just to say I'm tackling this now! β³ (or at least investing an hour into this and seeing how far I get ...)
fly auth login
Opening https://fly.io/app/auth/cli/6c617a6637656868646571 ...
Waiting for session... Done
successfully logged in as nelsonic
fly status
App
Name = authprod
Owner = dwyl-auth-546
Hostname = authprod.fly.dev
Image = authprod:deployment-01GK3WCNQ8Q374
Machines
PROCESS ID VERSION REGION STATE ROLE CHECKS LAST UPDATED
app 6e82d924ad3742 58 lhr started 1 total, 1 passing 2023-11-03T07:20:03Z
All running. π
Going to detach the (empty) DB: https://fly.io/docs/postgres/managing/attach-detach/#detach-a-fly-app
Example:
fly postgres detach <postgres app name> --app <app name>
Ours:
fly postgres detach authprodb --app authprod
Got:
Error: no 6pn ips founds for authprodb app
Didin't expect it to be easy ... π
But what is this
error? π€·ββοΈ
Exact error message for forum title/topic: https://community.fly.io/t/6pn-ips-are-not-static-re-fly-postgres-last-thurs/20022
Reading now ... β³
Turns out the authprod
App is not connected to the authprodb
:
https://fly.io/docs/postgres/managing/attach-detach/#check-if-a-fly-app-is-attached-to-a-postgres-app
fly postgres users list --app authprodb
Error: no active leader found
Fly Postgres is Total Garbage. ποΈ π’
Why aren't these status checks all automatic?! π€¦ββοΈ
Surely the Web Interface could inform us that the health of the Postgres
is "no active leader". π€·ββοΈ
Interrupted by children. π§ β€οΈ So no further progress will be made on this today. Wish I could clone myself and be a good parent and still work on "side projects" ... π
==> ecto
Compiling 56 files (.ex)
warning: missing parentheses for expression following "do:" keyword.
Parentheses are required to solve ambiguity inside keywords.
This error happens when you have function calls without parentheses inside keywords. For example:
function(arg, one: nested_call a, b, c)
function(arg, one: if expr, do: :this, else: :that)
In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call".
Or if the keywords "do" and "else" apply to the function "function" or "if".
You can solve this by explicitly adding parentheses:
function(arg, one: if(expr, do: :this, else: :that))
function(arg, one: nested_call(a, b, c))
Ambiguity found at:
β
568 β do: raise ArgumentError, "cast/3 expects a list of atom keys, got: `#{inspect key}`"
β ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
β
ββ lib/ecto/changeset.ex:568
warning: missing parentheses for expression following "do:" keyword.
Parentheses are required to solve ambiguity inside keywords.
This error happens when you have function calls without parentheses inside keywords. For example:
function(arg, one: nested_call a, b, c)
function(arg, one: if expr, do: :this, else: :that)
In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call".
Or if the keywords "do" and "else" apply to the function "function" or "if".
You can solve this by explicitly adding parentheses:
function(arg, one: if(expr, do: :this, else: :that))
function(arg, one: nested_call(a, b, c))
Ambiguity found at:
β
445 β do: raise ArgumentError,
β ~~~~~~~~~~~~~~~~~~~~~~~~
β
ββ lib/ecto/association.ex:445
warning: missing parentheses for expression following "do:" keyword.
Parentheses are required to solve ambiguity inside keywords.
This error happens when you have function calls without parentheses inside keywords. For example:
function(arg, one: nested_call a, b, c)
function(arg, one: if expr, do: :this, else: :that)
In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call".
Or if the keywords "do" and "else" apply to the function "function" or "if".
You can solve this by explicitly adding parentheses:
function(arg, one: if(expr, do: :this, else: :that))
function(arg, one: nested_call(a, b, c))
Ambiguity found at:
β
2904 β raise ArgumentError, "cannot add constraint to changeset because it does not have a source, got: #{inspect data}"
β ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
β
ββ lib/ecto/changeset.ex:2904
warning: missing parentheses for expression following "do:" keyword. Parentheses are required to solve ambiguity inside keywords.
This error happens when you have function calls without parentheses inside keywords. For example:
function(arg, one: nested_call a, b, c)
function(arg, one: if expr, do: :this, else: :that)
In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call".
Or if the keywords "do" and "else" apply to the function "function" or "if".
You can solve this by explicitly adding parentheses:
function(arg, one: if(expr, do: :this, else: :that))
function(arg, one: nested_call(a, b, c))
Ambiguity found at:
β
2906 β raise ArgumentError, "cannot add constraint because a changeset was not supplied, got: #{inspect item}"
β ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
β
ββ lib/ecto/changeset.ex:2906
warning: atom ::: must be written between quotes, as in :"::", to avoid ambiguity
β
260 β {:::, _, [left, _]} -> is_integer(left) or is_binary(left)
β ~
β
ββ lib/ecto/query/builder.ex:260:8
warning: missing parentheses for expression following "do:" keyword.
Parentheses are required to solve ambiguity inside keywords.
This error happens when you have function calls without parentheses inside keywords. For example:
function(arg, one: nested_call a, b, c)
function(arg, one: if expr, do: :this, else: :that)
In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call".
Or if the keywords "do" and "else" apply to the function "function" or "if".
You can solve this by explicitly adding parentheses:
function(arg, one: if(expr, do: :this, else: :that))
function(arg, one: nested_call(a, b, c))
Ambiguity found at:
β
621 β do: raise ArgumentError, "a changeset with action #{inspect given} was given to #{inspect repo}.#{action}/2"
β ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
β
ββ lib/ecto/repo/schema.ex:621
== Compilation error in file lib/ecto/query.ex ==
** (Kernel.TypespecError) lib/ecto/query.ex:428: type dynamic/0 is a built-in type and it cannot be redefined
(elixir 1.17.2) lib/kernel/typespec.ex:969: Kernel.Typespec.compile_error/2
(stdlib 5.2.3) lists.erl:1599: :lists.foldl_1/3
(elixir 1.17.2) lib/kernel/typespec.ex:226: Kernel.Typespec.translate_typespecs_for_module/2
could not compile dependency :ecto, "mix compile" failed. Errors may have been logged above.
You can recompile this dependency with "mix deps.compile ecto --force",
update it with "mix deps.update ecto" or clean it with "mix deps.clean ecto"
I obviously tried updating ecto
but that didn't do anything. π
Is anyone else seeing an Internal Server Error on https://auth.dwyl.com/ ??
Todo
authprod
app. π€