BREAKING CHANGE: drop python launcher (use binary launcher)

reliveyy commented 3 years ago

This PR removes the utils container usage in setup.sh and replaces it with the binary launcher from the xud-launcher project.

How to test?

Running the launcher branch on any major platform brings up the familiar setup flow in your terminal.

On Linux and macOS:

$ export BRANCH=launcher
$ bash xud.sh --proxy.image=exchangeunion/proxy:latest__launcher --xud.image=exchangeunion/xud:1.2.4__launcher setup

On Windows:

>set BRANCH=launcher
>powershell.exe xud.ps1 --proxy.image=exchangeunion/proxy:latest__launcher --xud.image=exchangeunion/xud:1.2.4__launcher setup

Build and test locally:

cd launcher
make
export NETWORK=simnet
./launcher setup
./launcher cleanup

N.B. We need to use the new proxy image because of the new "attach mode" and the newly introduced endpoints (/api/v1/info, /api/v1/backup, /api/v1/xud/changepass). We need a new 1.2.4 image for xud because of the backup location fix (removing /mnt/hostfs).

Todos

[x] Launcher attach mode (using WebSocket) -> full featured xud-ctl console in proxy
- /api/v1/info
- /api/v1/setup-status
[x] Fix GitHub actions
[x] ~~Implement xud.ps1~~
[x] Sync with latest boltz (1.2.2)
[x] Sync with latest connext (aeb14b49)
[x] Release proxy:1.3.0 (attach mode & new endpoints)
[ ] Release xud:1.2.4-1 (backup to fixed dir in container /root/backup)
[x] Modify mainnet xud and proxy version before merge

Bugs

[ ] Simnet connext container exited
[ ] Windows Docker engine stops responding a while after setup (the reason is still unclear :()
- The command docker info returns "Bad response from Docker engine"
- The Docker dashboard shows "Docker is running"
[x] Boltz shows status "btc down; ltc down" after setup

reliveyy commented 3 years ago

The lnd services block while syncing, with status "Syncing 0.00% (0/665740)". Here are some related log lines:

Jan 12 15:10:07.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit $AFF2FC5C6F793B6E147EB93C1897D6DDA49E54FD~Wix at 95.211.230.211. Retrying on a new circuit.
2021-01-12 15:10:08.759 [WRN] BTCN: mismatch at height 230000, expected 63cdbfbded0a1e310192676d2c482767ca014fc89c09d830637faa746bd969d8 got 1308d5cfc6462f877a5587fd77d7c1ab029d45e58d5175aaf8c264cee9bde760
2021-01-12 15:10:08.759 [WRN] BTCN: got error attempting to determine correct cfheader checkpoints: got mismatched checkpoints, trying again
2021-01-12 15:10:12.714 [WRN] BTCN: mismatch at height 230000, expected 63cdbfbded0a1e310192676d2c482767ca014fc89c09d830637faa746bd969d8 got 1308d5cfc6462f877a5587fd77d7c1ab029d45e58d5175aaf8c264cee9bde760
2021-01-12 15:10:12.714 [WRN] BTCN: Detected mismatch at index=229 for checkpoints!!!
2021-01-12 15:10:14.212 [WRN] BTCN: mismatch at height 230000, expected 63cdbfbded0a1e310192676d2c482767ca014fc89c09d830637faa746bd969d8 got 1308d5cfc6462f877a5587fd77d7c1ab029d45e58d5175aaf8c264cee9bde760
2021-01-12 15:10:14.212 [WRN] BTCN: got error attempting to determine correct cfheader checkpoints: got mismatched checkpoints, trying again
2021-01-12 15:10:17.236 [INF] BTCN: Lost peer 195.201.95.119:8333 (outbound)
Jan 12 15:10:25.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit $6F4E9FD00D4251D98BE96FB1AA546FE34676A95B~CalyxInstitute06 at 162.247.74.206. Retrying on a new circuit.
Jan 12 15:10:26.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit $CF4872108C9F6EB9E485B79AB35D1881F9698732~libreexit06 at 209.141.33.53. Retrying on a new circuit.
2021-01-12 15:10:27.244 [WRN] BTCN: mismatch at height 230000, expected 63cdbfbded0a1e310192676d2c482767ca014fc89c09d830637faa746bd969d8 got 1308d5cfc6462f877a5587fd77d7c1ab029d45e58d5175aaf8c264cee9bde760
2021-01-12 15:10:27.245 [WRN] BTCN: Detected mismatch at index=229 for checkpoints!!!
2021-01-12 15:10:28.857 [WRN] BTCN: mismatch at height 230000, expected 63cdbfbded0a1e310192676d2c482767ca014fc89c09d830637faa746bd969d8 got 1308d5cfc6462f877a5587fd77d7c1ab029d45e58d5175aaf8c264cee9bde760
2021-01-12 15:10:28.857 [WRN] BTCN: got error attempting to determine correct cfheader checkpoints: got mismatched checkpoints, trying again
Jan 12 15:10:32.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit $A53C46F5B157DD83366D45A8E99A244934A14C46~csailmitexit at 128.31.0.13. Retrying on a new circuit.
2021-01-12 15:10:33.016 [WRN] BTCN: mismatch at height 230000, expected 63cdbfbded0a1e310192676d2c482767ca014fc89c09d830637faa746bd969d8 got 1308d5cfc6462f877a5587fd77d7c1ab029d45e58d5175aaf8c264cee9bde760
2021-01-12 15:10:33.016 [WRN] BTCN: Detected mismatch at index=229 for checkpoints!!!
2021-01-12 15:10:34.425 [WRN] BTCN: mismatch at height 230000, expected 63cdbfbded0a1e310192676d2c482767ca014fc89c09d830637faa746bd969d8 got 1308d5cfc6462f877a5587fd77d7c1ab029d45e58d5175aaf8c264cee9bde760
2021-01-12 15:10:34.425 [WRN] BTCN: got error attempting to determine correct cfheader checkpoints: got mismatched checkpoints, trying again
2021-01-12 15:10:38.405 [WRN] BTCN: mismatch at height 230000, expected 63cdbfbded0a1e310192676d2c482767ca014fc89c09d830637faa746bd969d8 got 1308d5cfc6462f877a5587fd77d7c1ab029d45e58d5175aaf8c264cee9bde760
2021-01-12 15:10:38.405 [WRN] BTCN: Detected mismatch at index=229 for checkpoints!!!
2021-01-12 15:10:39.531 [WRN] BTCN: mismatch at height 230000, expected 63cdbfbded0a1e310192676d2c482767ca014fc89c09d830637faa746bd969d8 got 1308d5cfc6462f877a5587fd77d7c1ab029d45e58d5175aaf8c264cee9bde760
2021-01-12 15:10:39.531 [WRN] BTCN: got error attempting to determine correct cfheader checkpoints: got mismatched checkpoints, trying again
Jan 12 15:10:41.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit $0FF233C8D78A17B8DB7C8257D2E05CD5AA7C6B88~politkovskaja at 77.247.181.165. Retrying on a new circuit.

I believe this is due to the broken Tor connection: lnd persists some wrong cfheaders, so restarting doesn't help. The good news is that the Docker engine on Windows has not crashed for a long time.

raladev commented 3 years ago

Current status (not including points from first message):

[ ] "Syncing 0.00% (0/0)" + "Syncing 0.00% (0/665740)" problem: looks like a Tor connection problem. I noticed it even with the old utils; in such cases I just removed all data and started setup again. We need some hack for the Windows/Ubuntu installers (maybe force a restart of this container after some time if it has 0 synced blocks).
[ ] sync gets stuck after some time; noticed it today with lndbtc:
1. lndbtc sync stopped at 44.85%
2. I stopped getting sync updates after some time; xud-launcher just did nothing
[ ] proxy crashes after setting up the current env. Steps:
1. use xud-launcher to set up the env from scratch
2. remove the docker containers
3. run xud-launcher again

Actual result: Proxy crash

2021-01-13 12:46:38.769 [DEBUG] gin : [172.19.0.1] GET /static/js/2.5ad162db.chunk.js | 200 | 8ms
2021/01/13 12:46:39 http: TLS handshake error from 172.19.0.1:40628: remote error: tls: unknown certificate
2021-01-13 12:46:39.520 [DEBUG] : [SocketIO/2] CONNECT: RemoteAddr=172.19.0.1:40632
2021-01-13 12:46:40.035 [DEBUG] service.xud : Failed to get container status: container not found
2021-01-13 12:46:40.035 [DEBUG] gin : [172.19.0.1] GET /api/v1/status/xud | 200 | 1ms
2021-01-13 12:46:40.041 [DEBUG] ServiceManager : [Status] proxy: Ready
panic: interface conversion: interface is nil, not lnrpc.LightningClient
goroutine 349 [running]:
github.com/ExchangeUnion/xud-docker-api/service/lnd.(*RpcClient).getClient(...)
/src/service/lnd/rpc.go:55
github.com/ExchangeUnion/xud-docker-api/service/lnd.(*RpcClient).GetInfo(0xc0005535e0, 0xf91040, 0xc000406ae0, 0xc000170401, 0x11, 0x52658b)
/src/service/lnd/rpc.go:63 +0x45
github.com/ExchangeUnion/xud-docker-api/service/lnd.(*Service).GetStatus(0xc000400c00, 0xf91040, 0xc000406ae0, 0xf91040, 0xc000406ae0)
/src/service/lnd/lnd.go:149 +0xf8
github.com/ExchangeUnion/xud-docker-api/service.(*Manager).GetStatus.func1(0xc0003983c0, 0xf9bca0, 0xc000400c00, 0xc0003e6b40)
/src/service/manager.go:173 +0x119
created by github.com/ExchangeUnion/xud-docker-api/service.(*Manager).GetStatus
/src/service/manager.go:169 +0xd4

[ ] incorrect status of the connext container (socket hang up). xud reports connext connection errors, but the connext container is fine. Maybe we need to update the exchangeunion/xud:1.2.4__launcher image and the xud-docker PR.

reliveyy commented 3 years ago

@raladev

LND sync stuck at 0.00% ... LND sync stuck at any percentage (not limited to Windows; I reproduced it on Linux too)

I think these two blocking cases are due to a broken Tor connection, and LND has no way to recover from it. This blocking issue does not seem to be related to the binary launcher. So my draft idea is that we should leave it out of this PR and focus on this bug in another PR (we could try to separate Tor into its own service).

proxy crashes after setting up the current env.

Yes. I know the proxy service is fragile when other services are recreated or restarted. I will try to fix this today.

incorrect status of the connext container (socket hang up). xud reports connext connection errors, but the connext container is fine. Maybe we need to update the exchangeunion/xud:1.2.4__launcher image and the xud-docker PR.

(broken connext and boltz status) WIP

raladev commented 3 years ago

I think these two blocking cases are due to a broken Tor connection

The first one is definitely a Tor issue; I noticed it with the old utils flow. But IMO the second one is connected with xud-launcher, because I had not seen it before.

reliveyy commented 3 years ago

The first one is definitely a Tor issue; I noticed it with the old utils flow. But IMO the second one is connected with xud-launcher, because I had not seen it before.

@raladev Yes. It's new to us and suspicious. I'll keep an eye on this.

reliveyy commented 3 years ago

The connext non-ready status is because xud connects to connext port 8000 instead of 5040. The boltz "btc down; ltc down" status is because the Docker API ContainerExecAttach returns the error "unable to upgrade to tcp, received 200". However, docker exec mainnet_boltz_1 wrapper btc getinfo works.

reliveyy commented 3 years ago

There is another issue when I run boltz on Linux. The mapped .boltz data directory has an empty macaroons folder. But inside the container there are two macaroon files.

-rw-------    1 root     root           110 Jan 15 09:38 admin.macaroon
-rw-------    1 root     root            96 Jan 15 09:38 readonly.macaroon

I'm wondering why these two files cannot be mapped to the host filesystem.

kilrau commented 3 years ago

The connext non-ready status is because xud connects to connext port 8000 instead of 5040.

Somehow mainnet arm64 images seem to be vector then (amd64 works fine):

Connext info:
┌─────────┬────────────────────────────────────┐
│ Status  │ connect ECONNREFUSED 10.0.3.3:8000 │
├─────────┼────────────────────────────────────┤

Anyhow, can you take care of this? @erkarl

reliveyy commented 3 years ago

The mapped .boltz data directory has an empty macaroons folder.

It's because the files are only visible to root. You need to use "sudo" on the host system to see these files. And if you are a normal user you cannot share these files between two containers. That's a new problem!

reliveyy commented 3 years ago

And if you are a normal user you cannot share these files between two containers

We can use docker-compose named volumes to resolve this issue.

services:
  boltz:
    volumes:
      - boltz-data:/root/.boltz
  proxy:
    volumes:
      - boltz-data:/root/network/data/boltz

volumes:
  boltz-data:
    driver: local
    driver_opts:
      type: none
      device: ./data/boltz
      o: bind

reliveyy commented 3 years ago

I think it's the right decision to migrate from bind mounts to volumes. Here are the reasons from the official Docker docs:

Volumes are easier to back up or migrate than bind mounts.

You can manage volumes using Docker CLI commands or the Docker API.

Volumes work on both Linux and Windows containers.

Volumes can be more safely shared among multiple containers.

Volume drivers let you store volumes on remote hosts or cloud providers, to encrypt the contents of volumes, or to add other functionality.

New volumes can have their content pre-populated by a container.

Volumes on Docker Desktop have much higher performance than bind mounts from Mac and Windows hosts.

The only problem with using a "local" driver volume is that it ends up with two copies of the data, one in /var/lib/docker and another in your custom location. That's not acceptable for blockchain data. But there is a Docker volume driver plugin called "local-persist" that may fit our requirements.

reliveyy commented 3 years ago

We cannot use the "local-persist" plugin right now because it requires an extra daemon running on the host. So the realistic solution will be fixing the permissions of the boltz data files after they are created.

reliveyy commented 3 years ago

Lndbtc died quickly because of "unable to initialize neutrino backend: unable to create neutrino light client: tor host is unreachable"

lndbtc_1   | 2021-01-25 06:23:51,701 INFO exited: lnd (exit status 1; not expected)
lndbtc_1   | 2021-01-25 06:23:52,705 INFO spawned: 'lnd' with pid 2383
lndbtc_1   | [DEBUG] Enabling neutrino
lndbtc_1   | Waiting for lnd-bitcoin onion address...
lndbtc_1   | Onion address for lnd-bitcoin is iywbic3wi2woxqows7xsbym7l5dke7wm5qwixyvgt2pqwhjli2yjruad.onion
lndbtc_1   | 2021-01-25 06:23:52.803 [INF] LTND: Version: 0.11.1-beta commit=v0.11.1-beta, build=production, logging=default
lndbtc_1   | 2021-01-25 06:23:52.803 [INF] LTND: Active chain: Bitcoin (network=mainnet)
lndbtc_1   | 2021-01-25 06:23:52.804 [INF] LTND: Opening the main database, this might take a few minutes...
lndbtc_1   | 2021-01-25 06:23:52.806 [INF] LTND: Opening bbolt database, sync_freelist=false
lndbtc_1   | 2021-01-25 06:23:52.814 [INF] CHDB: Checking for schema update: latest_version=17, db_version=17
lndbtc_1   | 2021-01-25 06:23:52.817 [INF] LTND: Database now open (time_to_open=10.7891ms)!
lndbtc_1   | 2021-01-25 06:23:53.743 [ERR] LTND: unable to initialize neutrino backend: unable to create neutrino light client: tor host is unreachable
lndbtc_1   | 2021-01-25 06:23:53,744 INFO success: lnd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
lndbtc_1   | 2021-01-25 06:23:53.745 [INF] LTND: Shutdown complete
lndbtc_1   | unable to initialize neutrino backend: unable to create neutrino light client: tor host is unreachable
lndbtc_1   | 2021-01-25 06:23:53,747 CRIT uncaptured python exception, closing channel <POutputDispatcher at 140384854133872 for <Subprocess at 140384854336464 with name lnd in state RUNNING> (stderr)> (<class 'OSError'>:[Errno 29] Invalid seek [/usr/lib/python3.8/site-packages/supervisor/supervisord.py|runforever|220] [/usr/lib/python3.8/site-packages/supervisor/dispatchers.py|handle_read_event|270] [/usr/lib/python3.8/site-packages/supervisor/dispatchers.py|record_output|204] [/usr/lib/python3.8/site-packages/supervisor/dispatchers.py|_log|173] [/usr/lib/python3.8/site-packages/supervisor/loggers.py|info|327] [/usr/lib/python3.8/site-packages/supervisor/loggers.py|log|345] [/usr/lib/python3.8/site-packages/supervisor/loggers.py|emit|227] [/usr/lib/python3.8/site-packages/supervisor/loggers.py|doRollover|264])

kilrau commented 3 years ago

If Tor doesn't start, it's usually a permission issue

reliveyy commented 3 years ago

tor host is unreachable

It's not a Tor issue. I found that one of our Neutrino peers became invalid, and that makes lnd startup fail (although it shouldn't).

reliveyy commented 3 years ago

FYI, we are still getting the boltz status "btc down; ltc down" because of the Golang Docker SDK error "unable to upgrade to tcp, received 200". But the boltz wrapper getinfo actually works.

reliveyy commented 3 years ago

unable to upgrade to tcp, received 200

This boltz status issue has been fixed.

reliveyy commented 3 years ago

Another compatibility issue: cannot bring up proxy with an existing master mainnet.
