polardb shared storage file-dio:///var/polardb/shared_datadir is unavailable

qwe123520 commented 7 months ago

Describe the problem

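Starting polardb-pg as a single node in Docker fails after I modify the configuration file, with the error "polardb shared storage file-dio:///var/polardb/shared_datadir is unavailable". The configuration file is as follows: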

postgresql.txt

...

polardb-bot[bot] commented 7 months ago

Hi @qwe123520 ~ Thanks for opening this issue! 🎉

Please make sure you have provided enough information for subsequent discussion.

We will get back to you as soon as possible. ❤️

qwe123520 commented 7 months ago

The error log is as follows:

2024-04-26 15:31:50.066 CST [14] [14] LOG: forked new process, pid is 16, true pid is 16
2024-04-26 15:31:50.066 CST [14] [14] LOG: forked new process, pid is 17, true pid is 17
2024-04-26 15:31:50.078 CST [14] [14] LOG: polardb try start vfs process
2024-04-26 15:31:50.078 CST [14] [14] LOG: pfs in localfs mode
2024-04-26 15:31:50.081 CST [14] [14] FATAL: polardb shared storage file-dio:///var/polardb/shared_datadir is unavailable.
2024-04-26 15:31:50.081 CST [14] [14] BACKTRACE:
/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres(elog_finish+0x1fd) [0x555e31bde55d]
/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres(+0x7db1ae) [0x555e31a4d1ae]
/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres(PostmasterMain+0xf53) [0x555e319dbf63]
/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres(main+0x830) [0x555e316bacf0]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6ace30cd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6ace30ce40]
/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres(_start+0x25) [0x555e316ca6d5]
2024-04-26 15:31:50.202 CST [14] [14] LOG: database system is shut down

mrdrivingduck commented 7 months ago

@qwe123520 What is your docker startup command?

qwe123520 commented 7 months ago

I am using the image "polardb/polardb_pg_local_instance" and have not configured any additional startup command.

mrdrivingduck commented 7 months ago

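@qwe123520 This is not about the image itself; it is about how the container is started from the image. That is why I am asking which command you used to start the container. What happens if you start the container with the commands below?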

docker pull polardb/polardb_pg_local_instance
docker run -it --rm polardb/polardb_pg_local_instance psql

qwe123520 commented 7 months ago

I started it with this command:

docker run -d --name polardb -v /data/polardb/:/var/polardb/ polardb/polardb_pg_local_instance

qwe123520 commented 7 months ago

docker run -it --rm polardb/polardb_pg_local_instance psql

It only fails when I use -v to mount a host directory.

mrdrivingduck commented 7 months ago

> I started it with this command: docker run -d --name polardb -v /data/polardb/:/var/polardb/ polardb/polardb_pg_local_instance

Does the host directory /data/polardb/ exist, and is it non-empty?

qwe123520 commented 7 months ago

Yes, it exists and is non-empty.

mrdrivingduck commented 7 months ago

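> Yes, it exists and is non-empty.

The container needs to be started with a directory that exists and is empty. When the container startup script finds the directory empty, it runs initdb to create the data directories in it. If the script finds the directory non-empty, it brings up the database from the data directories hard-coded in the script, which fails if the existing contents are unrelated files.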
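As a rough illustration of the first-start case described above, the host directory can be checked before it is handed to the container. This is only a sketch: `is_empty_dir` and `first_start_polardb` are hypothetical helper names, and the `docker run` line reuses the command quoted earlier in this thread.

```shell
# Sketch only: check that a host directory exists and is empty before the
# first start. is_empty_dir and first_start_polardb are illustrative names.
is_empty_dir() {
    dir=$1
    [ -d "$dir" ] || return 2          # missing, or not a directory
    # ls -A lists every entry except . and ..; no output means empty
    [ -z "$(ls -A "$dir")" ]
}

first_start_polardb() {
    host_dir=$1
    if is_empty_dir "$host_dir"; then
        # Same command as quoted above, with the host path parameterized.
        docker run -d --name polardb \
            -v "$host_dir":/var/polardb \
            polardb/polardb_pg_local_instance
    else
        echo "not initializing: $host_dir is missing or not empty" >&2
        return 1
    fi
}
```

A directory that was populated by an earlier successful start is a different case; the check above only guards the very first initialization.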

qwe123520 commented 7 months ago

This directory was created by an earlier startup. I then modified postgresql.conf, and after that the instance would not start.

Mr-TTWang commented 6 months ago

@mrdrivingduck Please come and answer this.

Mr-TTWang commented 6 months ago

The problem only appears after modifying the postgresql.conf inside the directory, and I do not see how that relates to shared_datadir. Please help resolve this.

Hurry, please.

Mr-TTWang commented 6 months ago

Also, even restoring the previous conf contents does not help. It seems the file cannot be touched at all.

mrdrivingduck commented 6 months ago

@qwe123520 @SamirWell

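What exactly did you change? Could you provide a diff?
Given the startup command, /data/polardb/ should contain several directories such as primary_dir/. In each directory, check current_logfiles to find the name of the error log, then look at the last entries in that log.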
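The log lookup described above could be scripted roughly as follows. This sketch assumes the standard PostgreSQL layout of current_logfiles, where each line is a log destination followed by a path that is relative to the data directory unless it is absolute. `show_last_log_lines` is a hypothetical helper name.

```shell
# Sketch only: for every data directory under a base directory, read
# current_logfiles and print the tail of each log it points to.
# show_last_log_lines is an illustrative name.
show_last_log_lines() {
    base_dir=$1
    lines=${2:-20}
    for marker in "$base_dir"/*/current_logfiles; do
        [ -f "$marker" ] || continue
        data_dir=$(dirname "$marker")
        # Each line has the form: <destination> <path>
        while read -r dest path || [ -n "$dest" ]; do
            [ -n "$path" ] || continue
            case "$path" in
                /*) log_file=$path ;;             # absolute path
                *)  log_file=$data_dir/$path ;;   # relative to the data directory
            esac
            [ -f "$log_file" ] || continue
            echo "== $log_file ($dest) =="
            tail -n "$lines" "$log_file"
        done < "$marker"
    done
}
```

For the setup in this thread the call would be something like `show_last_log_lines /data/polardb 50` on the host. Absolute paths recorded inside the container only resolve when the function runs inside the container.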

Mr-TTWang commented 6 months ago

2024-05-22 17:56:18.943 CST [20] [20] LOG: vfs_unlink file-dio:///var/polardb/shared_datadir/polar_flog/flashback_log.history.tmp
2024-05-22 17:56:18.944 CST [20] [20] LOG: vfs_rename from file-dio:///var/polardb/shared_datadir/polar_flog/flashback_log.history.tmp to file-dio:///var/polardb/shared_datadir/polar_flog/flashback_log.history
2024-05-22 17:56:18.944 CST [20] [20] LOG: The flashback log will switch from 0/877E0 to 0/10000000
2024-05-22 17:56:18.944 CST [20] [20] LOG: The flashback log shared buffer is ready now, the current point(position) is 0/10000000(0/FF3FFF0), previous point(position) is 0/0(0/0), initalized upto point is 0/10000000
2024-05-22 17:56:18.945 CST [20] [20] LOG: enable persisted slot, read slot from polarstore.
2024-05-22 17:56:18.945 CST [20] [20] LOG: vfs open dir pg_replslot, num open dir 1
2024-05-22 17:56:18.945 CST [20] [20] LOG: vfs open dir file-dio:///var/polardb/shared_datadir/pg_replslot, num open dir 1
2024-05-22 17:56:18.945 CST [20] [20] LOG: vfs_unlink file-dio:///var/polardb/shared_datadir/pg_replslot/replica1/state.tmp
2024-05-22 17:56:18.946 CST [20] [20] LOG: restore slot replica1 with version 10002, replay_lsn is 0/1BA24B8, restart_lsn is 0/1752788
2024-05-22 17:56:18.946 CST [20] [20] LOG: vfs_unlink file-dio:///var/polardb/shared_datadir/pg_replslot/replica2/state.tmp
2024-05-22 17:56:18.946 CST [20] [20] LOG: restore slot replica2 with version 10002, replay_lsn is 0/1BA24B8, restart_lsn is 0/1752788
2024-05-22 17:56:18.946 CST [20] [20] LOG: vfs open dir pg_replslot, num open dir 1
2024-05-22 17:56:18.946 CST [20] [20] LOG: vfs open dir file-dio:///var/polardb/shared_datadir/pg_twophase, num open dir 1
2024-05-22 17:56:18.946 CST [20] [20] LOG: database system was not properly shut down; automatic recovery in progress
2024-05-22 17:56:18.946 CST [20] [20] LOG: state is 4
2024-05-22 17:56:18.965 CST [19] [19] LOG: polar_flog_index log index is insert from 28
2024-05-22 17:56:19.023 CST [19] [19] WARNING: The flashback log record at 0/895F0 will be ignore. and switch to 0/10000028
2024-05-22 17:56:19.023 CST [19] [19] LOG: Recover the flashback logindex to 0/10000000
2024-05-22 17:56:19.362 CST [21] [21] PANIC: polardb shared storage is unavailable.
2024-05-22 17:56:19.362 CST [21] [21] BACKTRACE:
postgres(5432): polar worker process (+0x3fdc5e) [0x560ccc2d4c5e]
/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/polar_worker.so(polar_worker_handler_main+0xd6) [0x7fdf24745ff6]
postgres(5432): polar worker process (StartBackgroundWorker+0x2d7) [0x560ccc629517]
postgres(5432): polar worker process (+0x76441c) [0x560ccc63b41c]
postgres(5432): polar worker process (+0x765dbe) [0x560ccc63cdbe]
postgres(5432): polar worker process (PostmasterMain+0xd4c) [0x560ccc640d5c]
postgres(5432): polar worker process (main+0x830) [0x560ccc31fcf0]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fdf231fed90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fdf231fee40]
postgres(5432): polar worker process (_start+0x25) [0x560ccc32f6d5]

Mr-TTWang commented 6 months ago

I set max_connections = 2000 in the conf file in every directory.

Mr-TTWang commented 6 months ago

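Actually, what I just did was initialize the database with docker without starting it, then set the maximum number of connections to 2000 in all of the configuration files inside.

Starting docker after that worked fine.

But when I restarted the container once more, it failed. There must be another cause; it looks like a remount problem: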


inline int
polar_mount(void)
{
    int ret = 0;
    if (polar_vfs[polar_vfs_switch].vfs_mount)
        ret = polar_vfs[polar_vfs_switch].vfs_mount();
    if (polar_enable_io_fencing && ret == 0)
    {
        /* POLAR: FATAL when shared storage is unavailable, or force to write RWID. */
        if (polar_shared_storage_is_available())
        {
            polar_hold_shared_storage(false);
            POLAR_IO_FENCING_SET_STATE(polar_io_fencing_get_instance(), POLAR_IO_FENCING_WAIT);
        }
        else
            elog(FATAL, "polardb shared storage %s is unavailable.", polar_datadir);
    }
    return ret;
}

inline int
polar_remount(void)
{
    int ret = 0;
    if (polar_vfs[polar_vfs_switch].vfs_remount)
        ret = polar_vfs[polar_vfs_switch].vfs_remount();
    if (polar_enable_io_fencing && ret == 0)
    {
        /* POLAR: FATAL when shared storage is unavailable, or force to write RWID. */
        if (polar_shared_storage_is_available())
        {
            polar_hold_shared_storage(true);
            POLAR_IO_FENCING_SET_STATE(polar_io_fencing_get_instance(), POLAR_IO_FENCING_WAIT);
        }
        else
            elog(FATAL, "polardb shared storage %s is unavailable.", polar_datadir);
    }
    return ret;
}

Mr-TTWang commented 6 months ago

@mrdrivingduck Could you test this scenario?

mrdrivingduck commented 6 months ago

I tested the following scenario and did not find any problem:

$ mkdir polardb_pg
$ docker run -it --rm \
    --env POLARDB_PORT=5432 \
    --env POLARDB_USER=u1 \
    --env POLARDB_PASSWORD=your_password \
    -v ./polardb_pg:/var/polardb \
    polardb/polardb_pg_local_instance \
    echo 'done'

## edit max_connections in three postgresql.conf files

$ docker run -d \
    -p 54320-54322:5432-5434 \
    -v ./polardb_pg:/var/polardb \
    polardb/polardb_pg_local_instance

36c196cd8cb3e7b3dfcd2b9268409377462ee42caf95289080ce20f17ab45f61

$ docker exec -it 36c196cd8cb3e7b3dfcd2b9268409377462ee42caf95289080ce20f17ab45f61 bash
$ ps -ef
$ exit

$ docker stop 36c196cd8cb3e7b3dfcd2b9268409377462ee42caf95289080ce20f17ab45f61
36c196cd8cb3e7b3dfcd2b9268409377462ee42caf95289080ce20f17ab45f61

$ docker run -d \
    -p 54320-54322:5432-5434 \
    -v ./polardb_pg:/var/polardb \
    polardb/polardb_pg_local_instance

cdbffcd6b3e6e2f55ac98ee61bfd48ac185db624f5142f3dfc7a0f920ac7a154

$ docker exec -it cdbffcd6b3e6e2f55ac98ee61bfd48ac185db624f5142f3dfc7a0f920ac7a154 bash
$ ps -ef
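
The "edit max_connections" step in the scenario above is done by hand. A scripted version might look like the sketch below. It assumes each data directory sits directly under the mounted host directory and contains its own postgresql.conf. `set_max_connections` is a hypothetical helper name.

```shell
# Sketch only: set max_connections in every postgresql.conf found one level
# below a base directory. set_max_connections is an illustrative name.
set_max_connections() {
    base_dir=$1
    value=$2
    find "$base_dir" -maxdepth 2 -name postgresql.conf | while IFS= read -r conf; do
        tmp="$conf.tmp.$$"
        # Rewrite an existing setting, whether active or commented out.
        sed -E "s/^[[:space:]]*#?[[:space:]]*max_connections[[:space:]]*=.*/max_connections = $value/" \
            "$conf" > "$tmp"
        # Append the setting if the file did not contain it at all.
        grep -q '^max_connections = ' "$tmp" || printf 'max_connections = %s\n' "$value" >> "$tmp"
        # Write back in place so ownership and permissions are preserved.
        cat "$tmp" > "$conf" && rm -f "$tmp"
    done
}
```

For the scenario above the call would be `set_max_connections ./polardb_pg 2000`, run between the initialization step and the first `docker run -d`.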

Mr-TTWang commented 6 months ago

Could it be because I am deploying on k3s?

mrdrivingduck commented 6 months ago

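> Could it be because I am deploying on k3s?

You need to check whether /var/polardb/shared_datadir is accessible from inside the container and whether the files in it are as expected. Also make sure the volume is not mounted by more than one container.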
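On a plain Docker host, the "not mounted by more than one container" check could be approximated as below. This is a sketch: `containers_mounting` is a hypothetical helper name, it only sees containers managed by the local Docker daemon (not pods run by k3s through containerd), and it does not handle paths containing spaces.

```shell
# Sketch only: list running containers that bind-mount a given host
# directory. containers_mounting is an illustrative name.
containers_mounting() {
    host_dir=${1%/}    # drop a trailing slash so /data/polardb/ also matches
    docker ps -q | while IFS= read -r id; do
        # Prints the container name followed by the source of each mount.
        docker inspect --format '{{.Name}}{{range .Mounts}} {{.Source}}{{end}}' "$id"
    done | awk -v dir="$host_dir" '{
        for (i = 2; i <= NF; i++) if ($i == dir) { print $1; break }
    }'
}
```

If `containers_mounting /data/polardb/` prints more than one name, two containers share the directory. Accessibility from inside a container can then be checked with `docker exec <container> ls -la /var/polardb/shared_datadir`.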

Mr-TTWang commented 6 months ago

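>> Could it be because I am deploying on k3s?

> You need to check whether /var/polardb/shared_datadir is accessible from inside the container and whether the files in it are as expected. Also make sure the volume is not mounted by more than one container.

With a rolling upgrade on k3s or k8s, there is a time window in which the volume is mounted by two pods at once, so the instance would crash, right?

I just retested with a delayed restart and it still fails o(╥﹏╥)o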

mrdrivingduck commented 6 months ago

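>> Could it be because I am deploying on k3s?

> You need to check whether /var/polardb/shared_datadir is accessible from inside the container and whether the files in it are as expected. Also make sure the volume is not mounted by more than one container.

> With a rolling upgrade on k3s or k8s, there is a time window in which the volume is mounted by two pods at once, so the instance would crash, right?

The polardb_pg_local_instance image is a demo that runs a shared-storage cluster on a single machine. It contains a simple entrypoint script that manages the cluster, so that it can be brought up and tried out quickly. If you have external cluster management and storage management, they will conflict with the entrypoint script running inside the image. I recommend using the pure binary image polardb/polardb_pg_binary to integrate with cluster management tools; it contains no management script.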

Mr-TTWang commented 6 months ago

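In my final test I ran the following before restarting:

rm -f $shared_datadir/DEATH

and that fixed it. So this should make it suitable for single-node deployment on k8s/k3s, right?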

mrdrivingduck commented 6 months ago

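> In my final test I ran the following before restarting:
> rm -f $shared_datadir/DEATH
> and that fixed it. So this should make it suitable for single-node deployment on k8s/k3s, right?

The existence of this file means that at least two database instances were started on the same data directory. That is a problem.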
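
Given that explanation, a pre-start check that reports the marker instead of silently deleting it may be the safer pattern. The sketch below relies only on the file name DEATH mentioned in this thread. `check_death_marker` is a hypothetical helper name, not something shipped with PolarDB.

```shell
# Sketch only: refuse to continue when the DEATH marker exists, so the
# double start gets investigated rather than hidden.
# check_death_marker is an illustrative name.
check_death_marker() {
    shared_datadir=$1
    if [ -e "$shared_datadir/DEATH" ]; then
        echo "DEATH marker found in $shared_datadir:" \
             "another instance may have started on this data directory" >&2
        return 1
    fi
    return 0
}
```

A wrapper script could call `check_death_marker /var/polardb/shared_datadir || exit 1` before launching the database, and leave removal of the file as a deliberate manual step once it is confirmed that only one instance uses the directory.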

ApsaraDB / PolarDB-for-PostgreSQL

polardb shared storage file-dio:///var/polardb/shared_datadir is unavailable #503