Open JoeCarlson opened 4 years ago
I've been able to reproduce this on a local VM running Scientific Linux 6.
It seems from the log files that the Citus maintenance daemon is failing and restarting continuously. Every few seconds I see messages of it starting up and then exiting with code 1. From what I can see, this exit code indicates that the daemon either thinks the postmaster is dead or has detected a configuration change.
Not sure if this is significant, but I see a message
<date> DEBUG: mmap(174063616) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
Is there a requirement for HugePage support? Is this the difference between the distros that's killing me? What should /proc/meminfo have for HugePages and/or Shmem?
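For anyone wanting to compare these fields between distros, here is a minimal C sketch that pulls individual values out of /proc/meminfo. The function names meminfo_lookup and read_meminfo are hypothetical and are not part of Citus or PostgreSQL. It assumes the usual "Key:   value [kB]" line layout. Note that with PostgreSQL's default huge_pages = try setting, a failed MAP_HUGETLB mmap falls back to regular pages, so the DEBUG line above is likely harmless by itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find `key` (e.g. "HugePages_Total") in
 * meminfo-style text and store its numeric value in *value.
 * Returns 0 on success, -1 if the key is absent or has no number.
 */
int meminfo_lookup(const char *text, const char *key, long *value)
{
    assert(text != NULL && key != NULL && value != NULL);

    size_t keylen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0')
    {
        /* Require the ':' so "Shmem" does not match "ShmemHugePages". */
        if (strncmp(line, key, keylen) == 0 && line[keylen] == ':')
            return sscanf(line + keylen + 1, "%ld", value) == 1 ? 0 : -1;

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}

/*
 * Hypothetical helper: read /proc/meminfo into buf (NUL-terminated).
 * Returns the number of bytes read, or -1 if the file cannot be opened.
 */
long read_meminfo(char *buf, size_t size)
{
    if (buf == NULL || size == 0)
        return -1;

    FILE *fp = fopen("/proc/meminfo", "r");
    if (fp == NULL)
        return -1;

    size_t n = fread(buf, 1, size - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return (long) n;
}
```

A caller would fill a buffer with read_meminfo and then call meminfo_lookup for "HugePages_Total", "HugePages_Free", "Hugepagesize" and "Shmem" on each machine being compared.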
This definitely requires some more debugging. We can reproduce this in another environment as well. Below is an internal discussion about this:
The RHEL 6 server is still running. I added your public keys to its SSH authorized_keys.
You can get in like this:
ssh -i ~/.ssh/foo user@ip
What I’ve done so far on the server is the Single-Machine Cluster install steps (using the 9.3 package though).
To see the problem in action, run this:
sudo su - postgres
export PATH=$PATH:/usr/lib/postgresql/12/bin
psql -p 9700 -c "SELECT * from master_add_node('localhost', 9701);"
It looks like this has the same root cause as https://github.com/citusdata/citus/pull/3812, but through a different code path.
When I debug the issue, at this line, which is right after MultiConnectionStatePoll() -> PQconnectPoll(), the socket of the connection becomes -1 (from a valid socket).
We should rebuild the wait event sets if the socket changes. However, I'm unclear why the socket becomes -1, and whether it would become a valid descriptor again in the next iteration. In any case, it's probably much better to report a regular connection error with the root cause.
(And we should be careful about adding a -1 socket to the wait event set in WaitEventSetFromMultiConnectionStates.)
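To illustrate why a -1 socket matters here, the following is a minimal, Linux-only C sketch. It is not Citus code, and the helper names try_register and register_socket_checked are hypothetical. It shows that epoll_ctl rejects an invalid descriptor with EBADF, which is consistent with the epoll_ctl error reported in this issue, and how a caller could check the socket first and report a connection error instead.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: register fd for read readiness on epfd.
 * Returns 0 on success, or the errno set by epoll_ctl on failure.
 */
int try_register(int epfd, int fd)
{
    struct epoll_event ev = {0};

    ev.events = EPOLLIN;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) != 0)
        return errno;
    return 0;
}

/*
 * Hypothetical guard: refuse a negative socket before it ever reaches
 * epoll_ctl. Returns 0 if the socket was registered, -1 otherwise, in
 * which case the caller should report an ordinary connection error.
 */
int register_socket_checked(int epfd, int sock)
{
    assert(epfd >= 0);

    if (sock < 0)
        return -1;              /* connection lost or never established */

    return try_register(epfd, sock) == 0 ? 0 : -1;
}
```

With an epoll instance from epoll_create1(0), try_register(epfd, -1) returns EBADF, while register_socket_checked(epfd, -1) returns -1 without calling epoll_ctl at all. The same kind of check could be applied to the socket returned after each PQconnectPoll() call before the wait event set is built or rebuilt.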
@onderkalaci Did you get this problem only when you installed it from the packages? I tried installing 9.3.5 from source on Red Hat 6 and didn't get any problem. However, if I install it from packages, I also get the problem. One difference between the package and source builds is the security compilation flags.
Probably related to https://github.com/citusdata/citus/issues/4105
Scientific Linux is a distro out of Fermilab that is used in several national labs. It is based on RHEL. I have been unsuccessful in running Citus: the installation script was successful, but executing SQL commands that communicate between the coordinator and workers results in messages:
Looking at the output of strace confirms there is an error in a call to epoll_ctl. It certainly looks like a networking issue but it's not a port blockage in our network and sestatus tells me SELinux is disabled.