citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0

Incompatibility of Citus with Scientific Linux #3589

Open JoeCarlson opened 4 years ago

JoeCarlson commented 4 years ago

Scientific Linux is a distro from Fermilab that is used in several national labs. It is based on RHEL. I have been unable to run Citus on it: the installation script succeeded, but SQL commands that communicate between the coordinator and the workers fail with messages like:

# select run_command_on_workers($$select 1$$);
ERROR:  epoll_ctl() failed: No such file or directory
CONTEXT:  PL/pgSQL function run_command_on_workers(text,boolean) line 13 at RETURN QUERY

The strace output confirms that a call to epoll_ctl fails. It certainly looks like a networking issue, but it's not a port blockage in our network, and sestatus tells me SELinux is disabled.
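For context on the error text: per the epoll_ctl(2) man page, ENOENT ("No such file or directory") is what the kernel returns when a descriptor that is not registered with the epoll instance is modified or removed. Below is a minimal, Linux-only Python sketch of that condition using the standard `select.epoll` wrapper. It only illustrates the errno; it is not Citus or PostgreSQL code, and `modify_unregistered_fd` is a made-up name.

```python
import errno
import os
import select
import socket


def modify_unregistered_fd():
    """Return the errno raised when modifying an fd that was never registered."""
    ep = select.epoll()
    a, b = socket.socketpair()
    try:
        # `a` is a valid, open socket, but it was never added to `ep`,
        # so this issues epoll_ctl(EPOLL_CTL_MOD) for an unknown fd.
        ep.modify(a.fileno(), select.EPOLLIN)
    except OSError as exc:
        return exc.errno
    finally:
        a.close()
        b.close()
        ep.close()
    return None


if __name__ == "__main__":
    err = modify_unregistered_fd()
    print(errno.errorcode.get(err), "-", os.strerror(err))
```

So the message is consistent with epoll being asked about a socket it does not know, and it does not by itself point to a blocked port.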

colton-citus commented 4 years ago

I've been able to reproduce this on a local VM running Scientific Linux 6.

JoeCarlson commented 4 years ago

It seems from the log files that the Citus maintenance daemon is failing and restarting continuously. Every few seconds I see messages of it starting up and then exiting with code 1. From what I can see, this indicates that the daemon either thinks the postmaster is dead or has seen a configuration change.

JoeCarlson commented 4 years ago

Not sure if this is significant, but I see this message:

<date> DEBUG: mmap(174063616) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory

Is there a requirement for huge page support? Is this the difference between the distros that's killing me? What should /proc/meminfo show for HugePages and/or Shmem?

onderkalaci commented 4 years ago

This definitely requires some more debugging. We can reproduce this in another environment as well. Below is an internal discussion about this:

The RHEL 6 server is still running. I added your public keys to its SSH authorized_keys.

You can get in like this:

ssh -i ~/.ssh/foo user@ip

What I’ve done so far on the server is the Single-Machine Cluster install steps (using the 9.3 package though).

To see the problem in action, run this:

sudo su - postgres
export PATH=$PATH:/usr/lib/postgresql/12/bin
psql -p 9700 -c "SELECT * from master_add_node('localhost', 9701);"

It looks like we have the same root cause with this but a different code-path: https://github.com/citusdata/citus/pull/3812

When I debug the issue, at this line, which is right after MultiConnectionStatePoll() -> PQconnectPoll(), the socket of the connection becomes -1 (from a valid socket).

We should rebuild the wait event sets if the socket changes. However, I'm unclear why the socket becomes -1 and whether it would become valid again in the next iteration. Either way, it's probably much better to report a regular connection error with the root cause.

(And, we should be careful about adding -1 socket to waiteventset in WaitEventSetFromMultiConnectionStates)
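The idea described above (rebuild the registration when the socket changes, and never add a -1 socket) can be sketched roughly as follows. This is a hypothetical Python illustration using `select.epoll` on Linux. `ConnectionWatcher` and `sync` are made-up names, and this is not the Citus `WaitEventSetFromMultiConnectionStates` code or any PostgreSQL API.

```python
import select
import socket


class ConnectionWatcher:
    """Hypothetical helper keeping one epoll registration in sync with a socket."""

    def __init__(self):
        self.epoll = select.epoll()
        self.registered_fd = None

    def sync(self, current_fd, events=select.EPOLLIN | select.EPOLLOUT):
        """Call after every connection poll step, before waiting on the set."""
        if current_fd < 0:
            # The connection lost its socket: drop the stale entry and report
            # a plain connection error instead of an obscure epoll failure.
            self._forget()
            raise ConnectionError("connection has no valid socket")
        if current_fd != self.registered_fd:
            # The socket changed: rebuild the registration.
            self._forget()
            self.epoll.register(current_fd, events)
            self.registered_fd = current_fd
            return
        try:
            self.epoll.modify(current_fd, events)
        except FileNotFoundError:
            # Same fd number, but the old socket was closed and the number
            # reused; the kernel dropped the old entry, so register again.
            self.epoll.register(current_fd, events)

    def _forget(self):
        if self.registered_fd is not None:
            try:
                self.epoll.unregister(self.registered_fd)
            except OSError:
                pass  # the old fd is already closed or no longer registered
            self.registered_fd = None

    def close(self):
        self.epoll.close()
```

A caller would invoke `sync(fd)` with the connection's current socket after each poll step; a -1 socket then surfaces as a `ConnectionError` rather than as "epoll_ctl() failed".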

SaitTalhaNisanci commented 4 years ago

@onderkalaci Did you get this problem only when you installed it from the packages? I tried installing 9.3.5 from source on RHEL 6 and didn't get any problem. However, if I install it from packages, I also get the problem. One difference between the package and the source build is the security compilation flags.

onderkalaci commented 4 years ago

Probably related to https://github.com/citusdata/citus/issues/4105