GlobalArrays / ga

Partitioned Global Address Space (PGAS) library for distributed arrays
http://hpc.pnl.gov/globalarrays/

Fix gethostid returning the same val on different hosts #258

Closed fancer closed 3 weeks ago

fancer commented 2 years ago

HPC host IDs can be improperly initialized. In that case gethostid() may return the same value on different hosts, causing a GA runtime error like:

_malloc_semaphore: sem_open: No such file or directory
[0] Received an Error in Communication: (-2) _malloc_semaphore: sem_open

One of our systems misbehaves this way: its /etc/hostid files are initialized with 8-character strings that differ only in the last four characters, while gethostid() reads only the first four bytes and expects them to form a unique host ID. As a result, the Global Arrays MPI PT/PR/MT setups failed to run on more than one cluster node.
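To illustrate the collision, here is a minimal sketch that assumes, as described above, that the ID is taken from the first four bytes of the file content. The helper name id_from_first_four_bytes and the sample strings are hypothetical and are not taken from the patch or from any libc.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: mimic "use the first four bytes as the host ID".
 * Everything after the fourth byte is ignored. */
uint32_t id_from_first_four_bytes(const char *content)
{
    uint32_t id = 0;

    assert(content != NULL && strlen(content) >= sizeof id);
    memcpy(&id, content, sizeof id);
    return id;
}
```

With made-up contents such as "00ab0001" and "00ab0002", both calls return the same value, because the strings differ only in the ignored tail.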

We suggest fixing this by reading the hostid content locally and taking the whole string (up to 32 bytes) into account when calculating the host IDs of the nodes involved in GA communications. If no /etc/hostid file is found, the /etc/machine-id file is used instead (systemd initializes it during host boot). If neither file is found, gethostid() is used, which, besides checking for /etc/hostid, will try to calculate an IP-address-based host ID.
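The fallback order described above could be sketched roughly as follows. This is not the code from the patch: the function names, the HOST_ID_MAX_BYTES macro, and the use of an FNV-1a hash to fold the bytes into one integer are all assumptions made for illustration.

```c
#define _XOPEN_SOURCE 500   /* for gethostid() under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define HOST_ID_MAX_BYTES 32   /* "up to 32 bytes", as proposed above */

/* Fold all bytes into one 64-bit value with FNV-1a.
 * The hash choice is illustrative, not what the patch necessarily uses. */
uint64_t hash_bytes(const unsigned char *buf, size_t len)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 1099511628211ULL;               /* FNV prime */
    }
    return h;
}

/* Read up to HOST_ID_MAX_BYTES from path into buf.
 * Returns the number of bytes read, or 0 if the file is missing or empty. */
size_t read_id_file(const char *path, unsigned char *buf)
{
    FILE *fp = fopen(path, "rb");
    size_t n;

    if (fp == NULL)
        return 0;
    n = fread(buf, 1, HOST_ID_MAX_BYTES, fp);
    fclose(fp);
    return n;
}

/* Try each candidate file in order; fall back to gethostid() if none works. */
uint64_t host_id_from_paths(const char *const *paths, size_t npaths)
{
    unsigned char buf[HOST_ID_MAX_BYTES];

    for (size_t i = 0; i < npaths; i++) {
        size_t n = read_id_file(paths[i], buf);
        if (n > 0)
            return hash_bytes(buf, n);
    }
    return (uint64_t)gethostid();
}

/* Order from the description: /etc/hostid, then /etc/machine-id. */
uint64_t compute_host_id(void)
{
    static const char *const paths[] = { "/etc/hostid", "/etc/machine-id" };

    return host_id_from_paths(paths, sizeof paths / sizeof paths[0]);
}
```

Because every byte read contributes to the result, two files that share their first four bytes but differ later produce different IDs, which is the property the proposal relies on.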

Note: this change has been tested on a system with 8-character /etc/hostid content and GA built with the --with-mpi-pt and --with-mpi-pr options.

edoapra commented 2 years ago

@fancer The pull request should be made from your develop branch against develop

fancer commented 2 years ago

@edoapra alas I can't change the source branch of the PR. So what do you suggest then? Create a new PR (then this one will need to be closed) or rebase my master branch on top of the original develop branch?

fancer commented 2 years ago

> @fancer The pull request should be made from your develop branch against develop

@edoapra I force-pushed my master branch with the changes from your develop branch and rebased my patch on top of it. After that I changed the base branch of this PR to develop. Now the PR looks the way you requested, except that the source branch is still master since I can't change it. Once you merge this PR I'll force-reset my master branch back to its normal state.

jeffhammond commented 3 weeks ago

@ajaypanyala does this fix the Polaris issue you mentioned?

ajaypanyala commented 3 weeks ago

I have not tried this PR on Polaris. I just realized I do not have a Polaris allocation anymore to test this.

ajaypanyala commented 3 weeks ago

This will be fixed in a future PR (probably using a different solution).