ECP-VeloC / shuffile

Move (shuffle) files on local storage associated with MPI ranks when re-running MPI jobs that map ranks to nodes differently
MIT License
0 stars, 3 forks

test failure #18

Open: Alessandro-Barbieri opened this issue 3 years ago

Alessandro-Barbieri commented 3 years ago

I've packaged shuffile for the Gentoo GURU overlay, but our CI failed while running its tests: https://bugs.gentoo.org/784647

Build log: https://784647.bugs.gentoo.org/attachment.cgi?id=701388
Test log: https://784647.bugs.gentoo.org/attachment.cgi?id=701391

SHUFFILE 0.0.4: rank 2 on localhost: Opening file: open(/dev/shm/testfile_2.out) errno=2 No such file or directory @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile_io.c:112
SHUFFILE 0.0.4 ERROR: rank 0 on localhost: chmod(/dev/shm/testfile_2.out) failed: errno=2 No such file or directory @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:553
After second migrate, READ IN 16 bytes
After second migrate, READ IN data from rank 0
data = data from rank 0, rank=2
SHUFFILE 0.0.4 ERROR: rank 0 on localhost: chown(/dev/shm/testfile_2.out, 250, 250) failed: errno=2 No such file or directory @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:570
SHUFFILE 0.0.4 ERROR: rank 0 on localhost: Error stat'ing file /dev/shm/testfile_2.out: errno=2 No such file or directory @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile_io.c:545
SHUFFILE 0.0.4 ERROR: rank 0 on localhost: stat(/dev/shm/testfile_2.out) failed: errno=2 No such file or directory @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:594
SHUFFILE 0.0.4 ERROR: rank 0 on localhost: Failed to change timestamps on `/dev/shm/testfile_2.out' utimensat() errno=2 No such file or directory @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:628
SHUFFILE 0.0.4 ERROR: rank 2 on localhost: shuffile_create comm or comm_storage parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:383
SHUFFILE 0.0.4 ERROR: rank 2 on localhost: shuffile_migrate comm or comm_storage parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:686
SHUFFILE 0.0.4 ERROR: rank 2 on localhost: shuffile_remove name parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:434
SHUFFILE 0.0.4 ERROR: rank 2 on localhost: shuffile_create comm or comm_storage parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:383
SHUFFILE 0.0.4 ERROR: rank 2 on localhost: shuffile_migrate comm or comm_storage parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:686
SHUFFILE 0.0.4 ERROR: rank 2 on localhost: shuffile_remove name parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:434
Error in line 135, file /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/test/test1.c, function main.
Error opening read file /dev/shm/testfile_2.out: 2 No such file or directory
After second migrate, READ IN 16 bytes
After second migrate, READ IN data from rank 1
data = data from rank 1, rank=1
SHUFFILE 0.0.4 ERROR: rank 1 on localhost: shuffile_create comm or comm_storage parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:383
SHUFFILE 0.0.4 ERROR: rank 1 on localhost: shuffile_migrate comm or comm_storage parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:686
SHUFFILE 0.0.4 ERROR: rank 1 on localhost: shuffile_remove name parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:434
SHUFFILE 0.0.4 ERROR: rank 1 on localhost: shuffile_create comm or comm_storage parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:383
SHUFFILE 0.0.4 ERROR: rank 1 on localhost: shuffile_migrate comm or comm_storage parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:686
SHUFFILE 0.0.4 ERROR: rank 1 on localhost: shuffile_remove name parameter is MPI_COMM_NULL @ /var/tmp/portage/sys-cluster/shuffile-0.0.4/work/shuffile-0.0.4/src/shuffile.c:434
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

gonsie commented 2 years ago

Does Gentoo have a /dev/shm directory? Our Linux tests assume this directory exists.

Alessandro-Barbieri commented 2 years ago

Yes, it exists.