NearNodeFlash / NearNodeFlash.github.io

View this document https://nearnodeflash.github.io/
Apache License 2.0

Data Movement: Possible overwrite on copy out #122

Closed bdevcich closed 4 months ago

bdevcich commented 5 months ago

When performing a copy out (from either directives or the copy offload API), there is potential for users to overwrite their data if they do not use a naming scheme for their data that is unique to each compute node (e.g. flux/MPI rank, hostname).

This example illustrates what happens today.

In this example, let's use a system with 2 rabbits and 4 compute nodes, where each rabbit serves 2 of the compute nodes. The user is using all 4 nodes for their flux job and wants to move their data out from gfs2 to global lustre. The application they are running across the computes simply dumps out a file named data.out. With 4 compute nodes, that's 4 data.out files.

Since each filesystem is unique for each compute node, the data movement effectively needs to merge those 4 filesystems into one destination location.

From the rabbit's perspective, these locations are mounted at:

rabbit-node1:

/mnt/nnf/a8e02956-de88-472d-9a5b-a64e4a54d568-0/0/ -> compute-node1
/mnt/nnf/a8e02956-de88-472d-9a5b-a64e4a54d568-0/1/ -> compute-node2

rabbit-node2:

/mnt/nnf/a8e02956-de88-472d-9a5b-a64e4a54d568-0/0/ -> compute-node3
/mnt/nnf/a8e02956-de88-472d-9a5b-a64e4a54d568-0/1/ -> compute-node4

The extra 0/ and 1/ directories are the index mount directories that map to each compute node.

So for this workflow, the $DW_JOB_MY_GFS2 variable would point at /mnt/nnf/a8e02956-de88-472d-9a5b-a64e4a54d568-0 without the trailing index mount directories. Since this value is used across all the compute nodes, it must remain the same.
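The path resolution above can be sketched in a few lines of Python. The node names, UUID, and the `rabbit_mount_path` helper are illustrative, not part of the real software; the point is that the shared `$DW_JOB_MY_GFS2` prefix plus the per-compute index mount directory gives the location on the rabbit:

```python
# Illustrative sketch: the shared prefix all computes see, plus the
# (rabbit, index) pair that backs each compute node in this example.
DW_JOB_MY_GFS2 = "/mnt/nnf/a8e02956-de88-472d-9a5b-a64e4a54d568-0"

compute_to_mount = {
    "compute-node1": ("rabbit-node1", 0),
    "compute-node2": ("rabbit-node1", 1),
    "compute-node3": ("rabbit-node2", 0),
    "compute-node4": ("rabbit-node2", 1),
}

def rabbit_mount_path(compute: str) -> str:
    """Path on the rabbit where this compute node's filesystem is mounted."""
    _rabbit, index = compute_to_mount[compute]
    return f"{DW_JOB_MY_GFS2}/{index}"

print(rabbit_mount_path("compute-node3"))
# -> /mnt/nnf/a8e02956-de88-472d-9a5b-a64e4a54d568-0/0
```

Note that `rabbit_mount_path("compute-node1")` and `rabbit_mount_path("compute-node3")` return the same string, even though they live on different rabbits; that is the root of the problem described below.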


If the user's copy_out directive looked like this:

$DW copy_out source=$DW_JOB_MY_GFS2 destination=/lus/global/user/my-job/

Then the resulting destination would look like this:

/lus/global/user/my-job/0/data.out
/lus/global/user/my-job/1/data.out

2 of the 4 data.out files are missing.

Since we are instructing dcp to copy everything out of `$DW_JOB_MY_GFS2`, the index mount directories come with it. And since the indexes are the same across the rabbits, 2 of those files get overwritten.
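The collision can be demonstrated with a small Python sketch (the node names and paths are the illustrative ones from this example, not real API calls): because only the index directory survives relative to `$DW_JOB_MY_GFS2`, four source files map to only two destination paths.

```python
# One source file per compute node, identified by its (rabbit, index) pair.
sources = [
    ("rabbit-node1", 0, "data.out"),
    ("rabbit-node1", 1, "data.out"),
    ("rabbit-node2", 0, "data.out"),
    ("rabbit-node2", 1, "data.out"),
]

destination = "/lus/global/user/my-job"

# Only the index directory is preserved relative to $DW_JOB_MY_GFS2, so the
# rabbit name is lost and indexes collide across rabbits.
dest_paths = {f"{destination}/{index}/{name}" for _rabbit, index, name in sources}

print(len(sources), "source files ->", len(dest_paths), "destination paths")
# 4 source files -> 2 destination paths: two files get silently overwritten
```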


Our initial thoughts are to create some sort of unique identifier that combines the rabbit/compute relationship. That could then be appended to the destination to ensure that no data can be clobbered.

The simplest approach would be to use the rabbit hostname and the index directory. Appending this directory to the destination would produce the following when using the same destination=/lus/global/user/my-job/ from above:

/lus/global/user/my-job/rabbit-node-1-0/data.out
/lus/global/user/my-job/rabbit-node-1-1/data.out
/lus/global/user/my-job/rabbit-node-2-0/data.out
/lus/global/user/my-job/rabbit-node-2-1/data.out

I realize that the user most likely does not care about the rabbit hostname and index directory, but it would prevent users from overwriting their data when that data is not uniquely named per compute node.
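The same sketch with the proposed `<rabbit-hostname>-<index>` prefix shows that every compute node now gets a distinct destination directory (again, names and paths are the illustrative ones from this example):

```python
# One source file per compute node, identified by its (rabbit, index) pair.
sources = [
    ("rabbit-node-1", 0, "data.out"),
    ("rabbit-node-1", 1, "data.out"),
    ("rabbit-node-2", 0, "data.out"),
    ("rabbit-node-2", 1, "data.out"),
]

destination = "/lus/global/user/my-job"

# Combine rabbit hostname and index into a per-compute directory name,
# so indexes can no longer collide across rabbits.
dest_paths = {f"{destination}/{rabbit}-{index}/{name}" for rabbit, index, name in sources}

assert len(dest_paths) == len(sources)  # no two files share a destination
for path in sorted(dest_paths):
    print(path)
```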

bdevcich commented 5 months ago

This can be replicated by the following (assuming a system with 2 rabbit nodes and 2 computes on each):

flux run -l -N4 --setattr=dw="#DW jobdw type=gfs2 capacity=10GiB name=my-gfs2 \
    #DW copy_out source=\$DW_JOB_my_gfs2 destination=/lus/global/user/my-job/ profile=no-xattr" \
    bash -c 'fallocate -l1G $DW_JOB_my_gfs2/data.out'

Which results in:

$ cd /lus/global/user/my-job
$ tree
.
|-- 0
|   `-- data.out
`-- 1
    `-- data.out

2 directories, 2 files
bdevcich commented 4 months ago

This issue is fixed via https://github.com/NearNodeFlash/nnf-sos/pull/257. Each index mount directory now contains the rabbit name, so you get a unique index mount directory for every compute node:

$ cd /lus/global/user/my-job
$ tree
.
├── rabbit-node-1-0
│   └── data.out
├── rabbit-node-1-1
│   └── data.out
├── rabbit-node-2-0
│   └── data.out
└── rabbit-node-2-1
    └── data.out

These directories don't mean much to the user, but this protects against any potential data loss.

Note: the my-job/ directory must already exist in the /lus/global/user/ directory. This is discussed in another issue: https://github.com/NearNodeFlash/NearNodeFlash.github.io/issues/130