NearNodeFlash / NearNodeFlash.github.io


Data Movement: different behavior when destination directories do not exist #130

Closed bdevcich closed 2 months ago

bdevcich commented 4 months ago

When doing something like the following, where the my-job/ directory does not exist at /lus/global/user/ (the destination of the copy_out directive):

flux run -l -N4 --setattr=dw="#DW jobdw type=gfs2 capacity=10GiB name=my-gfs2 \
    #DW copy_out source=\$DW_JOB_my_gfs2 destination=/lus/global/user/my-job/ profile=no-xattr" \
    bash -c 'fallocate -l1G $DW_JOB_my_gfs2/data.out'

The resulting copy out operation can look like this:

$ cd /lus/global/user/my-job/
$ tree
.
├── data.out
├── rabbit-node-1-1
│   └── data.out
├── rabbit-node-2-0
│   └── data.out
└── rabbit-node-2-1
    └── data.out

We get 4 files in total, but one of them sits at the root level of my-job/ and its index mount directory (i.e. rabbit-node-1-0) did not get copied over. In this case it's no harm, no foul (but confusing), since all of the data.out files are present. However, this can also result in something like this:

$ cd /lus/global/user/my-job/
$ tree
.
├── data.out
├── rabbit-node-2-0
│   └── data.out
└── rabbit-node-2-1
    └── data.out

Not good.

To break this down, the job/workflow will end up creating 4 NnfDatamovements in the DataOut state since it is run on 4 compute nodes. That is, there are 4 different data movement operations, each moving one compute node's data from the rabbit to the global Lustre filesystem. Those 4 data movements will run (almost) in parallel. With 2 computes per rabbit, they would look something like this:

dcp /mnt/nnf/140d8ac6-4012-4e06-a08e-7ec6bbb65f7d-0/rabbit-node-1-0 /lus/global/user/my-job/
dcp /mnt/nnf/140d8ac6-4012-4e06-a08e-7ec6bbb65f7d-0/rabbit-node-1-1 /lus/global/user/my-job/
dcp /mnt/nnf/140d8ac6-4012-4e06-a08e-7ec6bbb65f7d-0/rabbit-node-2-0 /lus/global/user/my-job/
dcp /mnt/nnf/140d8ac6-4012-4e06-a08e-7ec6bbb65f7d-0/rabbit-node-2-1 /lus/global/user/my-job/

One (or more) of these operations will win the race. On that first pass, the my-job directory does not exist, so the first dcp operation is essentially a directory copy:

cp -r src/ my-job/

Since my-job doesn't exist, the contents of the source are copied directly into the my-job directory. This is the reason for the lone data.out at the root level above.

Afterwards, each subsequent dcp operation performs the same request, but the directory now exists, which results in the index mount directory itself being copied over.
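
The same mechanism is easy to see with plain cp on a local filesystem; a quick illustration (the src/node-0 layout here is made up for the example):

$ mkdir -p src/node-0 dest && touch src/node-0/data.out
$ cp -r src/node-0 dest/my-job   # my-job does not exist yet: node-0 is effectively renamed to my-job/
$ cp -r src/node-0 dest/my-job   # my-job now exists: node-0 is copied into it as my-job/node-0/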

bdevcich commented 4 months ago

I think this becomes a problem of understanding dcp behavior and how we handle it. You can replicate this same scenario on a local filesystem:

$ tree
.
|-- dest
`-- src
    |-- node-0
    |   `-- data.out
    `-- node-1
        `-- data.out

4 directories, 2 files

$ dcp src/node-0 dest/my-job
...

$ tree
.
|-- dest
|   `-- my-job
|       `-- data.out
`-- src
    |-- node-0
    |   `-- data.out
    `-- node-1
        `-- data.out

5 directories, 3 files

$ dcp src/node-1 dest/my-job
...

$ tree
.
|-- dest
|   `-- my-job
|       |-- data.out
|       `-- node-1
|           `-- data.out
`-- src
    |-- node-0
    |   `-- data.out
    `-- node-1
        `-- data.out

6 directories, 4 files
bdevcich commented 4 months ago

This raises the question: should the destination directories be required to preexist? If you were to go one level deeper, dcp would fail:

dcp src/node-0 dest/my-job/rank0
[2024-02-16T20:34:50] [0] [/deps/mpifileutils/src/common/mfu_param_path.c:582] ERROR: Destination parent directory is not writable `/home/mpiuser/dest/my-job' (errno=2 No such file or directory)
[2024-02-16T20:34:50] [0] [/deps/mpifileutils/src/dcp/dcp.c:479] ERROR: Invalid src/dest paths provided. Exiting run: MFU_ERR(-1001)

If this were a really long-running job with a copy_out, it would be unfortunate for the user to get all the way through the job and only then find out in DataOut that dcp failed with a No such file or directory error.

bdevcich commented 4 months ago

Might be related: https://github.com/hpc/mpifileutils/issues/416

bdevcich commented 4 months ago

As discussed in the Flux meeting today, we need to investigate doing two things:

  1. Validate the permissions on the destination path before PreRun (so that the mkdir is successful)
  2. Do a mkdir right before the copy (or bake it into dcp as an option)

Creating the directory to ensure the destination exists is a required workaround for how dcp works: the directory needs to exist before the data movement operation is attempted.

We are in agreement that we need a way to head off user mistakes early on; otherwise, the data movement could fail during CopyOut. That mkdir could fail if the user does not have permission to create the directory in the given location.

Things could also change between steps 1 and 2, for example if the user application changes the directory structure - not exactly sure what we can do there. We can't guard against everything the user application is capable of.
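
As a rough sketch of what steps 1 and 2 could amount to at the shell level, using the paths from the example above (illustrative only; the real implementation would live in nnf-dm):

# Step 1 (before PreRun): check that the destination parent is writable by the user,
# so that a later mkdir/copy cannot fail on permissions.
test -w /lus/global/user/ || echo "copy_out destination parent is not writable"

# Step 2 (right before the copy): create the destination directory so that each dcp
# copies its index mount directory into it instead of "renaming" the first one.
mkdir -p /lus/global/user/my-job/
dcp /mnt/nnf/140d8ac6-4012-4e06-a08e-7ec6bbb65f7d-0/rabbit-node-1-0 /lus/global/user/my-job/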

Additional idea: lost+found. If data movement fails due to destination issues, move the data to a lost+found directory on the global Lustre filesystem. This destination would be defined by an administrator. The workflow name, user, etc. could all be used to separate the data within lost+found.


After further discussion on our end, we've decided to explore implementing lost+found alongside step 2. If step 2 or the data movement itself hits any issues, lost+found will be used to salvage the data. Given how much can happen between steps 1 and 2, we believe performing validation up front is unnecessary. With lost+found in place, users will still have the option to retrieve their data if they run into any issues.
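
For the lost+found piece, the fallback destination could be composed from details the workflow already knows. A purely illustrative sketch, assuming a hypothetical administrator-defined root of /lus/global/lost+found (not a real path in any configuration yet):

# Hypothetical fallback if the copy to the user-supplied destination fails:
#   <admin-defined root>/<user>/<workflow name>/
mkdir -p /lus/global/lost+found/user/my-workflow/
dcp /mnt/nnf/140d8ac6-4012-4e06-a08e-7ec6bbb65f7d-0/rabbit-node-1-0 /lus/global/lost+found/user/my-workflow/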

bdevcich commented 2 months ago

A mkdir step has been added, along with logic to ensure that index mount directories are created on the destination when copying out from gfs2/xfs filesystems.

https://github.com/NearNodeFlash/nnf-dm/pull/167
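
With that change, the copy out from the original example should land every index mount directory under the destination. The expected layout (assuming the same four-compute example from the top of this issue):

$ tree /lus/global/user/my-job/
/lus/global/user/my-job/
├── rabbit-node-1-0
│   └── data.out
├── rabbit-node-1-1
│   └── data.out
├── rabbit-node-2-0
│   └── data.out
└── rabbit-node-2-1
    └── data.out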

bdevcich commented 2 months ago

Closing this; #151 has been opened for the lost+found part.