flux-framework / flux-core

core services for the Flux resource management framework

Flux 0.57.0, `flux filemap get` fails if file was mapped with multiple chunks #5655

Closed: ardangelo closed this issue 9 months ago

ardangelo commented 9 months ago

Running Flux 0.57.0, I am attempting to map a file using flux filemap:

File information

$ du $PWD/cti/install/libexec/cti_be_daemon2.18.3
2456    /g/g11/dangelo3/cti/install/libexec/cti_be_daemon2.18.3
$ sha1sum $PWD/cti/install/libexec/cti_be_daemon2.18.3
8868b9a0f785cef702825033dcc0ee138a2d3b70  /g/g11/dangelo3/cti/install/libexec/cti_be_daemon2.18.3

Mapping with the default 1M chunk size causes get to fail:

$ flux filemap map -C $PWD/cti/install/libexec cti_be_daemon2.18.3
$ flux filemap get cti_be_daemon2.18.3
flux-filemap: cti_be_daemon2.18.3: error loading offset=0 size=1048576 from sha1-459929af132509bbcc513878cb6360ed04c2ad6c: mapped file content has changed

Mapping with unlimited chunk size allows get to succeed:

$ flux filemap map --chunksize=0 -C $PWD/cti/install/libexec cti_be_daemon2.18.3
$ flux filemap get cti_be_daemon2.18.3
$ # No error
grondo commented 9 months ago

I'm not sure, but could this be because you are calling flux filemap get on the same node on which you mapped the file? When you call get with multiple chunks, the first chunk may be written to the destination (which here is also the source) before the remaining chunks are read, corrupting the map. With a single chunk the file may not actually change, since the data is written in one go.

Try flux filemap get with -C pointing to a different directory and see if that resolves the issue.
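Something like the following, where /tmp/extract is only an illustrative destination (this assumes get accepts -C the same way map does):

$ flux filemap get -C /tmp/extract cti_be_daemon2.18.3
$ sha1sum /tmp/extract/cti_be_daemon2.18.3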

grondo commented 9 months ago

Ah, @ardangelo, I apologize. I didn't notice you were already using -C in flux filemap map.

I think your usage should work, and FWIW, I could not reproduce the error with a simple test case:

$ flux version
commands:           0.57.0
libflux-core:       0.57.0
libflux-security:   0.10.0
broker:             0.57.0
FLUX_URI:           local:///var/tmp/grondo/flux-TUaqej/local-0
build-options:      +ascii-only+systemd+hwloc==2.8.0+zmq==4.3.4
$ ls -lh map
total 0
-rw-r--r-- 1 grondo grondo 10M Jan  4 16:41 10M
$ flux filemap map -C $PWD/map 10M
$ flux filemap list
10M
$ flux filemap get 10M
$ ls -lh 10M
-rw-r--r-- 1 grondo grondo 10M Jan  4 16:51 10M
$ flux filemap list --raw | jq
{
  "path": "10M",
  "encoding": "blobvec",
  "data": [
    [
      0,
      1048576,
      "sha1-3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3"
    ],
    [
      1048576,
      1048576,
      "sha1-3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3"
    ],
    [
      2097152,
      1048576,
      "sha1-3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3"
    ],
    [
      3145728,
      1048576,
      "sha1-3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3"
    ],
    [
      4194304,
      1048576,
      "sha1-3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3"
    ],
    [
      5242880,
      1048576,
      "sha1-3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3"
    ],
    [
      6291456,
      1048576,
      "sha1-3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3"
    ],
    [
      7340032,
      1048576,
      "sha1-3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3"
    ],
    [
      8388608,
      1048576,
      "sha1-3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3"
    ],
    [
      9437184,
      1048576,
      "sha1-3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3"
    ]
  ],
  "size": 10485760,
  "mtime": 1704415313,
  "ctime": 1704415382,
  "mode": 33188
}
ardangelo commented 9 months ago

Thanks for checking. It seems to be an intermittent problem on the system I'm testing on. I'm able to work around it by pulling the file only on ranks other than the one where it was mapped.

grondo commented 9 months ago

Thanks for reporting back. The error:

flux-filemap: cti_be_daemon2.18.3: error loading offset=0 size=1048576 from sha1-459929af132509bbcc513878cb6360ed04c2ad6c: mapped file content has changed

does seem to indicate that the mapped file content is changing somehow, but in your case I can't imagine how, and I'm not sure how the broker detects the changed content. Perhaps @garlick will have some thoughts here.

garlick commented 9 months ago

I had to refresh my memory, but this code comment describes how the detection occurs:

Safety issue: The content addressable storage model relies on the fact that once hashed, data does not change. However, this cannot be assured when the data is mmapped from a file that may not be protected from updates. To avoid propagating bad data in the cache, content_mmap_validate() is called each time an mmapped cache entry is accessed. This function recomputes the hash to make sure the content has not changed. If the data has changed, the content-cache returns an error to the requestor.

In addition, mmapped pages could become invalid if the size of a mapped file is reduced. Accessing invalid pages could cause the broker to crash with SIGBUS. To mitigate this, content_mmap_validate() also calls stat(2) on the file to make sure the memory region is still valid. This is not foolproof because it is inherently a time-of-check-time-of-use problem. In fact, we rate limit the calls to stat(2) to avoid a "stat storm" when a file with many blobrefs is accessed, which increases the window where it could have changed. But it's likely better than not checking at all.
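For illustration, the shape of that check is roughly the following. This is a standalone sketch, not flux-core's actual code: the names are hypothetical, the FNV-1a hash stands in for the real SHA-1, and the one-second rate limit is made up for the example.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>
#include <time.h>

/* Stand-in for the broker's real content hash (SHA-1 in the report above). */
static uint64_t blob_hash(const void *data, size_t size)
{
    const uint8_t *p = data;
    uint64_t h = 1469598103934665603ULL;   /* FNV-1a offset basis */
    for (size_t i = 0; i < size; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;             /* FNV-1a prime */
    }
    return h;
}

/* Validate an mmapped region before serving it from the content cache:
 * 1. rate-limited stat(2), so a truncated file can't SIGBUS the broker;
 * 2. rehash the mapped bytes, so changed content is detected. */
static bool mmap_region_validate(const char *path, const void *data,
                                 size_t size, size_t file_offset,
                                 uint64_t expected_hash)
{
    static time_t last_stat;
    time_t now = time(NULL);

    if (now - last_stat >= 1) {            /* at most one stat(2) per second */
        struct stat st;
        if (stat(path, &st) < 0
            || (size_t)st.st_size < file_offset + size)
            return false;                  /* file shrank or vanished */
        last_stat = now;
    }
    return blob_hash(data, size) == expected_hash;
}

Note the window the comment describes: between the (rate-limited) stat and the hash comparison, the file can still change out from under the mapping.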

garlick commented 9 months ago

It sure sounds like the source file is getting clobbered by one or more of the flux filemap get commands. If that's not a possibility then we should dig deeper to figure out what is going on.

This thing is a bit of a footgun though. Hmm, currently we're just using libarchive's defaults for file extraction. Maybe we could tell it not to overwrite existing files, e.g. set

ARCHIVE_EXTRACT_NO_OVERWRITE
Existing files on disk will not be overwritten. By default, existing regular files are truncated and overwritten; existing directories will have their permissions updated; other pre-existing objects are unlinked and recreated from scratch.

https://man.freebsd.org/cgi/man.cgi?query=archive_write_disk
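Setting that flag on the write-disk handle would look something like this. A minimal sketch of the libarchive flag usage, not flux-core's actual extraction code; the other flags are just plausible companions.

#include <archive.h>

/* Create a write-to-disk handle that refuses to clobber existing files. */
static struct archive *make_extractor(void)
{
    struct archive *ext = archive_write_disk_new();
    int flags = ARCHIVE_EXTRACT_TIME
              | ARCHIVE_EXTRACT_PERM
              | ARCHIVE_EXTRACT_NO_OVERWRITE;  /* leave existing files alone */
    if (ext)
        archive_write_disk_set_options(ext, flags);
    return ext;
}

With that flag set, a get that lands on the still-mapped source file would fail loudly instead of silently corrupting the map.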

garlick commented 9 months ago

In 0.59, the source/destination overlap will at least be harder to fall into. Let's close this for now on the assumption that something must've changed in the source file during the broadcast. We can reopen if we get a verified case that contradicts that assumption.