ICLDisco / parsec

PaRSEC is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed, GPU-accelerated, many-core heterogeneous architectures. PaRSEC assigns computation threads to the cores and GPU accelerators, overlaps communications and computations, and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse.

Profiling tools attempt to parse incomplete profile #334

Closed: omor1 closed this issue 2 years ago

omor1 commented 2 years ago

Is your feature request related to a problem? Please describe.
Profiling will finalize the profile file data only when the PaRSEC finalization function is called. If the program crashes or exits beforehand, the profile will be incomplete. The profiling tools do not recognize this and try to parse the file regardless, leading to errors and bugs, as well as potentially bad analysis.

Describe the solution you'd like
Profiling tools should identify when a profile file is incomplete and report an error. A flag (perhaps -f/--force) could be added to allow the tool to proceed regardless.
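A minimal sketch of the requested behavior (hypothetical code, not the actual tool source; the marker string and its offset at byte 8 are read off the hexdumps later in this thread and are assumptions):

#include <stdio.h>
#include <string.h>

#define MAGICK_OFFSET 8
#define MAGICK        "#PARSEC BINARY PROFILE"

/* Completeness test: does the header carry the end-of-run marker? */
static int profile_looks_complete(const char *path)
{
    char buf[sizeof(MAGICK) - 1];
    FILE *f = fopen(path, "rb");
    int ok = 0;
    if (f != NULL) {
        ok = fseek(f, MAGICK_OFFSET, SEEK_SET) == 0
          && fread(buf, 1, sizeof(buf), f) == sizeof(buf)
          && memcmp(buf, MAGICK, sizeof(buf)) == 0;
        fclose(f);
    }
    return ok;
}

int main(int argc, char *argv[])
{
    int force = argc > 2 && (!strcmp(argv[2], "-f") ||
                             !strcmp(argv[2], "--force"));
    if (argc < 2) {
        fprintf(stderr, "usage: %s <profile> [-f|--force]\n", argv[0]);
        return 1;
    }
    if (!profile_looks_complete(argv[1])) {
        fprintf(stderr, "%s: profile looks incomplete%s\n", argv[1],
                force ? ", parsing anyway (--force)" : "; refusing to parse");
        if (!force) return 1;
    }
    /* ... hand the file to the actual parser here ... */
    return 0;
}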

Describe alternatives you've considered
Keep as-is; rely on users to ensure profile files are complete.

Additional context
I didn't realize a run of HiCMA+PaRSEC had segfaulted, resulting in an incomplete profile. This caused issues later when I tried to analyze it.

therault commented 2 years ago

When we open a binary profile file, we check that the header has a marker at a specific position (https://github.com/ICLDisco/parsec/blob/a2a9e1a75113db03ab483b3d787e1f426a2ac7ca/tools/profiling/dbpreader.c#L846).

That marker is written to the file when the header is dumped, i.e., at the end of the execution (https://github.com/ICLDisco/parsec/blob/a2a9e1a75113db03ab483b3d787e1f426a2ac7ca/parsec/profiling.c#L1459).

So, the current code should already fail in the described scenario... Will try to reproduce to understand what is going wrong.

omor1 commented 2 years ago

Hmmm you're right. So clearly I a) had a valid PARSEC_PROFILING_MAGICK and b) had some other, invalid data. In fact, let me post the hexdump of the headers (first 224 bytes):

dpotrf_tlr.10217680-0.prof:
00000000  00 00 00 00 00 00 00 00  23 50 41 52 53 45 43 20  |........#PARSEC |
00000010  42 49 4e 41 52 59 20 50  52 4f 46 49 4c 45 20 0c  |BINARY PROFILE .|
00000020  00 70 75 49 55 15 00 00  ef cd ab 89 67 45 23 01  |.puIU.......gE#.|
00000030  00 10 00 00 2f 68 6f 6d  65 2f 6f 6d 6f 72 31 2f  |..../home/omor1/|
00000040  73 70 61 63 6b 2f 6f 70  74 2f 73 70 61 63 6b 2f  |spack/opt/spack/|
00000050  6c 69 6e 75 78 2d 63 65  6e 74 6f 73 38 2d 7a 65  |linux-centos8-ze|
00000060  6e 32 2f 61 6f 63 63 2d  33 2e 30 2e 30 2f 68 69  |n2/aocc-3.0.0/hi|
00000070  63 6d 61 2d 78 2d 64 65  76 2d 6c 6f 63 61 6c 2d  |cma-x-dev-local-|
00000080  62 71 65 77 36 6a 79 69  36 32 6c 72 70 71 6a 72  |bqew6jyi62lrpqjr|
00000090  79 63 36 77 6f 61 78 75  37 35 32 6a 76 35 75 32  |yc6woaxu752jv5u2|
000000a0  2f 6c 69 62 65 78 65 63  2f 68 69 63 6d 61 2f 74  |/libexec/hicma/t|
000000b0  65 73 74 00 97 00 00 00  00 10 00 00 00 00 00 00  |est.............|
000000c0  13 00 00 00 00 00 00 00  00 20 b1 12 00 00 00 00  |......... ......|
000000d0  00 00 00 00 80 00 00 00  00 c0 b5 12 00 00 00 00  |................|
dpotrf_tlr.10217680-1.prof:
00000000  00 00 00 00 00 00 00 00  23 50 41 52 53 45 43 20  |........#PARSEC |
00000010  42 49 4e 41 52 59 20 50  52 4f 46 49 4c 45 20 0c  |BINARY PROFILE .|
00000020  00 70 75 49 55 15 00 00  ef cd ab 89 67 45 23 01  |.puIU.......gE#.|
00000030  00 10 00 00 2f 68 6f 6d  65 2f 6f 6d 6f 72 31 2f  |..../home/omor1/|
00000040  73 70 61 63 6b 2f 6f 70  74 2f 73 70 61 63 6b 2f  |spack/opt/spack/|
00000050  6c 69 6e 75 78 2d 63 65  6e 74 6f 73 38 2d 7a 65  |linux-centos8-ze|
00000060  6e 32 2f 61 6f 63 63 2d  33 2e 30 2e 30 2f 68 69  |n2/aocc-3.0.0/hi|
00000070  63 6d 61 2d 78 2d 64 65  76 2d 6c 6f 63 61 6c 2d  |cma-x-dev-local-|
00000080  62 71 65 77 36 6a 79 69  36 32 6c 72 70 71 6a 72  |bqew6jyi62lrpqjr|
00000090  79 63 36 77 6f 61 78 75  37 35 32 6a 76 35 75 32  |yc6woaxu752jv5u2|
000000a0  2f 6c 69 62 65 78 65 63  2f 68 69 63 6d 61 2f 74  |/libexec/hicma/t|
000000b0  65 73 74 00 97 00 00 00  00 10 00 00 00 00 00 00  |est.............|
000000c0  13 00 00 00 00 00 00 00  00 40 1e 14 00 00 00 00  |.........@......|
000000d0  01 00 00 00 80 00 00 00  00 50 22 14 00 00 00 00  |.........P".....|
dpotrf_tlr.10217680-2.prof:
00000000  00 00 00 00 00 00 00 00  23 50 41 52 53 45 43 20  |........#PARSEC |
00000010  42 49 4e 41 52 59 20 50  52 4f 46 49 4c 45 20 0c  |BINARY PROFILE .|
00000020  00 70 75 49 55 15 00 00  ef cd ab 89 67 45 23 01  |.puIU.......gE#.|
00000030  00 10 00 00 2f 68 6f 6d  65 2f 6f 6d 6f 72 31 2f  |..../home/omor1/|
00000040  73 70 61 63 6b 2f 6f 70  74 2f 73 70 61 63 6b 2f  |spack/opt/spack/|
00000050  6c 69 6e 75 78 2d 63 65  6e 74 6f 73 38 2d 7a 65  |linux-centos8-ze|
00000060  6e 32 2f 61 6f 63 63 2d  33 2e 30 2e 30 2f 68 69  |n2/aocc-3.0.0/hi|
00000070  63 6d 61 2d 78 2d 64 65  76 2d 6c 6f 63 61 6c 2d  |cma-x-dev-local-|
00000080  62 71 65 77 36 6a 79 69  36 32 6c 72 70 71 6a 72  |bqew6jyi62lrpqjr|
00000090  79 63 36 77 6f 61 78 75  37 35 32 6a 76 35 75 32  |yc6woaxu752jv5u2|
000000a0  2f 6c 69 62 65 78 65 63  2f 68 69 63 6d 61 2f 74  |/libexec/hicma/t|
000000b0  65 73 74 00 00 00 00 00  00 00 00 00 00 00 00 00  |est.............|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000d0  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
dpotrf_tlr.10217680-3.prof:
00000000  00 00 00 00 00 00 00 00  23 50 41 52 53 45 43 20  |........#PARSEC |
00000010  42 49 4e 41 52 59 20 50  52 4f 46 49 4c 45 20 0c  |BINARY PROFILE .|
00000020  00 70 75 49 55 15 00 00  ef cd ab 89 67 45 23 01  |.puIU.......gE#.|
00000030  00 10 00 00 2f 68 6f 6d  65 2f 6f 6d 6f 72 31 2f  |..../home/omor1/|
00000040  73 70 61 63 6b 2f 6f 70  74 2f 73 70 61 63 6b 2f  |spack/opt/spack/|
00000050  6c 69 6e 75 78 2d 63 65  6e 74 6f 73 38 2d 7a 65  |linux-centos8-ze|
00000060  6e 32 2f 61 6f 63 63 2d  33 2e 30 2e 30 2f 68 69  |n2/aocc-3.0.0/hi|
00000070  63 6d 61 2d 78 2d 64 65  76 2d 6c 6f 63 61 6c 2d  |cma-x-dev-local-|
00000080  62 71 65 77 36 6a 79 69  36 32 6c 72 70 71 6a 72  |bqew6jyi62lrpqjr|
00000090  79 63 36 77 6f 61 78 75  37 35 32 6a 76 35 75 32  |yc6woaxu752jv5u2|
000000a0  2f 6c 69 62 65 78 65 63  2f 68 69 63 6d 61 2f 74  |/libexec/hicma/t|
000000b0  65 73 74 00 97 00 00 00  00 10 00 00 00 00 00 00  |est.............|
000000c0  13 00 00 00 00 00 00 00  00 00 a2 12 00 00 00 00  |................|
000000d0  03 00 00 00 80 00 00 00  00 60 a6 12 00 00 00 00  |.........`......|

So all four files have the correct magic, but the file for node 2 is most certainly incomplete: note how e.g. the dictionary size, offset, etc. are all 0. The segfault was on node 2.

Two things stick out to me. First, the segfault clearly occurred at some point midway through writing data. Second, there are several features in the profiling subsystem that could be conducive to this occurring. One is that a helper thread can be used, meaning there can be a long delay between when parsec_profiling_dbp_dump is called and when the write to disk occurs. Another is that, regardless of whether the mmap or write method is used, the data never gets synced to disk, with either msync or fsync.

I think we begin approaching crash consistency territory with this, but an initial suggestion could be to write the magic key last and ensure a sync before/after this write, so that the profile is not valid until after all other data is written. Care should be taken with the mmap method, since I think in that case the helper thread is never actually writing data, just doing the munmap.
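A sketch of that ordering under simplifying assumptions (a plain file descriptor rather than PaRSEC's mmap-plus-helper-thread machinery, and the marker location from the hexdumps above):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MAGICK_OFFSET 8
#define MAGICK        "#PARSEC BINARY PROFILE"

/* Called once everything else (events, dictionary, header fields) has
 * been written to fd. A crash before step 2 leaves a file without the
 * magic, which the tools will then reject. */
int finalize_profile(int fd)
{
    if (fsync(fd) != 0)                              /* 1. all data on disk */
        return -1;
    if (pwrite(fd, MAGICK, strlen(MAGICK), MAGICK_OFFSET)
            != (ssize_t)strlen(MAGICK))              /* 2. validate file    */
        return -1;
    return fsync(fd);                                /* 3. magic on disk    */
}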

therault commented 2 years ago

You're right about the ordering. Good catch. I think it would be consistent enough to ensure that the command list between the helper thread and the other thread is FIFO, but I'm using the lockless LIFO to avoid waiting when pushing orders onto the list. A hard synchronization before pushing the last command, the one that flushes the header, would work. I'll check and create a PR.
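One possible shape for that hard synchronization (all names here are hypothetical, not the actual PaRSEC profiling internals): drain the pending commands before pushing the header flush, so it is provably executed last even though the queue itself stays a lockless LIFO.

#include <pthread.h>

extern void push_io_command(int cmd);   /* hypothetical: enqueue for helper */
#define CMD_FLUSH_HEADER 0xF1           /* hypothetical command code        */

static pthread_mutex_t io_lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  io_drained = PTHREAD_COND_INITIALIZER;
static int pending_io = 0;  /* ++ on push, -- by the helper thread, which
                             * signals io_drained when it reaches zero     */

void push_header_flush_last(void)
{
    pthread_mutex_lock(&io_lock);
    while (pending_io > 0)                  /* hard synchronization point  */
        pthread_cond_wait(&io_drained, &io_lock);
    push_io_command(CMD_FLUSH_HEADER);      /* guaranteed to execute last  */
    pthread_mutex_unlock(&io_lock);
}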

W.r.t. syncing data to disk when using mmap, the code is indeed Linux-specific... Since Linux 2.6.19 (years ago...), the OS tracks dirty pages even for file-backed segments: msync is in fact a no-op, and munmap ensures eventual output of the pages onto the disk (see this 2004 discussion: https://lkml.iu.edu/hypermail/linux/kernel/0404.2/0254.html). However, this is not POSIX; POSIX says msync is necessary before munmap (costless in our case, so I will add it in the PR for the sake of exotic OSs).

That will not ensure that the data is written 'to disk': the file is not opened O_SYNC. If the OS (or the node) crashes soon after the end, it's still possible for the file to be corrupted, and the error could be silent (the OS would need to have flushed the page with the header to disk, but not yet some other page that would be lost), but I don't think this is an issue we should care about. I'd rather take this small risk than wait a potentially long time at the end of every execution.
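In code, the portable pattern being added is simply the following (a sketch; MS_ASYNC chosen because, as noted, it is essentially free on Linux):

#include <sys/mman.h>

int flush_and_unmap(void *buf, size_t len)
{
    /* POSIX only guarantees that changes reach the file if msync() is
     * called before munmap(); on Linux >= 2.6.19 munmap() alone already
     * queues the dirty pages, so this call costs nothing there. */
    if (msync(buf, len, MS_ASYNC) != 0)
        return -1;
    return munmap(buf, len);
}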


therault commented 2 years ago

Original bug: a SEGFAULT happened during an execution, and the profile generated looked exploitable, but contained only partial information / trash.

Solution to just that problem: only declare the profile valid (write the MAGICK marker in the header) once all the data has reached the file, so that an interrupted execution leaves a profile the tools will reject.

Why the current code 'works': the marker is written when the header is dumped, at the end of the execution, so most aborted runs already produce a file that fails the header check.

Why it's not completely correct: the helper thread consumes I/O commands from a lockless LIFO, so writes can complete out of order, and nothing (msync/fsync) forces the data to disk; a crash in the middle of the dump can therefore leave a valid-looking header next to missing or garbage data.

What was proposed to make things more resilient to SEGFAULTs and other abnormal behaviors: enforce that the header flush is the last command executed, and (optionally) sync the data to disk before the marker is written.

I made some measurements. The benchmark used is the multichain benchmark in tests/runtime/multichain (slightly modified to improve the output and give more control over the behavior). The command line for each run is:

multichain -i=1000 -j=1000 -v=0 -l=1 -s=0 --mca profile_filename test --mca profile_show_profiling_performance 1 --mca mca_pins task_profiler

The benchmark generates about 6 times 1 million tasks, in 1000 chains of 1000 tasks. The backend file is on /tmp of a08, which is a hard drive. The profile file is about 1.3 GB, and each run takes about 1 min. Each configuration is run 20 times, because there is high variability (I/O is not very stable).

Vanilla is the current code: subject to the race condition in case of abnormal termination, with most of the I/O cost deferred to close / exit / after the process is dead (and thus not measurable).

vanilla-fsync is a change of the current code where all I/Os scheduled before writing the final header in parsec_profiling_dbp_dump() are forced to complete (reach the disk) by calling fsync on the backend file descriptor; then come the update of the magick field in the header, the munmap of the header buffer, the munmap of the parts of the file that were reserved in advance and left empty/untouched, and finally the close of the file.

msync just adds msync(MS_ASYNC) on each buffer before we call munmap() on it. I tried with msync(MS_SYNC) too, but it slows the process down so much that we run into a highly fragmented segment map, and the profiling system fails to complete: some munmap or mmap calls fail with 'unable to create segment' / 'unable to allocate memory'.

msync-fsync contains both changes.
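In code terms, the four configurations place the two calls roughly as follows (a sketch with assumed names and hypothetical compile-time switches, not the actual parsec/profiling.c):

#include <sys/mman.h>
#include <unistd.h>

/* One event buffer being retired by the helper thread. */
static void retire_buffer(void *buf, size_t len)
{
#ifdef WITH_MSYNC                 /* "msync" and "msync-fsync" runs      */
    msync(buf, len, MS_ASYNC);    /* schedule write-back of this buffer  */
#endif
    munmap(buf, len);
}

/* End of parsec_profiling_dbp_dump(), before the final header write. */
static void before_final_header(int fd)
{
#ifdef WITH_FSYNC                 /* "vanilla-fsync" and "msync-fsync"   */
    fsync(fd);                    /* force all scheduled I/O to the disk */
#else
    (void)fd;                     /* vanilla: nothing is forced to disk  */
#endif
    /* ... then update the magick field, munmap the header, close(fd) ... */
}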

Looking at the figure (quickly done with seaborn, so it's not perfect, but it gives a rough idea):

[Figure: summary plot of the measurements for the four configurations]

therault commented 2 years ago

See PR https://github.com/ICLDisco/parsec/pull/337

bosilca commented 2 years ago

I can't find anywhere in the mmap/munmap/msync documentation any mention suggesting that forcing an ASYNC msync would provide better safety than just calling munmap, nor that the OS can discard unmapped pages from being written to disk when the process that generated them crashes. Did you find some pointer to that?

That being said, why do we care about this corner case? We want fast, reliable profiling when things go nicely, and the best support we can manage otherwise. For the best-support part, I think we should dump the signature and the dictionary as soon as possible, and otherwise do everything as we do today on the runtime side. We should amend the tools to detect an incorrect jump in the linked pages of the profile and stop analysis of the profile there.
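A sketch of that tool-side check, with an invented page-header layout (the real .prof page format is whatever dbpreader expects, so treat every name here as an assumption): validate each next-page offset before following it, and stop the analysis cleanly at the first bad jump.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t next_offset;           /* invented field: next page, 0 = last */
    /* ... events would follow ... */
} page_header_t;

/* base/file_size describe the mmap'ed profile; returns the number of
 * pages that can be trusted. */
int walk_pages(const char *base, size_t file_size, uint64_t first)
{
    uint64_t off = first;
    int pages = 0;
    int max_pages = (int)(file_size / sizeof(page_header_t)) + 1;
    while (off != 0) {
        if (off + sizeof(page_header_t) > file_size || pages > max_pages) {
            fprintf(stderr, "incorrect jump after %d pages: "
                            "stopping analysis of this truncated profile\n",
                    pages);
            return pages;           /* partial, but safe, analysis */
        }
        off = ((const page_header_t *)(base + off))->next_offset;
        pages++;
    }
    return pages;
}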

omor1 commented 2 years ago

W.r.t. syncing data to disk when using mmap, the code is indeed linux-specific... [...]

I'm not sure this is true. While it isn't spelled out explicitly, it's certainly implied that updates to shared file-backed mappings cannot be discarded upon munmap even if there was no msync; the msync just forces the update to occur before the unmap, rather than during or after. I would be incredibly surprised if any kernel didn't do this.

[...] we should dump the signature and the dictionary as soon as possible [...] We should amend the tools to detect an incorrect jump [...]

Doing this seems rather difficult to me. It's probably easier to initialize the magic constant in the file to an invalid value and, as the very last operation on the file, update it to the correct value.

therault commented 2 years ago

I can't find anywhere in the mmap/munmap/msync documentation [...] Did you find some pointer to that?

Nope. In fact, the msync man page implies, by the negative, that munmap will eventually force the synchronization. To quote (from msync): 'Without use of this call, there is no guarantee that changes are written back before munmap(2) is called.'

That being said, why do we care about this corner case? [...] We should amend the tools to detect an incorrect jump in the linked pages of the profile and stop analysis of the profile there.

The patch does that (it moves the MAGICK string write to a safe place at the end, instead of at the beginning).

I can remove the two additions. We agree that msync is useless where it is (it's also costless on Linux, though that may depend on the OS implementation). The fsync adds a little more safety for the corner case, at an uncertain cost, but one paid outside of any critical path.

I'll update the PR to get to the simplest code.

omor1 commented 2 years ago

From msync(2):

Without use of this call, there is no guarantee that changes are written back before munmap(2) is called.

I think that the operative word here is before. This doesn't state that the changes are not written back at all, but that there is no guarantee of when that occurs. In fact, for a shared file-backed mapping, updates must occur at some point. See mmap(2):

MAP_SHARED Share this mapping. Updates to the mapping are visible to other processes mapping the same region, and (in the case of file-backed mappings) are carried through to the underlying file. (To precisely control when updates are carried through to the underlying file requires the use of msync(2).)
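A quick self-contained experiment illustrating the quoted semantics (the file name and sizes are arbitrary): a store through a MAP_SHARED file-backed mapping is visible in the file after munmap(), with no msync() in between.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("demo.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || ftruncate(fd, 4096) != 0) return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    strcpy(p, "hello");          /* store through the mapping...      */
    munmap(p, 4096);             /* ...and unmap without msync()      */

    char buf[6] = {0};
    pread(fd, buf, 5, 0);        /* the update is visible in the file */
    printf("%s\n", buf);         /* prints "hello"                    */
    close(fd);
    return 0;
}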

therault commented 2 years ago

We totally agree on the meaning.

We cannot afford msync(MS_SYNC): it completely destroys performance and leads to starvation scenarios where the system eventually fails, even on a reasonable size like the million-task run I tried.

msync(MS_ASYNC) would offer no more guarantees than calling munmap. So we can add it to give a hint to the OS, but really, munmap is as good a hint.

So, what is the risk: the file will be corrupted only if the OS eventually fails to synchronize the buffer with the filesystem. For me, that means a failure of the node.

The scenario you observed is a SEGFAULT interrupting the dump of the file. And because we had set the MAGICK value in the header during initialization, my guess is that the OS decided to update all the pages it knew about, including the header block, and the file appeared (incorrectly) to be consistent.

In that scenario, now that the MAGICK value is written in the header only after we have munmap'ed all the data buffers, either the failure occurs after that write, in which case the OS knows of all the pages needed to make the file consistent and, not being subject to the failure itself, will simply complete the dump of the file; or the SEGFAULT happens before that write, and the file will never be considered consistent.

This is sufficient to ensure good behavior when the application can crash but the OS cannot.

What the patch proposed with fsync was to go a step further: with fsync, the MAGICK value guarantees consistency of the file even under a failure of the OS. Without fsync, the OS / node may fail during the writing of the buffers, and those writes can happen out of order, so the header might be written before some of the data, which could be lost. With fsync, all the data is on disk before we update the MAGICK value, so we guarantee consistency despite a failure of the OS.

That second step is seen as too costly in the normal case (no failure) for the feature it provides.

omor1 commented 2 years ago

Sounds good to me. I agree with George that we don't need good behavior under an OS or node failure situation, but an application failure should not leave a seemingly-valid profile file.

omor1 commented 2 years ago

Closed by #337. As a side note, I saw this in action—I had an application segfault (actually due to a race somewhere in the profiling code, it seems—fl->first got corrupted somehow) and the profiling file for the node on which the application crashed correctly does not have the magic cookie.