chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

should Chapel handle files on network filesystems differently? #24499

Open mppf opened 7 months ago

mppf commented 7 months ago

This issue is a spin-off of https://github.com/chapel-lang/chapel/issues/24462#issuecomment-1961658918

At present, Chapel files and fileReaders / fileWriters are tied to the locale on which the relevant file is created. (Note that #7953 proposes allowing fileReaders / fileWriters on different locales from the related file). However, when working with a network filesystem (NFS, Lustre, GPFS, etc.) in a multi-locale program, all of the locales/compute nodes will typically have access to the same filesystem. In that case, sending all of the data to the particular locale that opened the file is unnecessary and in fact likely to reduce performance. So, in a multi-locale setting with a network filesystem, the most straightforward way to express an I/O pattern is to have each locale call open to create its own local open file for the same path.

However, this runs into two challenges:

  1. It's not the most obvious way to use Chapel's I/O system. A more obvious way to use the I/O system would be to open one file and use that from all locales. However, as mentioned, this is not going to achieve good I/O performance in a network filesystem setting.
  2. When working with filesystems mounted through NFS, if multiple locales are updating the same file, it seems to lead to data corruption issues. See issue #24462 and https://stackoverflow.com/questions/73007716/parallel-writes-to-nfs-backed-file for cases where Chapel programmers have run into challenges that seem to be due to NFS's close-to-open consistency.

A bit more about NFS

From Why NFS Sucks by Olaf Kirch:

An NFS client is permitted to cache changes locally and send them to the server whenever it sees fit. This sort of lazy write-back greatly helps write performance, but the flip side is that everyone else will be blissfully unaware of these changes before they hit the server. To make things just a little harder, there is also no requirement for a client to transmit its cached writes in any particular fashion, so dirty pages can (and often will be) written out in random order.

I read two implications from this paragraph:

  1. The writes you do on different locales can be observed by the NFS server in an arbitrary order.
  2. These writes are done at an OS page granularity (usually 4k).

Also, note that according to the document quoted above, NFSv4 does not solve these problems.

In particular, the data corruption issues observed in #24462 and https://stackoverflow.com/questions/73007716/parallel-writes-to-nfs-backed-file are probably coming from this kind of pattern: two locales write to nearby but non-overlapping regions of the same file that happen to share an OS page, each NFS client caches and writes back the whole page, and whichever client flushes last overwrites the other's bytes.

How to handle this better?

The Chapel I/O implementation could identify when a file is being opened on a network filesystem that is available on all locales. If it detects that, it could shift to a different implementation strategy for file I/O:

  1. When creating a fileReader / fileWriter on a locale using a file that was opened on another locale, it simply opens a new file with the same path on the local locale.
  2. To support NFS, the fileReaders / fileWriters open for a particular file can include some global synchronization to avoid writing to a region of a file that is too near a region that another fileWriter has open. This could address the NFS consistency issues.

More about (2) from https://github.com/chapel-lang/chapel/issues/24462#issuecomment-1961658918 :

Could [/should] Chapel support a concept of a "network file"? You open the file once in the main task, but fileWriters on other locales could "own" specific byte/page ranges (according to different network file consistency models), allowing them to mostly parallel write to the network filesystem. The owner could default to whoever first owned that page/byte-range without closing it, and transfer ownership to the oldest open fileWriter as tasks wrap up and flush. Any writes to non-owned byte/page-ranges would be communicated to the owning locale to write.

In my case this would mean the long extents of pages that strictly belong to one task would be owned by that task's locale, but the contested pages would simply be owned by the locale of whichever task opened them first. Most writes would go straight to the NFS, but only contested pages would go through the slower two-hop path to the owner and then to the NFS.

Here are some notes about how it's possible to identify if a file path refers to a shared network filesystem for (1):

In an earlier project, I created code to automatically detect this situation. It used `statfs` to get the filesystem type / magic number, and used that to determine whether the path could be on a shared network filesystem. Next, it used `statvfs` to get the filesystem ID and checked that these matched on all of the nodes. Lastly, it checked additional information, such as whether a file to be accessed exists and has the same size on each node.
https://github.com/femto-dev/femto/blob/931b483a007234cc9291bc8705a9bac2255b557c/src/utils/page_utils.c#L384-L417
https://github.com/femto-dev/femto/blob/931b483a007234cc9291bc8705a9bac2255b557c/src/mpi/mpi_utils.cc#L38-L48
brandon-neth commented 6 months ago

We ran into another issue with NFS with Zarr I/O: https://github.com/Cray/chapel-private/issues/6129