NetApp / trident

Storage orchestrator for containers
Apache License 2.0

Differences from POSIX #879

Open nemobis opened 6 months ago

nemobis commented 6 months ago

Describe the solution you'd like

I can't find any mention of POSIX compliance in the Trident documentation or connected docs.

It would be nice to have a page in the documentation describing how to maximise POSIX compliance (for example, does it help to enable NFSv4 in ONTAP?) and what the remaining differences from the POSIX standard are.

Describe alternatives you've considered

I've considered using a locally mounted filesystem (for example through Longhorn) or CephFS (see CephFS on differences from POSIX).

Additional context

I'm testing Prometheus with a Trident CSI PersistentVolume. The Prometheus docs state that a POSIX compliant filesystem is required and that most NFS implementations are discouraged.

wonderland commented 6 months ago

I personally don't have such a document, but I'm sharing some of my experience, hope it helps...

TL;DR

Trident provides shared storage (with NFS or SMB as the protocol) as well as block storage (with iSCSI or NVMe/TCP as the protocol). The latter will have a local file system (such as ext4 or XFS). You can use both in parallel and from the exact same ONTAP storage system. In general, if a workload says it requires a local file system (i.e. block storage) I'd just use that rather than arguing that a shared file system (such as NFS) works as well (which it typically would).

The details...

NFS is by nature a shared file system (i.e. it can be accessed by multiple clients in parallel). That conflicts with some of the requirements POSIX has for file systems. (And yes, some of those requirements are very weird and unexpected, such as being able to continue using a file that you previously deleted.)
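To make the "deleted file" requirement concrete, here is a minimal Python sketch of the POSIX behaviour in question: a file is unlinked while it is still open, and the open handle keeps working. The function name and the optional `directory` argument are made up for illustration. On a local Linux file system this should just work; NFS clients typically emulate it by renaming the file to a hidden `.nfs...` name (the "silly rename"), so pointing `directory` at an NFS mount is one way to see how a given setup behaves.

```python
import os
import tempfile


def read_after_unlink(directory=None):
    """Write a file, remove its name, then keep reading through the open handle."""
    fd, path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w+b") as handle:
        handle.write(b"still here")
        handle.flush()
        os.unlink(path)                     # the directory entry is removed...
        name_exists = os.path.exists(path)
        handle.seek(0)
        data = handle.read()                # ...but the open handle is still usable
    return name_exists, data


if __name__ == "__main__":
    # Pass a directory on the volume under test, e.g. the PVC mount point.
    print(read_after_unlink())
```

This only probes one POSIX expectation from a single client; it says nothing about what a second client sees at the same time.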

Linux itself (which I assume you are using) is NOT POSIX compliant, unless you are running one of the more obscure distributions such as EulerOS, which is specifically modified to be 100% POSIX compliant. Linux is generally considered "mostly POSIX compliant". As a simple test, run "ls . -a" on your Linux system. It will list all files/directories of the current folder, because "-a" is honored as an option even though it follows an operand. On a POSIX compliant system (macOS might be one you have access to) this would return an error such as "-a: No such file or directory", as POSIX demands that the "-a" is treated as an operand, not an option. If even your OS is not fully POSIX compliant, does the file system have to be, or is "mostly POSIX compliant" OK as well? ONTAP NFS is "mostly POSIX compliant" (NFSv4 recommended) and also offers specific settings to overcome some of the more common "issues" such as "silly rename".
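The same option-versus-operand difference can be reproduced without `ls`. As a rough sketch, Python's standard `getopt` module offers both parsing styles: `getopt.getopt` stops at the first non-option argument (the POSIX rule), while `getopt.gnu_getopt` permutes arguments the way GNU tools do. The helper names below are just for illustration, and the GNU-style result assumes the `POSIXLY_CORRECT` environment variable is not set.

```python
import getopt

ARGV = [".", "-a"]  # the arguments from `ls . -a`


def parse_posix(argv):
    # POSIX style: option scanning stops at the first operand (".")
    return getopt.getopt(argv, "a")


def parse_gnu(argv):
    # GNU style: options are still recognised after operands
    return getopt.gnu_getopt(argv, "a")


if __name__ == "__main__":
    print("POSIX:", parse_posix(ARGV))
    print("GNU:  ", parse_gnu(ARGV))
```

With POSIX-style parsing "-a" ends up in the operand list, which is why a strictly POSIX `ls` then looks for a file called "-a".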

There are big differences between NFS server implementations, and a Linux workstation running an NFS server is different from running NFS on an enterprise storage system such as ONTAP. Application developers tend not to be aware of these differences, which results in recommendations against NFS in the docs.

Trident is a CSI driver; it does not provide the storage and/or file system itself. I don't think Trident docs would be the right place to cover POSIX compliance as this is nothing provided by Trident.

I wouldn't read too much into CephFS trying to explain why their proprietary file system is better than an open standard such as NFS ;-)

From my experience, Prometheus runs just fine with ONTAP NFS from a functional perspective. However, the Prometheus workload might not be ideal for a remote file system from a performance perspective. It creates and updates a very large number of small files. This includes metadata operations (such as file creation and timestamp updates) as well as the actual data being written. The metadata overhead is very large if the data itself is only a few bytes. With a shared/remote file system the metadata operations must go over the wire and are subject to network latency. With a local file system (which can be backed by network storage, e.g. via block protocols such as iSCSI or NVMe/TCP) this happens locally and without any network latency, resulting in better performance. Depending on the size of your Prometheus setup you might get better performance with block storage than with NFS. This is very different from most other workloads (such as SQL databases) where NFS would deliver the same performance as a block protocol.
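If you want to put a number on the small-file overhead for your own setup, a rough probe like the following can be run once against an NFS-backed volume and once against a block-backed one. It is only a stand-in for a metadata-heavy workload, not a reproduction of how Prometheus actually writes its data; the function name, file names, file count and payload size are arbitrary choices.

```python
import os
import tempfile
import time


def time_small_writes(directory, count=200, payload=b"x" * 16):
    """Create `count` tiny files in `directory`, fsync each, and return elapsed seconds."""
    paths = [os.path.join(directory, f"sample-{i:05d}") for i in range(count)]
    start = time.perf_counter()
    for path in paths:
        with open(path, "wb") as handle:   # file creation is a metadata operation
            handle.write(payload)          # the data itself is only a few bytes
            handle.flush()
            os.fsync(handle.fileno())      # force the write to the backing storage
    elapsed = time.perf_counter() - start
    for path in paths:                     # clean up the probe files
        os.remove(path)
    return elapsed


if __name__ == "__main__":
    # Replace the temporary directory with a path on the volume under test.
    with tempfile.TemporaryDirectory() as scratch:
        seconds = time_small_writes(scratch)
        print(f"200 files in {seconds:.3f}s")
```

Comparing the two results gives a first impression of how much of the cost is per-file latency rather than throughput.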

nemobis commented 6 months ago

Thanks @wonderland, I was hoping you'd answer this! (I saw some previous comments of yours about Trident in Kubernetes.)

Indeed "full POSIX compliance" is somewhat of a myth, but I hope "differences from POSIX" is a sufficiently reasonable framing for people to make their own informed decisions.

> I don't think Trident docs would be the right place to cover POSIX compliance as this is nothing provided by Trident.

Fair enough. I didn't realise that Trident supports so many different backends.

I thought the drivers might have some common feature/default setting with some impact on POSIX compliance. If not, maybe that's something worth mentioning somewhere in the design and architecture guide, ideally with some pointers on where to find more information? For example I see that in the Storage configuration section there are links to documents on how to deal with so-called POSIX ACLs. There's probably some useful documentation somewhere in the depths of the NetApp website or elsewhere (for example I found NFStest) but specific pointers are very useful.

Concretely speaking, if I'm developing something in Kubernetes and I'm told I can use an ONTAP backend through Trident, I land on a page like Choosing a driver and I'm not sure whether I want, say, ontap-nas/NFS or ontap-san/iSCSI with xfs, or whether there are default mount options I should care about.
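For what it's worth, the choice can be written down as two StorageClass objects, one per driver. The sketch below builds them in Python and prints them as JSON, which `kubectl apply -f` accepts. The class names are made up, the `nfsvers=4.1` mount option is just one possible setting, and the `backendType` and `fsType` parameter names are how I understand the Trident storage class options, so please check them against the docs for your Trident version.

```python
import json


def storage_class(name, backend_type, fs_type=None, mount_options=None):
    """Build a StorageClass manifest for the Trident CSI provisioner."""
    parameters = {"backendType": backend_type}
    if fs_type:
        parameters["fsType"] = fs_type      # only meaningful for block backends
    manifest = {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},         # hypothetical name
        "provisioner": "csi.trident.netapp.io",
        "parameters": parameters,
    }
    if mount_options:
        manifest["mountOptions"] = list(mount_options)
    return manifest


if __name__ == "__main__":
    # Shared file system over NFS, pinned to NFSv4.1 (assumed option)
    nas = storage_class("example-nas", "ontap-nas", mount_options=["nfsvers=4.1"])
    # Block storage over iSCSI with a local XFS file system
    san = storage_class("example-san", "ontap-san", fs_type="xfs")
    print(json.dumps([nas, san], indent=2))
```

A workload that asks for a local file system would then reference the second class in its PersistentVolumeClaim.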

> I wouldn't read too much into CephFS trying to explain why their proprietary file system is better than an open standard such as NFS ;-)

Eheh. Still the page seems useful to reason about some details. For example when Prometheus devs proclaim a need for full POSIX compliance I doubt they mean that Prometheus cares about .snap files or is opinionated about NFSv4.x ACLs (or even the atime field).

> From my experience, Prometheus runs just fine with ONTAP NFS from a functional perspective. However, the Prometheus workload might not be ideal for a remote file system from a performance perspective.

Good to know!

Initially I was worried about the performance penalty but I now expect that's going to be easy enough to measure and counteract. For example I'm throwing 3 times more memory at Prometheus on NFSv4 Trident than I used to with a similar load on a VM with xfs, and I can plan for slower startups.

What worries me the most is whether something is just going to catastrophically fail upon a non-graceful shutdown or things like that. For example recent Prometheus versions have improved on recovery from corrupt chunks etc. (https://github.com/prometheus/prometheus/pull/10316 https://github.com/prometheus/prometheus/pull/9856 https://github.com/prometheus/prometheus/pull/12500) but it's hard to tell which assumptions are made about write/delete queues, mmap calls, truncations etc. without knowing what to look for.

nemobis commented 6 months ago

> In general, if a workload says it requires a local file system (i.e. block storage) I'd just use that rather than arguing that a shared file system (such as NFS) works as well

OTOH... I guess it all boils down to this and it would be enough to have this advice somewhere prominent in the docs. :)

scaleoutsean commented 2 weeks ago

> no persistent volumes available for this claim and no storage class is set

The linked issue says "no storage class is set". POSIX-SCHMOSIX. How is an app supposed to work when there's no SC it can use?