DataDog / glommio

Glommio is a thread-per-core crate that makes writing highly parallel asynchronous applications in a thread-per-core architecture easier for rustaceans.

Should `DmaFile::write_at` return an error if storage is full? #570

Open vlovich opened 2 years ago

vlovich commented 2 years ago

write_at is documented as

Note that it is legal to return fewer bytes than the buffer size. That is the situation, for example, when the device runs out of space (See the man page for write(2) for details)

However, this feels like it should generate a GlommioError that wraps an I/O error of kind std::io::ErrorKind::StorageFull, as most users will want to propagate "disk full" as an error rather than having to check whether the number of bytes written equals the number of bytes sent after each write. The error could additionally contain the number of bytes that were successfully written, but fundamentally I think "storage full - cannot fulfill write" seems like an error case and not a success.
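
For illustration, this is roughly the check every caller has to write today, and the mapping I'm proposing glommio do internally; a minimal sketch assuming write_at reports the bytes written as a usize (ErrorKind::StorageFull needs a recent stable Rust):

```rust
use std::io;

// Minimal sketch (not glommio code): `written` is what DmaFile::write_at
// reported, `requested` is the length of the buffer we submitted. Mapping a
// short write to StorageFull is what this issue proposes, not current
// behavior.
fn require_full_write(written: usize, requested: usize) -> io::Result<usize> {
    if written < requested {
        return Err(io::Error::new(
            io::ErrorKind::StorageFull,
            format!("short write: {written} of {requested} bytes"),
        ));
    }
    Ok(written)
}
```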

Thoughts?

glommer commented 2 years ago

The problem is that the semantics of the write system call are different than that.

It says:

Note that a successful write() may transfer fewer than count bytes. Such partial writes can occur for various reasons; for example...

(emphasis mine, and then they proceed to list examples, including disk full)

So from the principle of least surprise, it is a bit dangerous to just convert short writes to storage full. There is some value in returning storage full at this level of the API, but it has to be done with more care (if you want to take a stab at it - I am not opposed, but frankly, don't see this as super crucial. Would happily merge).

If we receive a short write, then we can retry once inside the API. If the storage is really full, it will then return a failure and ENOSPC that we can convert to a storage full error.

However, it could very well be that you see a short write, try again, and get a successful short write again - that happens for other reasons.
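
Roughly this control flow, as a sketch; the write helper here is a placeholder for issuing one write_at attempt on a freshly built buffer, not a glommio API:

```rust
use std::io;

// Hypothetical sketch of "retry a short write once, then let ENOSPC surface".
// `try_write(offset)` stands in for one positional write attempt of the data
// starting at `offset`; it is a placeholder, not part of glommio.
fn write_with_one_retry<F>(mut try_write: F, len: usize) -> io::Result<usize>
where
    F: FnMut(usize) -> io::Result<usize>,
{
    let first = try_write(0)?;
    if first == len {
        return Ok(first);
    }
    // Short write: retry the remainder once. If the device really is out of
    // space, this second attempt is expected to fail with ENOSPC, which we
    // simply propagate; it could also succeed with another short write.
    let second = try_write(first)?;
    Ok(first + second)
}
```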

vlovich commented 2 years ago

Hmm... is there any documentation for how io_uring works with respect to interrupts? I would have thought that the io_uring syscalls can get interrupted but the actual scheduled I/O would never interrupt and thus short writes aren't possible...

HippoBaro commented 2 years ago

Once a request is pushed to the ring, it cannot be interrupted in the traditional sense of the word (via a signal for instance), because the kernel is processing it outside of user space.

However, this doesn't mean short writes aren't possible. Short writes can happen for a variety of reasons: some filesystems behave differently than others, the kernel can throttle writes to improve reads if it is configured to do so, or the drive you are writing to may be remote and reached through a socket at the driver level, etc.

In summary, since the kernel API allows for short writes, you should expect them to occur without interpreting them as errors.
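
For reference, the general pattern for absorbing short writes looks like this; a sketch using std's positional write rather than glommio's DmaFile (which consumes its buffer), just to show the shape of the loop:

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

// General short-write handling shown with std's positional write, not
// glommio's API: keep re-issuing the remainder at an advanced offset until
// the whole buffer is on disk, and treat a zero-byte write as an error so the
// loop cannot spin forever. (std's FileExt::write_all_at is essentially this.)
fn write_fully_at(file: &File, mut buf: &[u8], mut pos: u64) -> io::Result<()> {
    while !buf.is_empty() {
        match file.write_at(buf, pos) {
            Ok(0) => return Err(io::Error::new(io::ErrorKind::WriteZero, "wrote 0 bytes")),
            Ok(n) => {
                buf = &buf[n..];
                pos += n as u64;
            }
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```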

vlovich commented 2 years ago

Are the write amounts at least guaranteed to be aligned or do I have to align_down and rewrite the part I did?

Are there any examples of recommended ways to handle short writes? I want to make sure I'm writing the code robustly and efficiently. Do you usually just ignore short writes as never happening in your own code? As far as I know, once I call write_at I transfer ownership of the buffer. Does that mean I have to maintain a source copy of the buffer so that I can retry short writes, or is there a better way to maintain 0-copy (or maybe the DmaBuffer isn't actually gone within the framework and it could retry short writes)?

I'm a little surprised that the kernel would throttle writes by abandoning them part-way instead of deferring them for later, but also I would think that's still an error scenario from a developer perspective. The remote case isn't relevant for me so I can ignore that aspect. Are there any others to think about?

glommer commented 2 years ago

To my knowledge, I assume they are aligned, but we should assert. Asserting is bad, but it is less bad than silently breaking something.

To be clear, for async writes, I don't think they can be interrupted. The example that hippo gave about sockets is clearly not something that matters for storage writes, etc.

So if you ask me what will happen if something returns a successful short write: you retry once, and then you will get ENOSPC. No other scenario will materialize.

But the part that scares me, and would prevent me from backing an API that never returns short writes at all, is the fact that the man page for write is incredibly vague. Someone working on the kernel API layer may certainly treat this as a license to return a short write if a need ever arises.

vlovich commented 2 years ago

As I understand it, the typical Linux kernel policy is that, documentation or not, changes that break "correct" observable user-space behavior aren't allowed. Of course, what counts as "correct" is a mixture of man pages and the common sense of Linus & other devs, so who knows how they'd rule on some hypothetical future breakage.

Given that you want the end user to handle a short write, it still feels like it should be a GlommioError::ShortWrite error that reports the number of bytes written AND contains the original buffer, no? Otherwise how do I implement recovery from short writes as an application author if I'm adopting 0-copy and using Glommio-provided buffers for filling up all I/O?
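
Something along these lines, purely as a sketch of what I mean; none of these names exist in glommio today, and I'm assuming the buffer type is glommio::io::DmaBuffer:

```rust
use glommio::io::DmaBuffer;

// Hypothetical sketch only: what a ShortWrite-style error could carry. This
// does not exist in glommio; the names are made up for illustration.
pub enum WriteError {
    /// The device accepted only part of the buffer (e.g. because it is full).
    ShortWrite {
        /// How many bytes were actually written.
        written: usize,
        /// The original buffer, handed back so the caller can retry the rest
        /// without keeping a second copy of the data.
        buf: DmaBuffer,
    },
    /// Any other I/O failure.
    Io(std::io::Error),
}
```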

glommer commented 2 years ago

Yes, there's usually this policy, but because short writes are already present, the usual thing to do upon a short write is to try again. It is not unreasonable to assume that programs would be testing for short writes and retrying.

I feel like, from our PoV, if we want to call this an error (GlommioError::ShortWrite) and return that instead of a success, it is a reasonable thing to do.

vlovich commented 2 years ago

Should the error handling for the short write be in Source, DmaFile, or somewhere else? I was looking at uring.rs, and maybe it belongs in process_one_event, but since process_one_event is generic I'm uneasy about handling it directly there; maybe I should put it in the callback instead.

The other bit of complexity is how to hand back the original DmaBuffer. Since the beginning has been written, the end user would expect to call write_at(buf, old_pos + length of short write), but I'm not sure how to truncate the old buffer's length - is that what the internal trim variable is supposed to represent? Similarly, it's not clear to me where I should be getting back ownership of the buffer that I handed off.

glommer commented 2 years ago

This should not be at the Source level. This is a higher-level API decision; breaking with what the OS provides should happen as high up as possible.

Because we ingest the buffer, it is not currently possible (with this API) to return it. We considered in the past returning Result<DmaBuffer> from this API. If a short write becomes an error, that actually removes the major reason to return a usize instead of returning the buffer back, so we could do that.
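
To make that concrete, the two shapes under discussion would look roughly like this; hypothetical signatures for illustration only, neither is the current API:

```rust
use glommio::io::DmaBuffer;

/// Hypothetical trait used only to show the two signatures side by side;
/// neither method exists in glommio under these names.
trait WriteAtShapes {
    /// Today's shape as described in this thread: the buffer is consumed and
    /// the caller only learns how many bytes were written (possibly fewer
    /// than it submitted).
    async fn write_at_returning_len(&self, buf: DmaBuffer, pos: u64) -> std::io::Result<usize>;

    /// The shape discussed above: if short writes become errors, success
    /// always means "everything was written", so the buffer can be handed
    /// back instead of a length.
    async fn write_at_returning_buf(&self, buf: DmaBuffer, pos: u64) -> std::io::Result<DmaBuffer>;
}
```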

vlovich commented 1 year ago

So, just to make sure I'm understanding properly: change the return type so that Source returns Result<DmaBuffer>, and then DmaFile would be responsible for detecting short writes?

I think that works, but there are a couple more nuances:

  1. How do I truncate the DmaBuffer to indicate that the beginning part of the buffer no longer needs to be written? Is that what the unused trim variable is supposed to represent? (len() appears not to take it into account, so maybe it's for some other purpose; it might just be a bug, but I'm not sure, since I couldn't find any usage of that variable.)
  2. Presumably Source would need to return Result<usize, DmaBuffer> to report how much was written, no? Or are you envisioning that it's the caller's responsibility to remember how much was sent in and figure out a length based on how much was returned as unwritten?

glommer commented 1 year ago

hey @vlovich .

A couple of comments:

My intuition here is that we should return the whole buffer back. If we return an error containing the length that was effectively written, then it's the user's responsibility to deal with how to retry (you could retry the whole buffer again from the previous position after freeing up some space, for example).

About 2, I can see implementations going both ways, but very likely we'll have to return the DmaBuffer back from the Source, yes, due to Rust ownership rules. Not doing that is likely possible, but my guess is that it wouldn't be the easiest.
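
To sketch what that would look like from the caller's side, with stand-in types defined here so the example is self-contained (not real glommio APIs):

```rust
// Stand-ins for the proposed API, defined locally for illustration only.
struct Buf(Vec<u8>);

enum WriteErr {
    /// Only part of the buffer was written; the buffer comes back whole.
    ShortWrite { written: usize, buf: Buf },
    Io(std::io::Error),
}

// Stand-in for the proposed write_at: success returns the buffer, and a short
// write returns it inside the error together with the written length.
fn write_at(_pos: u64, buf: Buf) -> Result<Buf, WriteErr> {
    // Pretend the device is full and only 512 bytes fit.
    let written = buf.0.len().min(512);
    if written < buf.0.len() {
        Err(WriteErr::ShortWrite { written, buf })
    } else {
        Ok(buf)
    }
}

fn main() {
    let buf = Buf(vec![0u8; 4096]);
    match write_at(0, buf) {
        Ok(_buf) => println!("fully written, buffer returned for reuse"),
        Err(WriteErr::ShortWrite { written, buf }) => {
            // The strategy described above: free some space (application
            // specific), then retry the whole buffer from the original
            // position.
            println!("short write of {written} bytes; will retry the whole buffer");
            let _retry_later = (buf, 0u64);
        }
        Err(WriteErr::Io(e)) => eprintln!("write failed: {e}"),
    }
}
```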