apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

`CometBuffer` can potentially lead to concurrent modification of a held buffer (aka is "Unsound" in Rust terms) #1035

Open tustvold opened 4 weeks ago

tustvold commented 4 weeks ago

Describe the bug

It was brought to my attention in https://github.com/apache/arrow-rs/pull/6616 that comet is currently violating the aliasing rules of the Rust compiler. In particular it is mutating memory without exclusive ownership.

The docs on CometBuffer actually call out that the type is unsound - https://github.com/apache/datafusion-comet/blob/main/native/core/src/common/buffer.rs#L166.

This is the underlying cause of https://github.com/apache/datafusion-comet/issues/1030, which is a relatively harmless manifestation of what is ultimately undefined behaviour.

Even putting aside that UB effectively means the program could do literally anything, the exact scenario in #1030 could easily lead to out of bounds memory access, e.g. by unmasking a dictionary key that was previously null and now points to some arbitrary memory location.

I debated filing this ticket, as I wasn't sure how it would be received, but I think it is a sufficiently critical vulnerability that should at the very least be tracked / documented. The way it was being dismissed made me honestly a little uncomfortable. Ultimately CometBuffer is unsound, and there is a concrete example of this unsoundness leading to undefined behaviour in #1030.

Steps to reproduce

Expected behavior

No response

Additional context

FYI @viirya

viirya commented 4 weeks ago

I don't know why this is tagged as a bug...

tustvold commented 4 weeks ago

I don't know why this is tagged as a bug...

Apologies, I presumed that a security vulnerability was a bug...

viirya commented 4 weeks ago

I debated filing this ticket, as I wasn't sure how it would be received, but I think it is a sufficiently critical vulnerability that should at the very least be tracked / documented. The way it was being dismissed made me honestly a little uncomfortable.

As you mentioned, the doc of CometBuffer already documents the unsafe behavior. I'm not sure why you also said it is not documented.

I also don't know why it was being dismissed. We have dealt with that by deep copying the arrays in the necessary cases. The fact is that the take kernel has inconsistent behavior that is not well documented, which can easily confuse users. Before digging into its details, how would a user know it clones a buffer sometimes and not others? So for a user, isn't the first impression that it works fine when we don't deep copy the array? How would a user know it behaves differently, such that we must deep copy in corner cases? Especially since it is not documented.

alamb commented 4 weeks ago

I tried to update the title of this issue to better reflect what it is reporting.

tustvold commented 4 weeks ago

I am trying to report that your codebase has a critical bug in its handling of memory, without causing unnecessary angst. Modifying buffers behind the back of the Rust compiler is undefined behaviour, and can trivially lead to out of bounds memory access... This is typically considered a serious security defect...

As you mentioned the doc of CometBuffer already documents the unsafe behavior

CometBuffer does document that it has unsafe invariants, and this isn't in and of itself an issue. Yes, it is unsound, which Rust purists will complain about, but provided nothing uses it in a way that triggers undefined behaviour it isn't the end of the world. The problem is that #1030 highlights that it is not being used in a manner that avoids undefined behaviour. This is what I think should be documented, highlighted, and arguably fixed.

Edit: to expand on this, if, for example, comet were instead using into_mutable().unwrap() or similar, #1030 would instead be a runtime panic, which would avoid UB and make tracking down where assumptions are violated much easier.
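To illustrate the fail-fast pattern being described, here is a minimal std-only Rust sketch. It uses Arc in place of arrow-rs's Buffer, so the names and shapes here are assumptions for illustration, not the real API: attempting to take exclusive ownership fails detectably when another reference exists, instead of silently mutating shared memory.

```rust
use std::sync::Arc;

// Fail-fast take of exclusive ownership: if any other reference to the
// buffer exists, hand it back as an Err instead of allowing mutation of
// shared memory (which would be undefined behaviour).
fn into_mutable(buf: Arc<Vec<u8>>) -> Result<Vec<u8>, Arc<Vec<u8>>> {
    Arc::try_unwrap(buf)
}

fn main() {
    // Sole owner: safe to take the buffer and mutate it.
    let unique = Arc::new(vec![0u8; 4]);
    assert!(into_mutable(unique).is_ok());

    // A second reference exists: a detectable runtime Err, not silent UB.
    let shared = Arc::new(vec![0u8; 4]);
    let _alias = Arc::clone(&shared);
    assert!(into_mutable(shared).is_err());
}
```

Calling `.unwrap()` on the `Err` case is exactly the runtime panic described above: noisy, but easy to track down compared with silent memory corruption.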

viirya commented 4 weeks ago

The problem is #1030 highlights that it is not being used in a manner that avoids undefined behaviour

As I mentioned earlier, we made an effort to avoid it for the cases we knew about... But the problem is, nobody knows the kernel behaves inconsistently except those who already know the details. If we had known that in advance, we definitely would have avoided it, as we did for other operations. As I have mentioned many times, the kernel behavior is inconsistent and undocumented. It causes user confusion and forces users to spend time digging into implementation details just to figure out why this happens.

viirya commented 4 weeks ago

Edit: to expand on this, if, for example, comet were instead using into_mutable().unwrap() or similar, https://github.com/apache/datafusion-comet/issues/1030 would instead be a runtime panic, which would avoid UB and make tracking down where assumptions are violated much easier.

I believe that the Comet scan was developed long before the mutable API came out... not to mention the question of whether it is suitable for our cases.

alamb commented 4 weeks ago

I believe that the Comet scan was developed long before the mutable API came out.

I wonder if there might be some way to use Arc / ref counting to verify at runtime that there are no remaining references to the underlying CometBuffer before it is reused 🤔 Kind of like the try_unary_mut style kernel -- that way it would at least be easier to track down places where the assumption that there is no more sharing is invalidated.
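A std-only sketch of that try_unary_mut-style idea (the helper name here is hypothetical; the real arrow-rs kernels differ in shape): mutate in place only when the reference count proves exclusivity, and otherwise fall back to a copy rather than aliasing.

```rust
use std::sync::Arc;

// Hypothetical try_unary_mut-style helper: mutate in place only when we
// provably hold the sole reference; otherwise copy, mutate the copy, and
// swap it in. Returns true when the mutation happened in place.
fn apply_or_copy(buf: &mut Arc<Vec<u8>>, f: impl Fn(&mut Vec<u8>)) -> bool {
    if Arc::strong_count(buf) == 1 && Arc::weak_count(buf) == 0 {
        // Exclusive owner: no aliasing possible, mutate directly.
        f(Arc::get_mut(buf).expect("sole owner"));
        true
    } else {
        // Shared: never mutate behind the other references' backs.
        let mut copy = (**buf).clone();
        f(&mut copy);
        *buf = Arc::new(copy);
        false
    }
}

fn main() {
    let mut solo = Arc::new(vec![1u8]);
    assert!(apply_or_copy(&mut solo, |v| v.push(2))); // mutated in place

    let mut shared = Arc::new(vec![1u8]);
    let alias = Arc::clone(&shared);
    assert!(!apply_or_copy(&mut shared, |v| v.push(2))); // copied instead
    assert_eq!(*alias, vec![1u8]); // the alias observes no mutation
}
```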

viirya commented 4 weeks ago

I wonder if there might be some way to use Arc / ref counting to verify at runtime that there are no remaining references to the underlying CometBuffer before it is reused

The scan code was not developed by me. I guess that may not work, as CometBuffer internally doesn't use a pointer type like Arc; it is very low-level raw pointer manipulation. That's why I said into_mutable may not be suitable.

tustvold commented 4 weeks ago

As I mentioned earlier, we did effort to avoid it for the cases we know... But the problem is, nobody knows the kernel behaves inconsistently except for ones already know the details.

So basically it is safe predicated on undocumented assumptions about third-party code. This is at best optimistic 😅

I dunno, I'm going to disengage at this point but I have to say I have found the response here deeply disappointing and tbh rather troubling. I had hoped there'd at least be an acknowledgement that this API is hard to use safely, but instead the response is to suggest the problem lies in the third-party code for not conforming to your assumptions, and that you'll just add some extra copies to "workaround" this... There are ways to do this safely and I would have hoped that would be seen as a north star, even if not a priority at the moment...

alamb commented 4 weeks ago

Well, I don't presume to know what is / isn't the right design for other systems.

I think this ticket will serve as a good discussion for how to potentially improve the situation and I don't find the discussions or responses troubling.

alamb commented 4 weeks ago

The scan code was not developed by me. I guess that may not work, as CometBuffer internally doesn't use a pointer type like Arc; it is very low-level raw pointer manipulation. That's why I said into_mutable may not be suitable.

Makes sense -- thus it sounds like removing this assumption is likely a non-trivial effort.

What I personally hope / plan to do is work with people to make the ParquetExec reader in DataFusion so utterly compelling in terms of performance that Comet can consider using that instead of CometBuffer.

Current obsession is making pushdown predicates even faster with @XiangpengHao and @tustvold: https://github.com/apache/arrow-rs/issues/5523

andygrove commented 4 weeks ago

I do think that some valid concerns have been raised here. The questions at the top of my mind right now are:

  1. What additional tests could we add that would have caught this issue? Even if take were meeting the original assumptions today, its behavior could change in the future, so how do we ensure that our use of the unsafe CometBuffer is safe and that we don't have regressions if the behavior in arrow-rs changes?
  2. What would an integration with the arrow-rs / datafusion Parquet reader/exec look like? We need to integrate with Spark to read the raw bytes from Spark data sources (could be HDFS, S3, and many others, including custom readers) and would then need to hand those bytes off to native code for decoding. I would certainly like to be able to take advantage of the new Utf8View support.
viirya commented 4 weeks ago

2. What would an integration with the arrow-rs / datafusion Parquet reader/exec look like? We need to integrate with Spark to read the raw bytes from Spark data sources (could be HDFS, S3, and many others, including custom readers) and would then need to hand those bytes off to native code for decoding. I would certainly like to be able to take advantage of the new Utf8View support.

Integration with the DataFusion Parquet reader would need to break some of the design around the current native reader, as I mentioned in #1031. For now, it is not an easy transition, as simple as replacing A with B. It would require a huge effort, not to mention that a JVM reader frontend is important for some cases.

viirya commented 4 weeks ago

I wonder if there might be some way to use Arc / ref counting to verify at runtime that there are no remaining references to the underlying CometBuffer before it is reused

I just took another look at the related code. An Arc / ref counting check won't be effective here, also because these arrays are passed across JVM/native through the C Data interface. Even if I add an Arc to guard exclusive modification of the buffer data, once we export the arrays from the JVM and import them into native code, the imported buffers won't get the Arc ref counts back but start with new Arc pointers. We cannot get Arc working through importing/exporting arrays.

tustvold commented 3 weeks ago

start with new Arc pointers

This isn't an issue provided the JVM doesn't free or modify the buffers whilst any native code has them. arrow-rs will always treat externally shared memory as immutable; it requires the other participants to do likewise. Once the arrow reference count reaches 0 it will call release, and this can decrement the Java-side reference count, or whatever data structure it is using to track this. This is how FFI with pyarrow and arrow-cpp works.

viirya commented 3 weeks ago

@alamb suggested we could use Arc to check that there are no remaining references to the buffer before it is reused. What I said above is that this cannot work, as we cannot keep the Arc reference count across the JVM/native boundary. I don't know how that relates to what you are talking about. Of course that's how FFI works, but it doesn't work for the proposed check.

tustvold commented 3 weeks ago

I don't know how it is related to what you are talking about

I interpreted "I wonder if there might be some way to use Arc / ref counting" as referring to the ref-counting machinery present in Buffer, and was expanding on how it might be used to achieve the stated goal. The machinery to do this safely already exists; it is just a question of hooking it up.

Edit: To expand further, as there seems to be a lot of confusion here: the method linked in the original issue description just needs to pass something instead of Arc::new(0) that prevents the JVM from reusing or freeing the buffer until it is dropped.

viirya commented 3 weeks ago

Edit: To expand further, as there seems to be a lot of confusion here: the method linked in the original issue description just needs to pass something real instead of Arc::new(0) that prevents the JVM from reusing or freeing the buffer until it is dropped.

Huh? When exporting the array from native to the JVM, I believe the JVM doesn't care about this Allocation parameter. Am I missing something?

tustvold commented 3 weeks ago

I believe JVM doesn't care about this Allocation parameter

Indeed, currently in Comet the JVM doesn't use this parameter, and that is the problem :sweat_smile:

Contrast this with, for example, how arrow-rs handles receiving data over the C Data interface, e.g. for interop with arrow-cpp or pyarrow. It first constructs an FFI_ArrowArray to manage the allocation, and then passes this as the Allocation parameter - https://github.com/apache/arrow-rs/blob/master/arrow-array/src/ffi.rs#L239. This ensures that the memory is not freed or mutated by the external process until all arrow-rs buffers are done with it.
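To illustrate the ownership pattern being described, here is a minimal std-only sketch (the type names are hypothetical; the real FFI_ArrowArray machinery is more involved): every imported buffer shares one allocation guard, and the guard's Drop plays the role of the release callback, so the producer learns exactly when no consumer references remain.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical stand-in for the C Data interface pattern: imported
// buffers share one allocation guard whose Drop acts as the producer's
// release callback.
struct ImportedAllocation {
    released: Arc<AtomicBool>, // models "tell the producer it can reclaim this"
}

impl Drop for ImportedAllocation {
    fn drop(&mut self) {
        // Runs exactly once, when the last buffer drops its reference:
        // only now may the producer free or reuse the backing memory.
        self.released.store(true, Ordering::SeqCst);
    }
}

fn main() {
    let released = Arc::new(AtomicBool::new(false));
    let guard = Arc::new(ImportedAllocation { released: Arc::clone(&released) });

    // Two buffers sliced from the same allocation share the guard.
    let buffer_a = Arc::clone(&guard);
    let buffer_b = guard;

    drop(buffer_a);
    assert!(!released.load(Ordering::SeqCst)); // still referenced
    drop(buffer_b);
    assert!(released.load(Ordering::SeqCst)); // producer notified
}
```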

viirya commented 3 weeks ago

I believe the reason the Allocation parameter exists is that we claim an imported buffer uses a custom allocation, so arrow-rs won't try to deallocate it when dropping the buffer. I doubt that it is used to make sure the memory is not freed or mutated by the external process (actually, I doubt that could be guaranteed in any way). On the contrary, it ensures the memory allocated externally won't be freed by arrow-rs when the imported buffer is dropped.

tustvold commented 3 weeks ago

I doubt that it is used to make sure the memory is not freed or mutated by external process (actually I doubt how it is possible to guarantee it in any ways)

Nope that is precisely what it is for - https://arrow.apache.org/docs/format/CDataInterface.html#memory-management

Edit: I realize you could have been saying why Allocation is used in Comet, and yes, you are right; however, by using it in this way the JVM has no way to know when it can safely free or reuse the buffer, which is my whole point. If you instead did something similar to what the C Data interface does and gave it an actual allocation object, you wouldn't have this issue.

viirya commented 3 weeks ago

Nope that is precisely what it is for - https://arrow.apache.org/docs/format/CDataInterface.html#memory-management

I feel that we still are not on the same page...

However, any data pointed to by the struct MUST be allocated and maintained by the producer. Therefore, the consumer MUST not try to interfere with the producer’s handling of these members’ lifetime. The only way the consumer influences data lifetime is by calling the base structure’s release callback.

It clearly states that the data memory is allocated and maintained by the producer. The consumer must not interfere with the lifetime of the memory allocated by the producer. That is exactly what I said above.

viirya commented 3 weeks ago

Okay. I posted the above reply before seeing your addition to previous comment.

viirya commented 3 weeks ago

It first constructs an FFI_ArrowArray to manage the allocation, and then passes this as the Allocation parameter - https://github.com/apache/arrow-rs/blob/master/arrow-array/src/ffi.rs#L239. This ensures that the memory is not freed or mutated by the external process until all arrow-rs buffers are done with it. If you instead did something similar to what the C Data interface does and gave it an actual allocation object you wouldn't have this issue

I'm still not sure what you are referring to. It is clear to me that when Deallocation::Custom is given while importing an array through FFI, it means arrow-rs won't interfere with memory deallocation. This completely follows the C Data interface statement about memory lifetime that I quoted above (https://github.com/apache/datafusion-comet/issues/1035#issuecomment-2438389166). It has no means of preventing the external process from freeing or mutating the memory.

I also feel it somehow distracts from what was discussed at the beginning.

I wonder if there might be some way to use Arc / ref counting to verify at runtime that there are no remaining references to the underlying CometBuffer before it is reused

This is what @alamb suggested at the beginning. However, as I posted above, Arc ref counting won't work across JVM/native. We cannot rely on it to check for remaining references.

This isn't an issue provided the JVM doesn't free or modify the buffers whilst any native code has them.

But you claimed that is not an issue, and I don't know why. Also, Comet doesn't free or modify the buffers on the JVM side.

viirya commented 3 weeks ago

I believe JVM doesn't care about this Allocation parameter

Indeed, currently in Comet the JVM doesn't use this parameter, and that is the problem

Btw, to clarify: although this is not the focus, I think there is some confusion here. Comet doesn't implement the JVM Arrow FFI itself; it comes from Java Arrow.

I remember Java Arrow has a mechanism similar to arrow-rs's Deallocation::Custom that prevents Java Arrow from directly releasing an imported buffer.

tustvold commented 3 weeks ago

Ok I will try one final time to clarify from first principles:

  1. Something modifies the memory of a null buffer whilst an arrow-rs Buffer points to it, violating the Rust memory model (UB)
  2. It has been indicated that this is being done by the JVM parquet reader reusing the buffer
  3. Presuming we want to avoid UB we therefore need a way to stop 2. from happening
  4. This means we need some mechanism for arrow-rs to indicate to the JVM when it no longer holds any references to the data
  5. The ref counting machinery in Buffer provides the ability to register a custom Allocation
  6. When the reference count drops to zero it calls Allocation::drop
  7. This can be used to inform the JVM that the memory is no longer in use by arrow-rs and can be reclaimed / unpinned / reused
  8. Therefore by hooking up Allocation we can resolve the unsoundness of the current API and avoid future UB
  9. The way that arrow-rs hooks up the C Data Interface is an example of how this can be done
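The steps above can be sketched in miniature with std-only Rust (the names here are hypothetical, not Comet's actual API): the producer keeps a Weak handle to the allocation it exported and refuses to reuse the backing memory while any consumer Arc is still alive.

```rust
use std::sync::{Arc, Weak};

// Hypothetical producer side of the scheme above: it keeps only a Weak
// handle to the allocation it exported, so it can observe when every
// consumer reference (the Arcs held by arrow-rs buffers) is gone.
struct Producer {
    exported: Weak<Vec<u8>>,
}

impl Producer {
    // Reuse is safe only once no consumer Arc can still read the memory.
    fn can_reuse(&self) -> bool {
        self.exported.upgrade().is_none()
    }
}

fn main() {
    let buf = Arc::new(vec![0u8; 8]);
    let producer = Producer { exported: Arc::downgrade(&buf) };

    let consumer = buf; // e.g. an arrow-rs buffer holding the allocation
    assert!(!producer.can_reuse()); // outstanding reference: must not reuse

    drop(consumer); // the Allocation::drop moment (step 6)
    assert!(producer.can_reuse()); // now safe to reclaim / reuse (step 7)
}
```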
viirya commented 3 weeks ago
  2. It has been indicated that this is being done by the JVM parquet reader reusing the buffer

No, it is not the JVM reusing the buffer, but the native reader that reuses it. But I get your point. Let's replace it with the native parquet reader.

  5. The ref counting machinery in Buffer provides the ability to register a custom Allocation
  6. When the reference count drops to zero it calls Allocation::drop
  7. This can be used to inform the JVM that the memory is no longer in use by arrow-rs and can be reclaimed / unpinned / reused
  8. Therefore by hooking up Allocation we can resolve the unsoundness of the current API and avoid future UB
  9. The way that arrow-rs hooks up the C Data Interface is an example of how this can be done

Okay. I got what you said.

I believe that the confusion began from https://github.com/apache/datafusion-comet/issues/1035#issuecomment-2438072519, https://github.com/apache/datafusion-comet/issues/1035#issuecomment-2438149174 and https://github.com/apache/datafusion-comet/issues/1035#issuecomment-2438232880.

I thought you were suggesting providing a custom Allocation to the JVM, which was confusing, as the JVM doesn't take one when importing arrays through FFI.

What you are suggesting can be simplified to one sentence: provide a custom Allocation when CometBuffer exports an arrow buffer, and use it to detect whether the exported buffer has been dropped by the importer. That makes sense. I will give it a try. Thanks.

andygrove commented 3 weeks ago

What I personally hope / plan to do is work with people to make the ParquetExec reader in DataFusion so utterly compelling in terms of performance that Comet can consider using that instead of CometBuffer.

@alamb @viirya I created a separate issue to continue this discussion:

https://github.com/apache/datafusion-comet/issues/1040