dotnet / spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark
MIT License

JVM IPC Deserialization uses BinaryFormatter, which is now Deprecated for OWASP CWE #1131

Closed. cutecycle closed this issue 6 months ago

cutecycle commented 1 year ago

Describe the bug

We'll be getting SYSLIB0011 errors because of the way Broadcast, Worker, and RDD format streams with BinaryFormatter.

My current understanding:

Historical example: jni4net (also uses BinaryFormatter, but has some level of struct definition): https://github.com/jni4net/jni4net/blob/ac2189c37253710e7b729797631419b0bf3b8559/jni4net.tested.n/src/generated/net/sf/jni4net/tested/JavaInstanceFields.generated.cs#L16

Notes: Can we write directly to MemoryStream? https://github.com/dotnet/spark/pull/1112#discussion_r1094785106
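One way to read the question above: instead of handing an object graph to a general-purpose formatter, each field is written explicitly into an in-memory buffer with a fixed layout. The sketch below is a Python analog (`io.BytesIO` standing in for `MemoryStream`), not code from dotnet/spark. The record layout and the names `write_record` and `read_record` are hypothetical.

```python
import io
import struct
from typing import Tuple

HEADER = struct.Struct(">qI")  # 8-byte signed key, 4-byte unsigned payload length

def write_record(buf: io.BytesIO, key: int, payload: bytes) -> None:
    """Append one record: fixed-size header followed by the raw payload bytes."""
    buf.write(HEADER.pack(key, len(payload)))
    buf.write(payload)

def read_record(buf: io.BytesIO) -> Tuple[int, bytes]:
    """Read one record written by write_record; no type names are ever decoded."""
    header = buf.read(HEADER.size)
    if len(header) != HEADER.size:
        raise EOFError("truncated record header")
    key, length = HEADER.unpack(header)
    payload = buf.read(length)
    if len(payload) != length:
        raise EOFError("truncated record payload")
    return key, payload

# Round trip through an in-memory stream.
buf = io.BytesIO()
write_record(buf, 42, b"hello")
buf.seek(0)
key, payload = read_record(buf)
```

Because the reader only ever produces an integer and a byte string, the stream cannot ask it to instantiate an arbitrary type, which is the property BinaryFormatter lacks.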

"passing protobufs between Java and C using JNI": https://medium.com/@dhaval.durve/passing-protobufs-between-java-and-native-c-code-using-jni-9808b60f6d2c

An equivalent of this CVE, and the object filter used to resolve it: https://security.snyk.io/vuln/SNYK-PYTHON-PYSPARK-3021140

https://github.com/apache/spark/pull/18166/files#diff-6a1d1601920af68466d7c30dc02170468abbe408138734c00d50d2ba1b81ba35R179
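The object filter idea linked above can be sketched in Python as an allowlist applied before any class is resolved during deserialization. This is an analog using the standard `pickle` module, not the Java code from the Spark PR. The `ALLOWED` set and the names `AllowlistUnpickler` and `safe_loads` are hypothetical.

```python
import io
import pickle

# Hypothetical allowlist: (module, name) pairs the receiver agrees to resolve.
ALLOWED = {
    ("collections", "OrderedDict"),
}

class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the stream asks for a global (class or function).
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"{module}.{name} is not on the allowlist")

def safe_loads(data: bytes):
    """Deserialize data, rejecting any type that is not explicitly allowed."""
    return AllowlistUnpickler(io.BytesIO(data)).load()
```

Plain lists, dicts, strings, and numbers never reach `find_class`, so they load normally; anything that needs a named class is rejected unless it is listed. A filter like this narrows what a stream can construct, but it does not make a general-purpose formatter safe for untrusted input.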

BinaryFormatter Guidance: https://learn.microsoft.com/en-us/dotnet/standard/serialization/binaryformatter-security-guide

Arrow buffers: https://arrow.apache.org/docs/python/ipc.html

BinaryFormatter marshaller in protobuf-net.Grpc: https://github.com/protobuf-net/protobuf-net.Grpc/blob/main/tests/protobuf-net.Grpc.Test.Integration/CustomMarshaller.cs

Protobuf scalar bytes for arbitrary byte lengths: https://developers.google.com/protocol-buffers/docs/proto3#scalar

Wind down plan in dotnet: https://github.com/dotnet/designs/pull/141/commits/bd0a0661f9d248ed31a354d27ad026efd6719690

"Is binary serialization inherently unsafe?" https://stackoverflow.com/a/66825699

PySpark's implementation of this is based on py4j; they were going to use protobuf but opted for strings: https://github.com/py4j/py4j/blob/b4514ecd40ea121a35f9cf50bbf2ccea95354245/py4j-python/src/py4j/protocol.py#L9 https://github.com/py4j/py4j/blob/1f8a0b6dc216f16092d9c1b2556897eec8653a62/py4j-python/src/py4j/java_gateway.py#L1737
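To illustrate what a string-based protocol of that kind looks like, here is a minimal sketch in which every value is one line of text with a one-character type tag. It is only loosely modeled on the idea; the tags, escaping rules, and function names below are hypothetical and are not py4j's actual wire format.

```python
def encode(value) -> str:
    """Encode a primitive as a tagged, newline-free string."""
    if value is None:
        return "n"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "b" + ("true" if value else "false")
    if isinstance(value, int):
        return "i" + str(value)
    if isinstance(value, float):
        return "d" + repr(value)
    if isinstance(value, str):
        return "s" + value.replace("\\", "\\\\").replace("\n", "\\n")
    raise TypeError(f"unsupported type: {type(value).__name__}")

def _unescape(body: str) -> str:
    """Reverse the escaping applied to strings in encode."""
    out = []
    chars = iter(body)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            out.append("\n" if nxt == "n" else nxt)
        else:
            out.append(ch)
    return "".join(out)

def decode(line: str):
    """Decode one tagged line; only the primitives listed here can come out."""
    tag, body = line[:1], line[1:]
    if tag == "n":
        return None
    if tag == "b":
        return body == "true"
    if tag == "i":
        return int(body)
    if tag == "d":
        return float(body)
    if tag == "s":
        return _unescape(body)
    raise ValueError(f"unknown type tag: {tag!r}")
```

The decoder has a closed set of output types, so there is no path from the wire to constructing an arbitrary object. The cost is that anything richer than a primitive has to be expressed as a sequence of such lines or as a reference to an object held on the other side.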

Though I will say... things seem to be BinaryFormatter all the way down? https://github.com/protobuf-net/protobuf-net/search?q=binaryformatter

cutecycle commented 1 year ago

referencing #795

arsdragonfly commented 6 months ago

This is entirely pointless for UDFs. UDFs execute arbitrary code remotely on workers by design. The entire point of UDFs on Spark is distributed RCE. Whichever formatter you use does not change the fact that the payload is, and has to be, arbitrary computation.
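A small sketch of that argument, as a Python analog rather than anything from .NET for Apache Spark: the message is parsed with JSON, which only ever builds plain data, yet the worker still runs whatever code the message carries. The message shape (`entry`, `source`) and the name `run_udf` are hypothetical.

```python
import json

def run_udf(message: str, rows):
    """Parse a UDF message with a 'safe' format, then run the code it carries."""
    spec = json.loads(message)       # parsing constructs only dicts, lists, strings, numbers
    namespace = {}
    exec(spec["source"], namespace)  # the payload is code, so executing it is the feature
    udf = namespace[spec["entry"]]
    return [udf(row) for row in rows]

message = json.dumps({
    "entry": "double",
    "source": "def double(x):\n    return x * 2\n",
})
result = run_udf(message, [1, 2, 3])
```

Swapping the wire format changes how the bytes are parsed, not what the worker is asked to do with them.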

cutecycle commented 6 months ago

That makes sense!