dotnet / spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark
MIT License
2.03k stars 315 forks source link

[BUG]: .Net Interactive: UDF use deprecated BinaryFormatter, fail in VS Code #795

Open oefe opened 3 years ago

oefe commented 3 years ago

Describe the bug UDFs are serialized using BinaryFormatter, which is deprecated for security reasons.

In environments where BinaryFormatter is already disabled, this causes UDFs to fail with a NotSupportedException.

To Reproduce

Steps to reproduce the behavior:

  1. Use Visual Studio Code with the .Net Interactive Notebooks extension
  2. Follow the steps in UDFs in .NET Interactive
  3. Executing cell #3 (which contains the definition of the UDF) fails with:
Error: System.NotSupportedException: BinaryFormatter serialization and deserialization are disabled within this application. See https://aka.ms/binaryformatter for more information.
at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Serialize(Stream serializationStream, Object graph)
at Microsoft.Spark.Utils.CommandSerDe.Serialize(Delegate func, SerializedMode deserializerMode, SerializedMode serializerMode)
at Microsoft.Spark.Sql.Functions.CreateUdf(String name, Delegate execute, PythonEvalType evalType, String returnType)
at Microsoft.Spark.Sql.Functions.CreateUdf[TResult](String name, Delegate execute, PythonEvalType evalType)
at Microsoft.Spark.Sql.Functions.CreateUdf[TResult](String name, Delegate execute)
at Microsoft.Spark.Sql.Functions.Udf[T,TResult](Func`2 udf)
at Submission#11.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)

Expected behavior

Screenshots NA

Desktop (please complete the following information):

Additional context NA

oefe commented 3 years ago

The same example also fails in Jupyter, but for a different reason: #796

imback82 commented 3 years ago

Thanks @oefe for reporting this. We will move away from BinaryFormatter from upcoming releases.

imback82 commented 3 years ago

cc @suhsteve @jonsequitur (FYI)

Frassle commented 3 years ago

Not sure what the dotnet plan for replacing BinaryFormatter here is, but I was wanting to use net spark in F# notebooks (where UDFs have never been supported), plus I had a couple of other similar problems that all looked solvable if dotnet had something like pythons cloudpickle. So I wrote one, https://github.com/Ibasa/Pikala. Forked spark to use it and can now run F# UDFs from notebooks.

GeorgeS2019 commented 1 year ago

We will move away from BinaryFormatter from upcoming releases.

@imback82 Is there a solution to this, now is 2023?

dbeavon commented 7 months ago

@GeorgeS2019 @imback82

Interestingly if you target .net standard we don't see the compiler errors: netstandard2.0;netstandard2.1 https://learn.microsoft.com/en-us/dotnet/standard/net-standard?tabs=net-standard-2-1#select-net-standard-version

I think there would still be runtime errors in a project built for .net 8.

The bigger issues is that binaryformatter is bad practice from a security standpoint (and moreover it won't allow AOT compilation of the worker in .net 8!)

UPDATE 4/16:

As I mentioned, you can get past compiler errors simply by building a project in a way that targets .net standard. This will still be compatible with .net 8. See a discussion about compatibility here: https://github.com/dotnet/runtime/discussions/90829

And you can also get past any runtime errors on .Net 8 simply by using these settings in the project file:

    <NoWarn>$(NoWarn);SYSLIB0011</NoWarn>
    <EnableUnsafeBinaryFormatterSerialization>true</EnableUnsafeBinaryFormatterSerialization>

... they can be introduced in .net 8 project files - of our drivers, and of the Microsoft.Spark.Worker as well.

I understand that the binary deserialization is a security threat ... but it seems a bit overblown. It is mitigated when the spark cluster is hosted entirely on a private network and the deserialization is only happening between drivers and executors. In any case, these deserialization-related concerns must be weighed against the needs of spark. There are few other environments that do as much serialization/deserialization as spark does!

I see that the "kryo" serialization also has had well-known vulnerabilities that never stopped it from being used in the Spark ecosystem:

Links regarding kryo: https://stackoverflow.com/questions/40261987/when-to-use-kryo-serialization-in-spark https://www.contrastsecurity.com/security-influencers/serialization-must-die-act-1-kryo-serialization

... I think the ultimate answer is to respect the "vulnerabilities" of binary serialization, and to disable that type of serialization by default. In practice users should be allowed to re-enable the functionality for the sake of convenience and performance. It is a measured risk, that should be left to the application developers.

GeorgeS2019 commented 7 months ago

@dbeavon

From ChatGPT4 =>

The BinaryFormatter in .NET has indeed been marked as obsolete due to security concerns. Here are the key points:

  1. Obsolete Methods:

    • The following methods are now obsolete and produce a compile-time warning with ID SYSLIB0011:
      • BinaryFormatter.Serialize
      • BinaryFormatter.Deserialize
      • Formatter.Serialize (Stream, Object)
      • Formatter.Deserialize (Stream)
      • IFormatter.Serialize (Stream, Object)
      • IFormatter.Deserialize (Stream)
    • In .NET 7, these methods were marked as error.
  2. Recommended Action:

    • Stop using BinaryFormatter in your code.
    • Instead, consider using JsonSerializer or XmlSerializer.
    • If you temporarily need to suppress the BinaryFormatter compile-time warning, you can use #pragma directives around the individual call site or suppress it in the project file¹.
  3. Project Types Affected:

    • In .NET 7, the methods still functioned properly in most project types (excluding ASP.NET, WASM, and MAUI) even after being marked obsolete⁴.
    • However, it's strongly recommended not to continue using BinaryFormatter due to security risks¹.
dbeavon commented 7 months ago

ChatGPT seems pretty wrong on some points. The thing I wanted to point out is that BinaryFormatter seems to be allowed in .Net 8... if nothing else that is for the sake of compatibility with .NetStandard library dependencies. The approach to getting this working is very similar to .Net 7 (use EnableUnsafeBinaryFormatterSerialization, and disable SYSLIB0011 warnings).

The fact that we are dealing with a JVM-based (linux-based) platform means we can give ourselves more flexibility than a typical .Net solution. I would agree that we need to avoid BinaryFormatter by default, but it should probably allowed based on the discretion of users. In any case, I believe that the other serialization libraries which Spark/Jvm uses would also have the security concerns that apply to BinaryFormatter.

grazy27 commented 3 days ago

For .NET interactive withing polyglot notebooks, you can use AppContext.SetSwitch("System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization", true);

BinarySerializer is not that dangerous in Dotnet.Spark context, as UDFs and broadcasts are actually designed to be executable code that can be serialized on driver, and executed on worker. Both python and scala UDFs are doing exactly the same, but in other languages.