dotnet / spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark
MIT License

Unit testing C# code that relies on DataFrame and DataFrameReader #637

Closed ccard closed 4 years ago

ccard commented 4 years ago

I am using Microsoft.Spark.Sql in my C# Spark project, and I am trying to develop unit tests for classes that manipulate DataFrames.

I am running into issues trying to mock or fake out either SparkSession or DataFrame, since they are sealed classes. Ideally I would like to create a SparkSession object that doesn't rely on being connected to an external source, to avoid test instability. I haven't found any resources indicating how to accomplish this so far. I have looked at the test classes in this repo, which do accomplish this, but the objects those tests mock are not available to consumers of the public NuGet package.

I would eventually like to be able to write code in my tests like the following snippet from this post:

var data = new List<GenericRow>();
data.Add(new GenericRow(new object[] { "Alice", new Date(2020, 1, 1) }));
data.Add(new GenericRow(new object[] { "Bob", new Date(2020, 1, 2) }));

var schema = new StructType(new List<StructField>()
{
    new StructField("name", new StringType()),
    new StructField("date", new DateType())
});

DataFrame df = spark.CreateDataFrame(data, schema);

I would appreciate some help either creating a SparkSession that doesn't rely on an external connection or mocking the SparkSession/DataFrame objects. I want to create unit tests that don't rely on external connections if possible.

Does anyone know if this is possible, or has anyone done it before?

Niharikadutta commented 4 years ago

Have you taken a look at this class for ideas?

imback82 commented 4 years ago

@ccard, if you want to completely remove the external connection, you could implement your own versions of IJvmBridge and IJvmBridgeFactory and inject them into SparkEnvironment at the start of your application. Please check this: https://github.com/dotnet/spark/blob/master/src/csharp/Microsoft.Spark/Interop/SparkEnvironment.cs

ccard commented 4 years ago

@imback82 Thanks for the example, but I should clarify that the code I am writing is an external consumer of this repository, not a contributor. The example you gave is pretty much what I need to accomplish, but in code that cannot see the internal interfaces or classes of this code base.

@Niharikadutta Thanks for the example as well, but I have the same problem as mentioned above.

suhsteve commented 4 years ago

Here is an example of how to mock IJvmBridge and IJvmBridgeFactory. To expose Microsoft.Spark internals to your project, I would suggest trying out IgnoresAccessChecksToGenerator.

ccard commented 4 years ago

Thank you all for your pointers and suggestions, they have gotten me a lot farther than other resources have.

@suhsteve Thanks for your suggestion; it would have worked if the Microsoft.Spark.Interop namespace were published with the Microsoft.Spark package and the constructor for SparkSession were not internal.

With all of your suggestions, I have come to find that the published package, as is, is not mockable by a consuming project (unless a paid mocking library like Typemock or JustMock is used; those allow mocking of sealed classes).

In order to make this repo more testable through mocks, I see two approaches.

If the suggestion @suhsteve provided were the route to take (which I think would let most consumers of this package unit test without external connections), the following changes would be needed:

  1. The Microsoft.Spark.Interop namespace would need to be published (so that IJvmBridge and IJvmBridgeFactory can be mocked or extended).
  2. The access modifier on the constructor for SparkSession would need to change from internal to public, so the mocked interfaces from point 1 can be passed in.

The other approach is to unseal the DataFrame class so it can be mocked (this is very specific to my scenario and may require more classes to be unsealed): change public sealed class DataFrame to public class DataFrame.

As the published package currently stands, it is not very testable unless you want to pay a license fee for a mocking SDK that allows mocking sealed classes.

What is the best way to make a feature request to make this package more test-friendly for consuming code: leave it on this thread, or create a separate feature request?

suhsteve commented 4 years ago

@ccard I don't know where you're running into issues, but please see this example project and let me know if this is what you are looking for. I'm able to create a mock of IJvmBridge and also access the SparkSession constructor.

Program.cs:

using Microsoft.Spark.Interop;
using Microsoft.Spark.Interop.Ipc;
using Microsoft.Spark.Sql;
using Moq;

namespace example
{
    class Program
    {
        static void Main(string[] args)
        {
            var mockJvm = new Mock<IJvmBridge>();
            mockJvm
                .Setup(m => m.CallStaticJavaMethod(
                    It.IsAny<string>(),
                    It.IsAny<string>(),
                    It.IsAny<object>()))
                .Returns(
                    new JvmObjectReference("result", mockJvm.Object));

            var mockJvmBridgeFactory = new Mock<IJvmBridgeFactory>();
            mockJvmBridgeFactory
                .Setup(m => m.Create(It.IsAny<int>()))
                .Returns(mockJvm.Object);

            SparkEnvironment.JvmBridgeFactory = mockJvmBridgeFactory.Object;

            // SparkSession is accessible.
            SparkSession s = new SparkSession(null);
        }
    }
}

example.csproj:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp3.1</TargetFramework>
  </PropertyGroup>

  <PropertyGroup>
    <InternalsAssemblyNames>Microsoft.Spark</InternalsAssemblyNames>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="Microsoft.Spark" Version="0.12.1" />
    <PackageReference Include="Moq" Version="4.14.5" />
    <PackageReference Include="IgnoresAccessChecksToGenerator" Version="0.4.0" PrivateAssets="All" />
  </ItemGroup>

</Project>
ccard commented 4 years ago

@suhsteve thank you for posting that example. I had missed the PrivateAssets="All" attribute in the package references, and now I am able to get your example working, with one exception: I am getting errors that IJvmBridgeFactory doesn't exist, even though I included both of the namespaces from your example:

error CS0246: The type or namespace name 'IJvmBridgeFactory' could not be found (are you missing a using directive or an assembly reference?)
error CS1503: Argument 1: cannot convert from 'Microsoft.Spark.Interop.Ipc.IJvmBridge' to '?'
error CS0117: 'SparkEnvironment' does not contain a definition for 'JvmBridgeFactory'
suhsteve commented 4 years ago

The example I pasted builds fine for me. What version of Microsoft.Spark are you using?

ccard commented 4 years ago

@suhsteve Here is my project file:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFrameworks>netcoreapp2.1</TargetFrameworks>
  </PropertyGroup>

  <ItemGroup>
    <Content Remove="appsettings.json" />
  </ItemGroup>

  <PropertyGroup>
    <InternalsAssemblyNames>Microsoft.Spark</InternalsAssemblyNames>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="IgnoresAccessChecksToGenerator" Version="0.4.0" PrivateAssets="All" />
    <PackageReference Include="Microsoft.NET.Test.Sdk" Version="16.5.0" />
    <PackageReference Include="Microsoft.Spark" Version="0.12.1" />
    <PackageReference Include="Moq" Version="4.14.5" />
    <PackageReference Include="MSTest.TestAdapter" Version="2.1.0" />
    <PackageReference Include="MSTest.TestFramework" Version="2.1.0" />
    <PackageReference Include="coverlet.collector" Version="1.2.0" />
  </ItemGroup>

</Project>

But one of the projects this one depends on pulls in Microsoft.Spark 0.9.0.0 as a transitive dependency from another package.

Project A packages: [package list not shown]

Unit test project (the one I am trying to write the mocks in) packages: [package list not shown]

Maybe I am getting version dependency issues?
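One way to confirm a suspected version conflict like this is to list the test project's full dependency closure with the dotnet CLI; the project file name below is a placeholder for the actual test project.

```shell
# List every package the test project resolves, including transitive
# dependencies pulled in through project and package references; scan
# the output for conflicting Microsoft.Spark versions (e.g. 0.12.1 vs
# 0.9.0). Requires the .NET SDK and a restored project.
dotnet list UnitTestProject.csproj package --include-transitive
```

If two different Microsoft.Spark versions show up, adding a direct PackageReference to the desired version in the test project usually pins the resolution, since direct references take precedence over transitive ones.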

ccard commented 4 years ago

@suhsteve your suggestions and help were fantastic. I now have it working with the code snippet below for Microsoft.Spark 0.9.0.0. Thank you for your help; I really appreciate it.

var mockJvm = new Mock<IJvmBridge>();

// Stub the static JVM calls (one-, two-, and params-array overloads)
// so every call returns a fake JVM object reference.
mockJvm
    .Setup(m => m.CallStaticJavaMethod(
        It.IsAny<string>(),
        It.IsAny<string>(),
        It.IsAny<object>()))
    .Returns(new JvmObjectReference("result", mockJvm.Object));
mockJvm
    .Setup(m => m.CallStaticJavaMethod(
        It.IsAny<string>(),
        It.IsAny<string>(),
        It.IsAny<object>(),
        It.IsAny<object>()))
    .Returns(new JvmObjectReference("result", mockJvm.Object));
mockJvm
    .Setup(m => m.CallStaticJavaMethod(
        It.IsAny<string>(),
        It.IsAny<string>(),
        It.IsAny<object[]>()))
    .Returns(new JvmObjectReference("result", mockJvm.Object));

// Stub the non-static JVM calls the same way.
mockJvm
    .Setup(m => m.CallNonStaticJavaMethod(
        It.IsAny<JvmObjectReference>(),
        It.IsAny<string>(),
        It.IsAny<object>()))
    .Returns(new JvmObjectReference("result", mockJvm.Object));
mockJvm
    .Setup(m => m.CallNonStaticJavaMethod(
        It.IsAny<JvmObjectReference>(),
        It.IsAny<string>(),
        It.IsAny<object>(),
        It.IsAny<object>()))
    .Returns(new JvmObjectReference("result", mockJvm.Object));
mockJvm
    .Setup(m => m.CallNonStaticJavaMethod(
        It.IsAny<JvmObjectReference>(),
        It.IsAny<string>(),
        It.IsAny<object[]>()))
    .Returns(new JvmObjectReference("result", mockJvm.Object));

SparkEnvironment.JvmBridge = mockJvm.Object;

// SparkSession is accessible without any external connection.
SparkSession s = SparkSession.Active();

var data = new List<GenericRow>();
data.Add(new GenericRow(new object[] { "Alice", new Date(2020, 1, 1) }));
data.Add(new GenericRow(new object[] { "Bob", new Date(2020, 1, 2) }));

var schema = new StructType(new List<StructField>()
{
    new StructField("name", new StringType()),
    new StructField("date", new DateType()),
});

DataFrame df = s.CreateDataFrame(data, schema);

mstrate commented 1 year ago

@ccard - Thanks for all of your questions on this. I was able to get this to compile; however, the DataFrame returned from CreateDataFrame doesn't contain the data or the schema provided. Did it actually produce the correct results for you?