dotnet / spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark

[BUG]: TimestampType.FromInternal does not handle nulls #527

Closed: borgdylan closed this issue 4 years ago

borgdylan commented 4 years ago

Describe the bug

When iterating over a DataFrame with nullable timestamp columns, a NullReferenceException is thrown. The cause is TimestampType.FromInternal, which does not contain a null check.

To Reproduce

Steps to reproduce the behavior:

  1. Iterate with foreach over the output of ToLocalIterator() on a DataFrame whose timestamp columns contain nulls (see the sketch below).
  2. Observe the NullReferenceException.

Expected behavior

.NET for Apache Spark should support nullable timestamp columns, just as Spark itself already does.
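A minimal repro sketch (illustrative, not from the original report; it assumes a working .NET for Apache Spark setup, and the app name and query are placeholders):

using System;
using Microsoft.Spark.Sql;

SparkSession spark = SparkSession
    .Builder()
    .AppName("nullable-timestamp-repro")
    .GetOrCreate();

// One non-null and one NULL value in the same timestamp column.
DataFrame df = spark.Sql(
    "SELECT CURRENT_TIMESTAMP() AS ts " +
    "UNION ALL " +
    "SELECT CAST(NULL AS TIMESTAMP) AS ts");

// Iterating pushes each internal value through TimestampType.FromInternal;
// the NULL row triggers the NullReferenceException described above.
foreach (Row row in df.ToLocalIterator())
{
    Console.WriteLine(row.Get("ts") ?? "(null)");
}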


elvaliuliuliu commented 4 years ago

Thanks for reporting this. I will take a look.

borgdylan commented 4 years ago

This is a major blocker for some of my company's ETL pipelines.

borgdylan commented 4 years ago

I also think that the DateType.FromInternal method suffers from the same bug (inspected using ILSpy).

borgdylan commented 4 years ago

I have added the null checks locally and can verify that it fixes the bug.

borgdylan commented 4 years ago

Below, please find a git patch series (one commit) that fixes the issue:

From c8f422f3fb7de6bf3e4e0b8b7dc8a463928cf832 Mon Sep 17 00:00:00 2001
From: Dylan Borg <borgdylan@hotmail.com>
Date: Wed, 20 May 2020 11:21:51 +0200
Subject: [PATCH] Handle nulls in (DateType|TimestampType).FromInternal - fixes
 #527

---
 src/csharp/Microsoft.Spark/Sql/Types/SimpleTypes.cs | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/csharp/Microsoft.Spark/Sql/Types/SimpleTypes.cs b/src/csharp/Microsoft.Spark/Sql/Types/SimpleTypes.cs
index 7b9bd7a..c26a407 100644
--- a/src/csharp/Microsoft.Spark/Sql/Types/SimpleTypes.cs
+++ b/src/csharp/Microsoft.Spark/Sql/Types/SimpleTypes.cs
@@ -81,6 +81,12 @@ namespace Microsoft.Spark.Sql.Types
         /// </summary>
         internal override object FromInternal(object obj)
         {
+            //null pass-through
+            if (obj == null)
+            {
+                return null;
+            }
+
             return new Date(new DateTime((int)obj * TimeSpan.TicksPerDay + s_unixTimeEpoch.Ticks));
         }
     }
@@ -101,6 +107,12 @@ namespace Microsoft.Spark.Sql.Types
         /// </summary>
         internal override object FromInternal(object obj)
         {
+            //null pass-through
+            if (obj == null)
+            {
+                return null;
+            }
+
             // Known issue that if the original type is "long" and its value can be fit into the
             // "int", Pickler will serialize the value as int.
             if (obj is long val)
-- 
2.25.1

imback82 commented 4 years ago

Thanks @borgdylan. We are also tracking this issue here: #436

imback82 commented 4 years ago

@borgdylan Since you already have a fix, do you want to create a PR (with proper tests, etc.)? Otherwise, I can take care of it. Thanks.
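For illustration, a sketch of the kind of regression test such a PR could include (assuming the test project can call the internal FromInternal method, as the repo's test assemblies do via InternalsVisibleTo; class and method names are hypothetical):

using Microsoft.Spark.Sql.Types;
using Xunit;

public class NullPassThroughTests
{
    [Fact]
    public void FromInternal_PassesNullThrough()
    {
        // Before the patch both calls threw a NullReferenceException;
        // after it, a null internal value maps to a null external value.
        Assert.Null(new TimestampType().FromInternal(null));
        Assert.Null(new DateType().FromInternal(null));
    }
}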

borgdylan commented 4 years ago

I do not have enough time to go through the whole process as I have work to carry out. Feel free to use my patch though.

borgdylan commented 4 years ago

My patch can be applied with git am < file.patch, where file.patch is the file containing the patch. This applies a commit equivalent to the one in my local copy. Do note that you will need a shell that supports the < redirection operator, such as bash or cmd; PowerShell does not yet support it.
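For reference, the apply step as commands (the patch file name is a placeholder; git am also accepts the patch file as a direct argument, which sidesteps the redirection operator):

# bash or cmd: feed the patch via stdin
git am < 527-null-checks.patch

# any shell, PowerShell included: pass the file as an argument
git am 527-null-checks.patch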

borgdylan commented 4 years ago

ping

imback82 commented 4 years ago

@elvaliuliuliu do you want to work on this?

elvaliuliuliu commented 4 years ago

> @elvaliuliuliu do you want to work on this?

Sure, I will work on this.

borgdylan commented 4 years ago

Thanks a million!