Open bamurtaugh opened 5 years ago

I'm running a .NET for Apache Spark app for batch processing over a selection of GitHub projects data. My program runs as expected up through making a Spark SQL call. After the SQL query, I perform a Filter() and a Drop() for some final processing. After I Show() each of these modified DataFrames, I get the error:

ProcessStream() failed with exception: System.ArgumentException: The stream is closed.

However, even with that error, both calls to Show() do execute successfully and I see the correct output. Why might I be receiving this error (twice)? Since I get the correct output, the error doesn't seem to be affecting anything.
@bamurtaugh Thank you for reporting! Could you share your complete app, the spark-submit command you are using, and the full C# code?
My app is called mySparkAppAdv. The spark-submit command I'm using is:
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin\Debug\netcoreapp2.2\microsoft-spark-2.4.x-0.4.0.jar dotnet bin\Debug\netcoreapp2.2\mySparkAppAdv.dll
The app itself is:
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace mySparkAppAdv
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession.Builder().AppName("GitHub and Spark Batch").GetOrCreate();

            DataFrame projectsDf = spark.Read()
                .Schema("id INT, url STRING, owner_id INT, name STRING, descriptor STRING, language STRING, created_at STRING, forked_from INT, deleted STRING, updated_at STRING")
                .Csv("projects_smaller.csv");
            projectsDf.Show();

            // Drop any rows with NA values
            DataFrameNaFunctions dropEmptyProjects = projectsDf.Na();
            DataFrame cleanedProjects = dropEmptyProjects.Drop("any");

            // Remove unnecessary columns
            cleanedProjects = cleanedProjects.Drop("id", "url", "owner_id");
            cleanedProjects.Show();

            // Get the average number of times each language has been forked
            DataFrame groupedDF = cleanedProjects.GroupBy("language").Agg(Avg(cleanedProjects["forked_from"]));

            // Sort by most forked languages first
            groupedDF.OrderBy(Desc("avg(forked_from)")).Show();

            // Find projects updated since 10/20/15
            spark.Udf().Register<string, bool>("MyUDF", (date) => DateTest(date));
            cleanedProjects.CreateOrReplaceTempView("dateView");
            DataFrame dateDf = spark.Sql("SELECT *, MyUDF(dateView.updated_at) AS datebefore FROM dateView");

            // Only keep more recently updated projects
            DataFrame filteredDates = dateDf.Filter(dateDf["datebefore"] == true);
            filteredDates.Show();

            spark.Stop();
        }

        public static bool DateTest(string date)
        {
            // Remove invalid dates to avoid: System.FormatException: String '0000-00-00 00:00:00' was not recognized as a valid DateTime
            if (date.Equals("0000-00-00 00:00:00"))
                return false;

            DateTime convertedDate = Convert.ToDateTime(date);
            // 10/20/2015
            DateTime referenceDate = new DateTime(2015, 10, 20);

            // > 0 means convertedDate (input from file) is later than 10/20/15
            return DateTime.Compare(convertedDate, referenceDate) > 0;
        }
    }
}
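(As an aside, the recent-date filter can also be written without a UDF, which is useful when trying to isolate UDF-related worker errors like the one below. A minimal sketch, assuming the updated_at strings follow the M/d/yyyy H:mm pattern visible in the output, using only built-in functions:)

// Hypothetical alternative, not part of the original app: parse updated_at
// on the JVM side and compare against the 10/20/2015 cutoff directly.
// Unparseable values such as "0000-00-00 00:00:00" become null and are
// filtered out, mirroring DateTest returning false.
DataFrame recent = cleanedProjects.Filter(
    ToTimestamp(cleanedProjects["updated_at"], "M/d/yyyy H:mm")
        .Gt(Lit("2015-10-20").Cast("timestamp")));
recent.Show();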
After most of my program executes and produces output as expected, I get the following output after showing the grouped DataFrame (please note that the DataFrame appears correctly in my output console, just not when I pasted it into this comment):
[2019-09-27T22:29:30.3677119Z] [MININT-M0C5UII] [Debug] [ConfigurationService] Using the environment variable to construct .NET worker path: C:\bin\Microsoft.Spark.Worker-0.4.0\Microsoft.Spark.Worker.exe.
DotnetWorker PID:[18248] Args:[-m pyspark.worker] SparkVersion:[2.4.1]
[2019-09-27T22:29:31.5888846Z] [MININT-M0C5UII] [Info] [SimpleWorker] RunSimpleWorker() is starting with port = 53308.
[2019-09-27T22:29:31.6375263Z] [MININT-M0C5UII] [Info] [TaskRunner] [0] Starting with ReuseSocket[False].
DotnetWorker PID:[12344] Args:[-m pyspark.worker] SparkVersion:[2.4.1]
[2019-09-27T22:29:31.8081895Z] [MININT-M0C5UII] [Info] [SimpleWorker] RunSimpleWorker() is starting with port = 53310.
[2019-09-27T22:29:31.8605830Z] [MININT-M0C5UII] [Info] [TaskRunner] [0] Starting with ReuseSocket[False].
+--------------------+--------------------+----------+----------------+-----------+-------+----------------+----------+
| name| descriptor| language| created_at|forked_from|deleted| updated_at|datebefore|
+--------------------+--------------------+----------+----------------+-----------+-------+----------------+----------+
| ruote-kit|RESTful wrapper f...| Ruby| 12/8/2009 10:17| 2| 0| 11/5/2015 1:15| true|
| cocos2d-x|Port of cocos2d-i...| C++| 3/12/2012 16:48| 6| 0|10/22/2015 17:36| true|
| cocos2d-x|Port of cocos2d-i...| C| 4/23/2012 10:20| 6| 0| 11/1/2015 17:32| true|
| rake-compiler|Provide a standar...| Ruby| 8/1/2012 18:33| 14556189| 0| 11/3/2015 19:30| true|
| cobertura-plugin|Jenkins cobertura...| Java| 7/26/2012 18:46| 193522| 0| 11/1/2015 19:55| true|
| scala-vs-erlang|A performance tes...| Erlang|12/25/2011 13:51| 1262879| 0| 10/22/2015 4:50| true|
| opencv|OpenCV GitHub Mirror| C++| 8/2/2012 12:50| 29| 0| 10/26/2015 6:44| true|
| redmine_git_hosting|A ChiliProject/Re...| Ruby| 7/30/2012 12:53| 42| 0|10/28/2015 10:54| true|
| redmine_git_hosting|A ChiliProject/Re...| Ruby|10/26/2011 23:17| 207450| 0|10/27/2015 22:43| true|
| protobuf-cmake|CMake build suppo...| \N| 8/2/2012 14:35| 61| 0| 10/31/2015 1:22| true|
|willdurand.github...| My new blog!|JavaScript| 8/2/2012 12:06| 84| 0| 11/4/2015 9:15| true|
| libuv|platform layer fo...| C| 8/2/2012 12:57| 74| 0| 10/31/2015 8:21| true|
| cucumber-js|Pure Javascript C...|JavaScript| 6/21/2012 11:47| 96| 0| 11/4/2015 21:42| true|
| i18n-js|It's a small libr...|JavaScript| 8/2/2012 12:58| 92| 0|10/29/2015 13:19| true|
| libuv|platform layer fo...| C| 7/9/2012 12:56| 74| 0|10/31/2015 20:53| true|
| angular-seed|Seed project for ...|JavaScript| 5/19/2012 6:30| 100| 0| 11/2/2015 14:21| true|
| addon-sdk|The Add-on SDK re...| Python|11/29/2010 13:17| 127| 0|10/31/2015 20:10| true|
| xinput_calibrator|A generic touchsc...| C++| 8/2/2012 12:59| 128| 0| 11/1/2015 14:27| true|
| addon-sdk|The Add-on SDK re...|JavaScript| 4/7/2011 17:58| 127| 0| 11/6/2015 11:03| true|
| ntpl| ntpl| Python| 8/1/2012 16:50| 134| 0|10/25/2015 16:03| true|
+--------------------+--------------------+----------+----------------+-----------+-------+----------------+----------+
only showing top 20 rows
[2019-09-27T22:29:34.7433042Z] [MININT-M0C5UII] [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.ArgumentException: The stream is closed.
at Microsoft.Spark.Utils.PythonSerDe.GetUnpickledObjects(Stream stream, Int32 messageLength) in /_/src/csharp/Microsoft.Spark/Utils/PythonSerDe.cs:line 49
at Microsoft.Spark.Worker.Command.PicklingSqlCommandExecutor.ExecuteCore(Stream inputStream, Stream outputStream, SqlCommand[] commands) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 133
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 153
[2019-09-27T22:29:34.7434683Z] [MININT-M0C5UII] [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.ArgumentException: The stream is closed.
at Microsoft.Spark.Utils.PythonSerDe.GetUnpickledObjects(Stream stream, Int32 messageLength) in /_/src/csharp/Microsoft.Spark/Utils/PythonSerDe.cs:line 49
at Microsoft.Spark.Worker.Command.PicklingSqlCommandExecutor.ExecuteCore(Stream inputStream, Stream outputStream, SqlCommand[] commands) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 133
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 153
[2019-09-27T22:29:34.7438227Z] [MININT-M0C5UII] [Error] [TaskRunner] [0] Exiting with exception: System.ArgumentException: The stream is closed.
at Microsoft.Spark.Utils.PythonSerDe.GetUnpickledObjects(Stream stream, Int32 messageLength) in /_/src/csharp/Microsoft.Spark/Utils/PythonSerDe.cs:line 49
at Microsoft.Spark.Worker.Command.PicklingSqlCommandExecutor.ExecuteCore(Stream inputStream, Stream outputStream, SqlCommand[] commands) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 133
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 153
at Microsoft.Spark.Worker.TaskRunner.Run() in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 65
[2019-09-27T22:29:34.7447989Z] [MININT-M0C5UII] [Info] [TaskRunner] [0] Finished running 0 task(s).
[2019-09-27T22:29:34.7448496Z] [MININT-M0C5UII] [Info] [SimpleWorker] RunSimpleWorker() finished successfully
[2019-09-27T22:29:34.7436884Z] [MININT-M0C5UII] [Error] [TaskRunner] [0] Exiting with exception: System.ArgumentException: The stream is closed.
at Microsoft.Spark.Utils.PythonSerDe.GetUnpickledObjects(Stream stream, Int32 messageLength) in /_/src/csharp/Microsoft.Spark/Utils/PythonSerDe.cs:line 49
at Microsoft.Spark.Worker.Command.PicklingSqlCommandExecutor.ExecuteCore(Stream inputStream, Stream outputStream, SqlCommand[] commands) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 133
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 153
at Microsoft.Spark.Worker.TaskRunner.Run() in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 65
[2019-09-27T22:29:34.7447995Z] [MININT-M0C5UII] [Info] [TaskRunner] [0] Finished running 0 task(s).
[2019-09-27T22:29:34.7448511Z] [MININT-M0C5UII] [Info] [SimpleWorker] RunSimpleWorker() finished successfully
From what I can see, the error message shows up only when you run filteredDates.Show();
if you comment that line out, there will be no error.
That's interesting to observe. Do you have any idea why just showing this DataFrame in the same SparkSession would cause this error? I wonder if it could have something to do with the stream closing/getting confused after the SQL call concludes.
Not sure, but I don't think Show would cause this error. It could be a UDF-related issue.
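A minimal repro sketch (hypothetical, not from the original app) could help isolate that: register a trivial pass-through UDF and force evaluation with Show(), with no SQL view or filtering involved.

using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace UdfRepro
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession.Builder().AppName("udf-repro").GetOrCreate();

            // Range() yields a single LongType "id" column to run the UDF over.
            DataFrame df = spark.Range(0, 10);

            // Trivial UDF: if the worker error is UDF-related, even this
            // should surface it once Show() forces execution on the worker.
            Func<Column, Column> plusOne = Udf<long, long>(x => x + 1);
            df.Select(plusOne(df["id"])).Show();

            spark.Stop();
        }
    }
}

If this clean run still prints the TaskRunner error after the correct output, the problem would seem to sit in the worker teardown path rather than in any particular UDF logic.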
Hi,
I'm hitting a similar issue with .NET for Apache Spark 1.0, but only when I'm using UDF functions; the Show method still shows the correct results.
Is this something I should worry about, or is it just some cleanup exception? The log shows that "RunSimpleWorker() finished successfully" and also "DotnetRunner: .NET application exited successfully".
Environment: Java 8, Spark 2.4.1 or Spark 3.0, .NET Core 3.1, Microsoft.Spark.Worker 1.0.0.
The UDF functions are:

Func<Column, Column> toTimeSpan = Functions.Udf<string, double>(str =>
{
    // Parse a time-of-day string into total seconds; 0 on failure.
    if (TimeSpan.TryParse(str, out TimeSpan ts))
    {
        return ts.TotalSeconds;
    }
    return 0;
});

Func<Column, Column, Column> toDuration = Functions.Udf<string, string, double>((str1, str2) =>
{
    // Duration in seconds between two parsable time strings; 0 on failure.
    if (TimeSpan.TryParse(str1, out TimeSpan ts1) && TimeSpan.TryParse(str2, out TimeSpan ts2))
    {
        return ts2.TotalSeconds - ts1.TotalSeconds;
    }
    return 0;
});
...
DataFrame df1 = dataFrame
    .WithColumn("Start", toTimeSpan(dataFrame.Col("SessionStartTime")))
    .WithColumn("End", toTimeSpan(dataFrame.Col("SessionEndTime")))
    .WithColumn("Duration", toDuration(Functions.Col("SessionStartTime"), Functions.Col("SessionEndTime")));
[truncated Show() output]
only showing top 20 rows
[2020-11-13T09:03:06.7403736Z] [N-20N3PF1C0NK5] [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.IO.IOException: Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..
---> System.Net.Sockets.SocketException (10053): An established connection was aborted by the software in your host machine.
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
--- End of inner exception stack trace ---
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
at System.IO.BufferedStream.Write(Byte[] array, Int32 offset, Int32 count)
at Microsoft.Spark.Interop.Ipc.SerDe.Write(Stream s, Byte[] value, Int32 count) in /_/src/csharp/Microsoft.Spark/Interop/Ipc/SerDe.cs:line 283
at Microsoft.Spark.Worker.Command.PicklingSqlCommandExecutor.WriteOutput(Stream stream, IEnumerable`1 rows, Int32 sizeHint) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 191
at Microsoft.Spark.Worker.Command.PicklingSqlCommandExecutor.ExecuteCore(Stream inputStream, Stream outputStream, SqlCommand[] commands) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 158
at Microsoft.Spark.Worker.Command.SqlCommandExecutor.Execute(Version version, Stream inputStream, Stream outputStream, PythonEvalType evalType, SqlCommand[] commands) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 76
at Microsoft.Spark.Worker.Command.CommandExecutor.Execute(Stream inputStream, Stream outputStream, Int32 splitIndex, CommandPayload commandPayload) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\CommandExecutor.cs:line 65
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 154
[2020-11-13T09:03:06.7410655Z] [N-20N3PF1C0NK5] [Error] [TaskRunner] [0] Exiting with exception: System.IO.IOException: Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..
20/11/13 11:03:06 INFO SparkUI: Stopped Spark web UI at http://N-20N3PF1C0NK5.nsn-intra.net:4040
---> System.Net.Sockets.SocketException (10053): An established connection was aborted by the software in your host machine.
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
--- End of inner exception stack trace ---
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
at System.IO.BufferedStream.Write(Byte[] array, Int32 offset, Int32 count)
at Microsoft.Spark.Interop.Ipc.SerDe.Write(Stream s, Byte[] value, Int32 count) in /_/src/csharp/Microsoft.Spark/Interop/Ipc/SerDe.cs:line 283
at Microsoft.Spark.Worker.Command.PicklingSqlCommandExecutor.WriteOutput(Stream stream, IEnumerable`1 rows, Int32 sizeHint) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 191
at Microsoft.Spark.Worker.Command.PicklingSqlCommandExecutor.ExecuteCore(Stream inputStream, Stream outputStream, SqlCommand[] commands) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 158
at Microsoft.Spark.Worker.Command.SqlCommandExecutor.Execute(Version version, Stream inputStream, Stream outputStream, PythonEvalType evalType, SqlCommand[] commands) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 76
at Microsoft.Spark.Worker.Command.CommandExecutor.Execute(Stream inputStream, Stream outputStream, Int32 splitIndex, CommandPayload commandPayload) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\CommandExecutor.cs:line 65
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 154
at Microsoft.Spark.Worker.TaskRunner.Run() in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 66
[2020-11-13T09:03:06.7448127Z] [N-20N3PF1C0NK5] [Warn] [TaskRunner] [0] Exception while closing socket: System.IO.IOException: Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..
---> System.Net.Sockets.SocketException (10053): An established connection was aborted by the software in your host machine.
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
--- End of inner exception stack trace ---
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
at System.IO.BufferedStream.Flush()
at System.IO.BufferedStream.Dispose(Boolean disposing)
at System.IO.Stream.Close()
at System.IO.Stream.Dispose()
at Microsoft.Spark.Network.DefaultSocketWrapper.Dispose() in /_/src/csharp/Microsoft.Spark/Network/DefaultSocketWrapper.cs:line 56
at Microsoft.Spark.Worker.TaskRunner.Run() in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 115
[2020-11-13T09:03:06.7449393Z] [N-20N3PF1C0NK5] [Info] [TaskRunner] [0] Finished running 0 task(s).
[2020-11-13T09:03:06.7449660Z] [N-20N3PF1C0NK5] [Info] [SimpleWorker] RunSimpleWorker() finished successfully
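For reference, the duration computation above can also be expressed with built-in functions only, which keeps the work on the JVM side and skips the .NET worker round-trip that the UDFs require. A minimal sketch, assuming SessionStartTime and SessionEndTime hold HH:mm:ss-style strings:

// Hypothetical alternative to the toTimeSpan/toDuration UDFs above.
DataFrame df2 = dataFrame
    .WithColumn("Start", Functions.UnixTimestamp(dataFrame.Col("SessionStartTime"), "HH:mm:ss"))
    .WithColumn("End", Functions.UnixTimestamp(dataFrame.Col("SessionEndTime"), "HH:mm:ss"))
    // Subtracting unix timestamps yields the duration in seconds.
    .WithColumn("Duration", Functions.Col("End") - Functions.Col("Start"));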