elenacuoco closed this issue 8 years ago
Why not use the DataFrame API to read the data in your example? Have you tried these lines?
```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.executor.memory", "4G")
conf.set("spark.driver.memory", "2G")
conf.set("spark.executor.cores", "7")
conf.set("spark.python.worker.memory", "4G")
conf.set("spark.driver.maxResultSize", "0")
conf.set("spark.sql.crossJoin.enabled", "true")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.default.parallelism", "4")

spark = SparkSession \
    .builder.config(conf=conf) \
    .appName("test-spark").getOrCreate()

df = spark.read.csv("../input/train_numeric.csv",
                    header="true", inferSchema="true", mode="DROPMALFORMED")
```
This works as well, but the DataBricks CSV package lets you indicate which string stands for a null value; in this dataset, for example, nulls are denoted by -999. But anyway, you are right, you can do it like this. :)
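To make the null-handling point concrete without needing a Spark cluster, here is a plain-Python sketch of what a reader's null-sentinel option does: any cell equal to the sentinel string (here `-999`) is mapped to `None`. The sample column name and data are hypothetical, not taken from the actual `train_numeric.csv`.

```python
import csv
import io

def read_csv_with_null(text, null_value="-999"):
    """Parse CSV text, replacing the null sentinel string with None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {k: (None if v == null_value else v) for k, v in row.items()}
        for row in reader
    ]

# Hypothetical two-row sample where the second value is missing (-999).
sample = "Id,Feature0\n1,0.03\n2,-999\n"
rows = read_csv_with_null(sample)
```

With a Spark CSV reader the same effect is achieved declaratively via its null-value option instead of post-processing each row by hand.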