Azure-Samples / cdm-azure-data-services-integration

Tutorials and sample code for integrating CDM folders with Azure Data Services
MIT License

Unable to parse the date #19

Open gingergenius opened 4 years ago

gingergenius commented 4 years ago

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run read-write-demo-wide-world-importers.py

Any log messages given by the failure

Every step starting with display(salesBuyingGroupsDf) fails with date parsing error.

```
java.text.ParseException: Unable to parse the date: 01/01/2013 00:00:00
	at org.apache.commons.lang.time.DateUtils.parseDateWithLeniency(DateUtils.java:359)
	at org.apache.commons.lang.time.DateUtils.parseDate(DateUtils.java:285)
	at com.microsoft.cdm.utils.DataConverter$$anonfun$6.apply(DataConverter.scala:43)
	at com.microsoft.cdm.utils.DataConverter$$anonfun$6.apply(DataConverter.scala:43)
	at com.microsoft.cdm.read.CDMDataReader$$anonfun$1.apply(CDMDataReader.scala:54)
	at com.microsoft.cdm.read.CDMDataReader$$anonfun$1.apply(CDMDataReader.scala:48)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at com.microsoft.cdm.read.CDMDataReader.get(CDMDataReader.scala:48)
	at com.microsoft.cdm.read.CDMDataReader.get(CDMDataReader.scala:19)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.next(DataSourceRDD.scala:59)
	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
	at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:159)
	at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:158)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
	at org.apache.spark.scheduler.Task.run(Task.scala:112)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

Expected/desired behavior

The data are parsed correctly and the notebook runs without errors.

OS and Version?

Windows 10

Versions

Spark 2.4.3, Scala 2.11

Mention any other details that might be useful

The type of the columns causing the problem is DateTime in the CDM schema. When reading the data in Databricks, the columns are assigned the Date type (without the time part), even though Timestamp would be more appropriate.

I tried setting the schema on read myself (not allowed by the CDM connector). I tried setting the dateFormat and timestampFormat options on read (no effect). I also tried converting the columns to Timestamp or String. Essentially anything I try to do with the dataframe results in the same error.
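The stack trace is consistent with a date-only pattern being applied to a full datetime string: commons-lang's `DateUtils.parseDate` rejects input that leaves trailing characters unconsumed. A minimal Python sketch of the same mismatch (the format strings here are illustrative assumptions, not taken from the connector's source):

```python
from datetime import datetime

value = "01/01/2013 00:00:00"  # the literal value from the stack trace

# A date-only pattern fails because the time portion is left unparsed,
# analogous to DateUtils.parseDate rejecting input with leftover text.
try:
    datetime.strptime(value, "%m/%d/%Y")
except ValueError as e:
    print(f"date-only pattern fails: {e}")

# A pattern covering the full datetime parses cleanly.
parsed = datetime.strptime(value, "%m/%d/%Y %H:%M:%S")
print(parsed.isoformat())  # 2013-01-01T00:00:00
```

This suggests the connector maps the CDM DateTime attribute to a date-only parse pattern, which then chokes on values that carry a time component.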

gingergenius commented 4 years ago

The problem persists even after I drop ValidFrom and ValidTo from the salesBuyingGroupsDf dataframe. When I display() or collect() it, the same error appears even though the date columns are no longer there!

Praveen-jsr commented 4 years ago

Facing the same issue with the Timestamp data type. I raised the error in the spark-cdm issue tracker: https://github.com/Azure/spark-cdm/issues

alibouhaddou commented 4 years ago

I've reproduced the same issue: the CDM library reads fields that are defined with a datetime format in the model.json as plain dates. The Spark reader then tries to parse those fields with the wrong format, which causes the error. Is there a fix for this, please?

datalord123 commented 4 years ago

Did anyone ever figure out a fix for this? @alibouhaddou