Azure / spark-cdm-connector


Cannot read column with two different date formats #122

Open RithwikChhugani opened 1 year ago

RithwikChhugani commented 1 year ago

Hi there.

Is there a way to avoid reading the datatype information for columns from model.json?

I have a single column that contains values in two different date formats, which causes an issue while reading. I am planning to convert it to string for the time being.

kecheung commented 1 year ago

Currently there is no way.

Please check #57. There is a performance impact when multiple formats map to a single CDM DateTime type (a many-to-one relationship). I recommend that the app writing the CDM DateTime use a single format, or write to a separate table.

RithwikChhugani commented 1 year ago

@kecheung Thanks for your reply.

The format of the date changes when the isDelete flag of the record turns to True. The original format is 2022-12-09T13:01:00.0000000+00:00 and the changed format is 2022-12-10T13:01:00.0000000.

When reading this column using the Spark CDM connector, the data type inferred is "created : timestamp (nullable = true)", and the data type in the model.json is listed as "{"name":"createdon","dataType":"dateTimeOffset","maxLength":-1}".

Is there any way to handle this problem?

Thanks, Rithwik.
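
A possible interim workaround, assuming the createdon value can first be obtained as a string (for example by reading the underlying CSV directly rather than through the connector), is to try both observed patterns and keep whichever one parses. The following PySpark snippet is only an illustrative sketch, not connector behaviour, and the patterns may need adjusting for your Spark version:

```python
# Illustrative only: parse a string column that mixes offset-bearing and
# offset-less ISO-8601 timestamps by trying both patterns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2022-12-09T13:01:00.0000000+00:00",), ("2022-12-10T13:01:00.0000000",)],
    ["createdon_raw"],
)

parsed = df.withColumn(
    "createdon",
    F.coalesce(
        # to_timestamp returns null when a pattern does not match (with ANSI mode
        # off), so coalesce keeps whichever pattern succeeds.
        F.to_timestamp("createdon_raw", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSSXXX"),
        F.to_timestamp("createdon_raw", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS"),
    ),
)
parsed.show(truncate=False)
```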

matthewmillsArinco commented 1 year ago

To confirm: this appears to be default Synapse Link behaviour for append-mode entities.

The CreatedOn column is populated with a timestamp plus timezone offset. If that record is deleted, CreatedOn is populated without the timezone offset.

Are there any other options to handle this? (e.g. can we use the configPath to load a modified model.json where the column has a string datatype?)
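
For reference, producing a modified copy of model.json with the datetime attributes downgraded to string is straightforward; whether the connector will honour such a copy (for example via configPath) is exactly the open question above. A hedged Python sketch, with illustrative file paths:

```python
# Hypothetical helper: copy a model.json and downgrade dateTime/dateTimeOffset
# attributes to string so that a reader honouring the modified file would not
# try to parse the mixed-format values. File paths are illustrative.
import json

def downgrade_datetime_attributes(src_path, dst_path):
    with open(src_path) as f:
        model = json.load(f)
    for entity in model.get("entities", []):
        for attribute in entity.get("attributes", []):
            if attribute.get("dataType") in ("dateTime", "dateTimeOffset"):
                attribute["dataType"] = "string"
    with open(dst_path, "w") as f:
        json.dump(model, f, indent=2)

downgrade_datetime_attributes("model.json", "model_strings.json")
```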

kecheung commented 1 year ago

Can you raise the issue with the Synapse Link team about how they write the csv files? Why are they using multiple formats? This presents a scalability issue.

Just because CDM allows a many-to-one relationship doesn't mean you should use one. This cannot scale, because every format needs to be checked and, likewise, the code needs to be written to support all the formats.

The few solutions are:

  1. Write in parquet format. The formats are standardized, so we will not have this issue.
  2. Make the problematic column use a single format.
  3. Use this CDM connector version: https://github.com/Azure/spark-cdm-connector/releases/tag/v1.19.2. The performance of reading CSVs will be impacted because it cycles through all the formats. Any fixes that are worked on will go into the latest branch only (meaning we are insistent on having 1 column = 1 format).

matthewmillsArinco commented 1 year ago

Thanks for the quick response.

> Can you raise the issue with the Synapse Link team about how they write the csv files? Why are they using multiple formats? This presents a scalability issue.

This was raised previously but has not been resolved yet; based on this, we'll keep chasing them.

> Just because CDM allows a many-to-one relationship doesn't mean you should use one. This cannot scale, because every format needs to be checked and, likewise, the code needs to be written to support all the formats.

Yes, I agree in principle; however, we don't have control over the prioritisation of this fix, and as a result we are blocked.

I ran into a similar issue with the CDM connector in data mapping flows last year, where lat/long fields were exported with 8 decimal places by Synapse Link, marked as Decimal in the model.json, and read in with only 2 decimal places. We had to invest a tonne of time with the Data Mapping Flow/Dataverse teams to resolve it; however, if there had been an option to load as string, it would have been mitigated.

We primarily use the connector as an ETL extractor to push to Delta, so having the ability to read every column from the model.json as string would let us get a working solution in place while the root cause is resolved at the source. It would also work as a catch-all fallback in the event that any provider service creates this situation in the future.
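
As a sketch of that fallback (assuming direct access to the underlying Synapse Link CSV partitions; the storage path, column list, and CSV options below are illustrative assumptions, not connector features):

```python
# Hedged sketch of the "read every column as string" fallback: bypass the CDM
# connector and read the Synapse Link CSV partition files directly with an
# all-string schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Column names would normally be taken from the entity's attributes in model.json.
columns = ["Id", "createdon", "IsDelete"]  # illustrative subset
all_string_schema = StructType([StructField(c, StringType(), True) for c in columns])

raw = (
    spark.read
    .schema(all_string_schema)   # every column lands as string
    .option("header", "false")   # Synapse Link CSVs typically carry no header row
    .option("multiLine", "true")
    .csv("abfss://<container>@<account>.dfs.core.windows.net/<entity>/*.csv")  # hypothetical path
)
```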

> The few solutions are:
>
> 1. Write in parquet format. The formats are standardized, so we will not have this issue.

Unable to do this until the feature is released by SynapseLink product team.

> 2. Make the problematic column use a single format.

Unable to do this until the issue is fixed by SynapseLink support team.

> 3. Use this CDM connector version: https://github.com/Azure/spark-cdm-connector/releases/tag/v1.19.2. The performance of reading CSVs will be impacted because it cycles through all the formats. Any fixes that are worked on will go into the latest branch only (meaning we are insistent on having 1 column = 1 format).

It looks like this is the version currently installed on the Synapse Spark 3.2 cluster, so no good, I'm afraid.

kecheung commented 1 year ago

@matthewmillsArinco

  1. You can use Spark 3.1 in Synapse with the library management feature to load the custom library v1.19.2. https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-pool-packages

  2. You can also try library management with Spark 3.2. CDM Connector version v1.19.2 might work, but some dependencies reference Spark 3.1 since this version was released for Spark 3.1.

@RithwikChhugani

> The format of the date changes when the isDelete flag of the record turns to True. The original format is 2022-12-09T13:01:00.0000000+00:00 and the changed format is 2022-12-10T13:01:00.0000000.

The end result is still the same no matter what: the writer app is forcing multiple formats into a single type in a single column. With millions of rows, the time cost of checking each format adds up, and we have had complaints about performance before. That is why I came up with a caching method. The same mitigations apply here and in that post.
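
Roughly, the caching idea is to remember which pattern last matched for a column and try it first, only rescanning the full pattern list when the format changes. Below is a simplified Python sketch of the concept (second-precision patterns only, since Python's strptime accepts at most six fractional digits), not the connector's actual implementation:

```python
# Simplified illustration of per-column format caching (not the connector's code):
# try the format that worked last time first, and only fall back to scanning the
# full list when a value arrives in a different format.
from datetime import datetime

FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S"]  # illustrative patterns
_last_hit = {}  # column name -> index of the format that last succeeded

def parse_timestamp(column, value):
    start = _last_hit.get(column, 0)
    order = [start] + [i for i in range(len(FORMATS)) if i != start]
    for i in order:
        try:
            parsed = datetime.strptime(value, FORMATS[i])
            _last_hit[column] = i  # cache the winner for the next value
            return parsed
        except ValueError:
            continue
    raise ValueError(f"no known format matches {value!r}")

print(parse_timestamp("createdon", "2022-12-09T13:01:00+00:00"))
print(parse_timestamp("createdon", "2022-12-10T13:01:00"))
```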