Azure / spark-cdm-connector


The dataframe schema and cdm schema don't have an equal number of fields. Entity Path: "Contact" #121

Closed: BenConstable9 closed this issue 1 year ago

BenConstable9 commented 1 year ago

I'm having an issue using the Spark CDM connector in Azure Synapse.

I have the following DataFrame:

root
 |-- contactId: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- KeyName: string (nullable = true)
 |-- genderCode: string (nullable = true)
 |-- salutation: string (nullable = true)
 |-- EMailAddress1: string (nullable = true)

I am trying to write into the Contact model and generate a new manifest file in the following step:

renamed_df.write.format("com.microsoft.cdm")\
 .option("storage", "stodevcdm.dfs.core.windows.net")\
 .option("manifestPath", "cdm/Contacts/root.manifest.cdm.json")\
 .option("entity", "Contact")\
 .option("useCdmStandardModelRoot", True)\
 .option("useSubManifest", True)\
 .option("entityDefinitionPath", "core/applicationCommon/foundationCommon/crmCommon/accelerators/nonProfit/nonProfitCore/Contact.cdm.json/Contact")\
 .option("format", "parquet")\
 .save()

But this gives the error:

java.lang.Exception: The dataframe schema and cdm schema don't have an equal number of fields. Entity Path: "Contact"

I've searched all the documentation and issues on GitHub, but I can't find an option to write only a subset of the fields.

Any help would be appreciated.

kecheung commented 1 year ago

I believe the message is indicative of your issue. The manifest file is just metadata describing where the data files should be located.

Your DataFrame has 6 columns, and I suspect your Contact schema does not. Can you check how many columns are defined in your Contact.cdm.json?
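
For reference, one rough way to count them is to parse the entity definition directly. This is only a sketch: it assumes a local copy of Contact.cdm.json and counts only top-level hasAttributes entries, so attributes pulled in through attribute groups or an extended entity are missed.

import json

# Sketch: count the top-level attributes a CDM entity declares.
# Attributes composed via attribute groups or an extended entity are
# not expanded here, so this is a lower bound on the resolved count.
with open("Contact.cdm.json") as f:
    doc = json.load(f)

for definition in doc.get("definitions", []):
    if definition.get("entityName") == "Contact":
        print(len(definition.get("hasAttributes", [])), "top-level attribute entries")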

BenConstable9 commented 1 year ago

The Contact schema is the one from the nonprofit accelerator on GitHub. It has 200-plus columns, but I only want to provide data for the six columns in my DataFrame.

Is there a way to do this without creating a custom schema?

kecheung commented 1 year ago

You cannot do that. Specifying entityDefinitionPath declares that the Contact CDM schema has 200+ columns, and the input data must match it exactly. See the explicit write options: https://github.com/Azure/spark-cdm-connector/blob/spark3.2/documentation/overview.md#explicit-write-options

You can have the connector create your own Contact schema by doing something like this:

renamed_df.write.format("com.microsoft.cdm")\
 .option("storage", "stodevcdm.dfs.core.windows.net")\
 .option("manifestPath", "cdm/Contacts/root.manifest.cdm.json")\
 .option("entity", "Contact")\
 .option("format", "parquet")\
 .save()
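
(For context: dropping entityDefinitionPath makes this an implicit write, where the connector creates the entity definition from the DataFrame schema itself, so the six columns in renamed_df become the Contact schema. See the implicit write section of the same overview document.)
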
BenConstable9 commented 1 year ago

> You cannot do that. Specifying entityDefinitionPath declares that the Contact CDM schema has 200+ columns, and the input data must match it exactly. [...]

Thanks for the response. Is there a way to just pass null values instead?

It is my understanding that when using the connector in Azure Data Factory, you can send null values for the missing columns, so I am trying to replicate that behavior.

kecheung commented 1 year ago

We explicitly compare the DataFrame schema against the CDM schema to determine whether the dimensions match, and there is currently no feature to fill the remaining columns with nulls. If you have no use for the other columns, I would recommend not having them in the first place. https://github.com/Azure/spark-cdm-connector/blob/8cea3ed563dcca2d565e92c7594cdf31b48de148/src/main/scala/com/microsoft/cdm/write/CDMBatchWriter.scala#L150-L154
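
For readers who still want the explicit write against the standard entity, one possible manual workaround (not a connector feature, and not endorsed in this thread) is to pad the DataFrame with null columns until it matches the CDM schema. The helper below is a hypothetical sketch: pad_to_cdm_schema and cdm_columns are names introduced here for illustration, and building the (name, type) list from the resolved Contact entity is left to the reader.

from pyspark.sql import functions as F

# Hypothetical helper, not a connector feature: add a null column for
# every CDM attribute the DataFrame lacks, then reorder the columns to
# the CDM declaration order. cdm_columns is a list of
# (name, pyspark DataType) pairs in CDM declaration order.
def pad_to_cdm_schema(df, cdm_columns):
    for name, dtype in cdm_columns:
        if name not in df.columns:
            df = df.withColumn(name, F.lit(None).cast(dtype))
    # Reorder so the column order matches the CDM declaration order.
    return df.select([name for name, _ in cdm_columns])

Note that the field count is only the check surfaced in this error; each padded column's type must still line up with the corresponding CDM attribute's data type for the write to succeed.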