Parquet: Data Factory is suffixing output files with .parquet causing subsequent steps to fail

MicrosoftDocs / azure-docs

Parquet: Data Factory is suffixing output files with .parquet causing subsequent steps to fail #38633

Closed km26577 closed 5 years ago

km26577 commented 5 years ago

Parquet: Data Factory is suffixing output files with .parquet causing subsequent steps to fail

MartinJaffer-MSFT commented 5 years ago

Hello km26577 and thank you for your inquiry. Which document is this in reference to?

km26577 commented 5 years ago

Parquet: Data Factory is suffixing output files with .parquet causing subsequent steps to fail

Scenario: we have data files with a .data extension in the data lake (ADLA), ingested from on-premises using Data Factory. A copy activity takes each data file and converts it into a parquet file with no change to the file name. These output files are used by an ADLA U-SQL script, which expects the same name as the ingested file. This was working fine until last week. Now the parquet conversion step is appending .parquet to the output files, causing the ADLA U-SQL script to fail.

KranthiPakala-MSFT commented 5 years ago

@km26577 Sorry you are experiencing this. This is due to a recent Copy activity behavior change made by the ADF product team. Previously, a Copy to parquet format generated the file name as ".csv", which is not the correct behavior. The enhancement released last week is intended to fix this. Generating a ".parquet" file for parquet format should be the right behavior.

Please refer to the MSDN thread below, where another user reported the same issue and fixed it by updating their pipelines to reflect the latest Copy behavior for file formats. https://social.msdn.microsoft.com/Forums/en-US/fa7df897-4303-475c-8449-97d7cfbac9e1/copy-data-activity-suddenly-changes-file-extension?forum=AzureDataFactory
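For readers whose downstream step still expects the original names, one possible stopgap is to compute the expected name from each new output name and rename accordingly. The following is only a minimal sketch: the helper names `expected_name` and `rename_plan` are hypothetical and not part of Data Factory, and it assumes the suffix is appended to the existing name (for example `customer.data` becoming `customer.data.parquet`), which this thread does not confirm in detail. The actual rename in the lake would still have to be done with whatever storage tooling is in use.

```python
def expected_name(output_name: str, suffix: str = ".parquet") -> str:
    """Return the name a downstream script expects.

    Assumption: the copy step appended `suffix` to the original name.
    Names without the suffix, or consisting only of the suffix, are
    returned unchanged.
    """
    if output_name.endswith(suffix) and len(output_name) > len(suffix):
        return output_name[: -len(suffix)]
    return output_name


def rename_plan(output_names):
    """Map each produced name to its expected name, skipping unchanged ones."""
    plan = {}
    for name in output_names:
        target = expected_name(name)
        if target != name:
            plan[name] = target
    return plan


if __name__ == "__main__":
    # Hypothetical file names, used purely for illustration.
    print(rename_plan(["customer.data.parquet", "orders.data"]))
```

The plan only lists files that actually need renaming, so it can be run repeatedly without touching files that already have the expected name.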

Apologies for the inconvenience.

km26577 commented 5 years ago

@KranthiPakala-MSFT I don't think it should add any suffix to the output files. Adding a suffix doesn't make it a parquet or csv file. If that's the case, when we ingest data into the lake using a copy activity with text format and pipe (|) as the delimiter, what should the suffix be? Is this suffix consistent across all file formats (text, orc, parquet, json, avro)?

In the above case, no suffix is added for us, and we configure the file to have the suffix .data (e.g. customer.data) for a text file with a pipe delimiter. I would expect the same when using a copy activity to transform the text file to parquet: no suffix should be added.

KranthiPakala-MSFT commented 5 years ago

@km26577 Thanks for your response. I would suggest sharing your feedback in the ADF user voice forum. All feedback shared in the user voice forum is monitored by the ADF engineering team, who will take appropriate action.

As this topic is related to a specific scenario and not directly related to any document content, we request that you email us at AzCommunity@Microsoft.com to continue this discussion.

Please send the details below:
Subject of your email: <Azure Data Factory: Parquet - Data Factory is suffixing output files with .parquet causing subsequent steps to fail>
Subscription ID:
GitHub Thread link: https://github.com/MicrosoftDocs/azure-docs/issues/38633

If you are following a document and think it needs an update, please let us know about that here.

We will now proceed to close this thread.