Azure / Azure-DataFactory

Azure Data Factory generates UTF-8 blob with a BOM #91

Closed fbdntm closed 2 years ago

fbdntm commented 5 years ago

When using Data Factory V2 with a Json output dataset on an Azure Storage Blob V2 and a blank encodingName, the blob is written as UTF-8 with a byte order mark (the EF BB BF byte sequence) at the beginning. This is not conventional for UTF-8 and is not consistent with the output of other Azure services. For instance, when an Azure Function uses a blob output binding without specifying an encoding, it generates a UTF-8 blob without a BOM.

jonarmstrong commented 4 years ago

I've run into this same problem. If I use a dataset with the following definition, the JSON file that is written is in UTF-8-with-BOM encoding, which makes it incompatible with data flows.

{
    "name": "Raw_Deals_json",
    "properties": {
        "description": "This is the full export of the Deals w/o any additional processing",
        "linkedServiceName": {
            "referenceName": "datawarehouse_int",
            "type": "LinkedServiceReference"
        },
        "folder": {
            "name": "DataLake"
        },
        "annotations": [],
        "type": "Json",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileName": "deals.json",
                "folderPath": "staged",
                "fileSystem": "int"
            }
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}

hmayer1980 commented 3 years ago

I ran into the same issue. You should be able to specify both versions, UTF-8 without BOM and UTF-8 with BOM, or at least not write a BOM by default at all...

jenilChristo commented 3 years ago

Facing the same issue with an ADF JSON blob dataset. Is there a workaround to get rid of the BOM while writing a JSON dataset to Blob? I am using the default (UTF-8) encoding.

rahash123 commented 3 years ago

I was facing the same issue. It got resolved when I set the encoding to US-ASCII while defining the dataset. [screenshot]

jashwood commented 3 years ago

Actually, this is really not cool; this is only a workaround. The trouble is that the Azure Data Factory JSON components fail because Spark cannot handle the byte order mark.

shawnxzq commented 3 years ago

This is a known issue and we have a work item to track it; we will update here once it's resolved.

For now, if you choose the "SetOfObjects" pattern in the copy sink, the generated JSON can be consumed in Spark and has better performance than the "ArrayOfObjects" pattern.
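For illustration, a minimal sketch of what that copy sink setting could look like in the activity JSON (the JsonSink / JsonWriteSettings shape and the AzureBlobFSWriteSettings store type are assumptions based on the documented copy activity settings, not something posted in this thread; adjust storeSettings to your actual sink store):

"sink": {
    "type": "JsonSink",
    "storeSettings": {
        "type": "AzureBlobFSWriteSettings"
    },
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "setOfObjects"
    }
}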

vignz commented 3 years ago

@shawnxzq: Even after changing to the "SetOfObjects" pattern in the copy sink, ADF Data Flow is still not able to read the UTF-8 JSON file. It's throwing the below error.

Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('' (code 65279 / 0xfeff)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.InputStreamReader@372ac23c; line: 1, column: 2]

shawnxzq commented 3 years ago

@vignz Please make sure "Document per line" is selected as the document form in the data flow JSON settings.

valfirst commented 3 years ago

@fhljys was this issue fixed?

fhljys commented 3 years ago

I think the current status is that @shawnxzq provided a workaround.

shawnxzq commented 3 years ago

@fbdntm As this is a known issue and we have a work item to track it, are you OK to close the issue here? Thanks!

hmayer1980 commented 3 years ago

I think it would be better to keep it open and actually update it when your work item is done and a solution is available to us.

rocque57 commented 3 years ago

The workaround that was suggested did not work. Whether you select SetOfObjects or ArrayOfObjects for your sink, the resulting file still has a BOM.

{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Source 'SinkJSON': org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, 10.139.64.11, executor 0): org.apache.spark.SparkException: Malformed records are detected in schema inference. Parse Mode: FAILFAST.\n\tat

shawnxzq commented 3 years ago

@rocque57 You are right that JSON files with UTF-8 encoding generated by copy will always have a BOM. However, if you set the pattern to "SetOfObjects" in the copy sink and set the document form to "Document per line" in the data flow source, you won't get the error mentioned above, and this is also the recommended setting from a performance perspective.

shawnxzq commented 3 years ago

@rocque57 As discussed, we have backlog items to track this; no clear ETA at the moment, thanks!

shawnxzq commented 3 years ago

@rocque57 This is still in the backlog as a new ask; thanks for your patience!

ma185360 commented 3 years ago

@rocque57 Any Update on this??

evogelpohl commented 3 years ago

FWIW: I had the same error with Azure Data Factory reading a path full of JSON files written by PowerShell, à la: $res = Invoke-RestMethod ... ; $res | ConvertTo-Json | Out-File .../some.json

As per the above, trying to open those JSON files in Azure Data Factory Data Flow (using any JSON reading option, Document per line or Array of documents, which it is) produced the FAILFAST malformed-record error caused by the BOM.

To Out-File, I added -Encoding utf8NoBOM. Azure Data Factory now reads the JSON files as expected.

shawnxzq commented 3 years ago

Latest update: this is included in our action plan now; the ETA for the fix is 1-2 months from now, thanks!

shawnxzq commented 3 years ago

Most of the work is done; pending deployment now.

LockTar commented 3 years ago

@shawnxzq what will be the end result? Must we update our datasets, pipelines etc? Or will the new default be without BOM? Thanks

mguard11 commented 3 years ago

@shawnxzq is there a tentative deployment date?

shawnxzq commented 3 years ago

@mguard11 It will be fully available by the end of next month.

For now, you can use it by manually editing the encoding of the JSON dataset to "UTF-8 without BOM".
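For reference, a trimmed sketch of that manual edit on the dataset JSON shown earlier in this thread (the encodingName property under typeProperties is the assumption here; the linked service and other properties are omitted):

{
    "name": "Raw_Deals_json",
    "properties": {
        "type": "Json",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileName": "deals.json",
                "folderPath": "staged",
                "fileSystem": "int"
            },
            "encodingName": "UTF-8 without BOM"
        }
    }
}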

Eniemela commented 3 years ago

Manually setting the encoding to "UTF-8 without BOM" solved this issue for me.

shawnxzq commented 3 years ago

Thanks @Eniemela for confirming. We will have full UX support for this as well; it will come next month.

mguard11 commented 3 years ago

Hi @shawnxzq, is this being deployed next week with full UX support?

Thanks, Danesh

mguard11 commented 3 years ago

I'm setting the default file encoding to UTF-8 in my sink settings as shown in the screenshots, but the output file still has UTF-8 BOM encoding. Is there a way to get a quick fix for this?

[screenshots]

yew commented 3 years ago

@mguard11, currently the feature is not supported in our UI; please follow @shawnxzq's suggestion and manually edit the encoding of the JSON dataset to "UTF-8 without BOM" [screenshot]. When the feature is available in our UI, you will see it like this: [screenshot]

mguard11 commented 3 years ago

@yew I set it manually and it's throwing the below error:

[screenshot of the error]

yew commented 3 years ago

@mguard11 Did you use the Azure IR or a SHIR? If the latter, please make sure the version is above 5.10.7918.2. You can download the latest version here.

yew commented 3 years ago

@mguard11, in our telemetry I can see your SHIR version is 4.15.7611.1; please upgrade it to use the new feature.

shawnxzq commented 2 years ago

Closing this since the issue is already fixed.

Eniemela commented 2 years ago

This "fix" apparently breaks the functionality again in Synapse! Synapse UI now enables selecting UTF-8 without BOM -but it fails to validate! image

image

yew commented 2 years ago

@Eniemela the feature is only supported in the copy activity. @shawnxzq, is there a way to write a JSON file without a BOM in data flow?

Eniemela commented 2 years ago

We use a JSON dataset as the sink for a Copy activity (which stores JSON with a BOM if Default (UTF-8) is selected) and use the same dataset as the source in a mapping data flow, which breaks if Default (UTF-8) is selected since it expects no BOM. This worked for us when we could manually enter "UTF-8 without BOM" in the dataset, but it broke again when this UI support came out, since it now prevents using this encoding in a dataset that is used in a mapping data flow. We currently use ISO-8859-1 as a workaround for the dataset, but it is not an optimal solution.

yew commented 2 years ago

@Eniemela, oh, I see. Please kindly create 2 datasets to avoid the issue: in the copy sink, select "UTF-8 without BOM", and in the data flow source just use "Default" or select "UTF-8".
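Concretely, a hypothetical sketch of that two-dataset setup (the dataset names are invented; both point at the same file and differ only in encoding — the copy sink dataset writes without a BOM, while the data flow source dataset keeps the default):

{
    "name": "Deals_json_copy_sink",
    "properties": {
        "type": "Json",
        "typeProperties": {
            "location": { "type": "AzureBlobFSLocation", "fileName": "deals.json", "folderPath": "staged", "fileSystem": "int" },
            "encodingName": "UTF-8 without BOM"
        }
    }
}

{
    "name": "Deals_json_dataflow_source",
    "properties": {
        "type": "Json",
        "typeProperties": {
            "location": { "type": "AzureBlobFSLocation", "fileName": "deals.json", "folderPath": "staged", "fileSystem": "int" }
        }
    }
}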

chr0m commented 2 years ago

No offence, but that is a ridiculous solution. My ADF is config driven from the database, I'm expected to create a whole new dataset just to use UTF-8 without BOM? What's the point of being able to make the encoding dynamic if it doesn't work? Even worse, it used to work and now it doesn't. Not sure how we can rely on this crap in a production environment.

chr0m commented 2 years ago

Is there a fix for this? It shouldn't be this hard to generate a UTF-8 CSV without BOM. I'm using ADF Copy Data activity, I select "UTF-8 without BOM" from the encoding type dropdown and when I run I get an error saying that it isn't a valid encoding type. Ridiculous!

Why has this been closed when it clearly hasn't been fixed?

yew commented 2 years ago

@chr0m "UTF-8 without BOM" is supported in ADF copy data activity. Are you using self hosted IR? If yes, please making sure its version is above 5.12.7976.1.

chr0m commented 2 years ago

@chr0m "UTF-8 without BOM" is supported in ADF copy data activity. Are you using self hosted IR? If yes, please making sure its version is above 5.12.7976.1.

I since discovered that it's the Get MetaData activity throwing the error when the dataset is set to UTF-8 without BOM

yew commented 2 years ago

@chr0m Many thanks for your feedback, I can repro the error, we will fix it and update here.

LockTar commented 2 years ago

Please kindly create 2 datasets to avoid the issue: in copy sink, select "UTF-8 without BOM", and in dataflow source just use "Default" or select "UTF-8".

@yew is this going to be fixed? It's a bit strange that we must create 2 duplicate datasets... We work a lot with JSON datasets, so this is a real issue for us.

Since this issue is closed: will there be a new one, or will this one be reopened?

bramvanhautte commented 2 years ago

@yew any update on this (closed) issue? Is there a new issue created?

I'm facing similar issues when using the copy activity with 'UTF-8 without BOM' and trying to pick up that file in a Data Flow source using the same dataset.

It makes no sense to double up this dataset...

EDIT: tried doubling up the datasets; it did not work. Still receiving the "Malformed records are detected in schema inference. Parse Mode: FAILFAST" error. Sink dataset: UTF-8 without BOM; source dataset: UTF-8.

Thanks!

zsponaugle commented 2 years ago

Hello @yew @shawnxzq @fhljys

I think this should be re-opened as well. The dual-dataset workaround is not ideal. Like many above me, I am trying to use a Data Flow after a Copy activity that uses a JSON sink dataset. At the time of this writing, using the Azure IR, I tried various settings; here are their outcomes:

one dataset using ISO-8859-1 encoding: Job failed due to reason: at Source 'source1': requirement failed: The lineSep option must be specified for the ISO-8859-1 encoding

one dataset using US-ASCII encoding: Job failed due to reason: at Source 'source1': requirement failed: The lineSep option must be specified for the US-ASCII encoding

Set of objects / Document per line and two datasets (UTF-8 without BOM, UTF-8 default): Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST

Set of objects / Document per line and one dataset (UTF-8 default): Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST

Array of objects / Array of documents and two datasets (UTF-8 without BOM, UTF-8 default) --> WORKS

Array of objects / Array of documents and one dataset (UTF-8 default): Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST

So, as you can see, the only combination that worked for me was using two JSON datasets (the copy activity sink dataset set to UTF-8 without BOM and the Data Flow source dataset set to Default (UTF-8)), setting the copy sink file pattern to Array of objects, and setting Data Flow source options --> JSON settings --> Document form: Array of documents.
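For anyone reproducing this, a sketch of the copy sink side of that working combination (same caveat as the earlier sketch: the JsonSink / JsonWriteSettings property names are assumptions based on the documented copy activity JSON settings, not taken from this thread; the Data Flow document form is set in the source's JSON settings UI, not in the dataset):

"sink": {
    "type": "JsonSink",
    "storeSettings": {
        "type": "AzureBlobFSWriteSettings"
    },
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "arrayOfObjects"
    }
}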

Thank you.

Krumelur commented 1 year ago

Confirmed that @zsponaugle's approach also worked here (note that it's now November 2022 and the issue still exists).

SNY019 commented 1 year ago

The US-ASCII encoding workaround works for me. (July 2023: the issue is very much alive.)

bt701 commented 5 months ago

This is also pretty frustrating from data flows. I suspect we have some older files that have the BOM and some new ones that don't, so the data flow just fails fast ('FAILFAST') but doesn't even point you to the offending file or error message.

It's a little surprising that Data Factory can't really process the files it creates by default.

niksolteq commented 3 months ago

Confirmed that @zsponaugle approach also worked here (note that it's meanwhile November 2022 and the issue still exists)

Can you please explain this method in more depth?

I can see it is still an issue.