I've run into this same problem. If I use a dataset with the following definition, the JSON file that is written is in UTF-8 with BOM encoding, which makes it incompatible with data flows.
{
    "name": "Raw_Deals_json",
    "properties": {
        "description": "This is the full export of the Deals w/o any additional processing",
        "linkedServiceName": {
            "referenceName": "datawarehouse_int",
            "type": "LinkedServiceReference"
        },
        "folder": {
            "name": "DataLake"
        },
        "annotations": [],
        "type": "Json",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileName": "deals.json",
                "folderPath": "staged",
                "fileSystem": "int"
            }
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}
I ran into the same issue. You should be able to specify both versions (UTF-8 without BOM and UTF-8 with BOM), or at least not write a BOM by default...
Facing the same issue in an ADF JSON blob dataset. Is there a workaround to get rid of the BOM while writing a JSON dataset to Blob? I am using the default (UTF-8) encoding.
I was facing the same issue. It got resolved when I set the Encoding to US-ASCII while defining the dataset.
Actually, this is really not cool; that is only a workaround. The trouble is that the Azure Data Factory JSON components fail because Spark cannot handle the byte order mark.
This is a known issue and we have a work item to track it; we will update here once it's resolved.
For now, if you choose the "SetOfObjects" pattern in the copy sink, the generated JSON can be consumed by Spark and gives better performance than the "ArrayOfObjects" pattern.
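For reference, a minimal sketch of roughly where that setting lives in the Copy activity's sink JSON, assuming the JsonSink / JsonWriteSettings / AzureBlobFSWriteSettings names (the "filePattern" value is the point here; the surrounding structure is illustrative):

"sink": {
    "type": "JsonSink",
    "storeSettings": {
        "type": "AzureBlobFSWriteSettings"
    },
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "setOfObjects"
    }
}

With "setOfObjects", each row is written as one JSON document per line, which is what Spark-based data flows read most efficiently.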
@shawnxzq: Even after changing to the "SetOfObjects" pattern in the copy sink, ADF Data Flow is still not able to read the UTF-8 JSON file. It throws the error below.
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('' (code 65279 / 0xfeff)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.InputStreamReader@372ac23c; line: 1, column: 2]
@vignz Please make sure "Document per line" is selected as the document form in the data flow JSON settings.
@fhljys was this issue fixed?
I think the current status is that @shawnxzq provided a workaround.
@fbdntm As this is a known issue and we have a work item to track it, are you OK to close the issue here? Thanks!
I think it would be better to keep it open and actually update it when your work item is done and a solution is available to us.
The workaround that was suggested did not work. Whether you select SetOfObjects or ArrayOfObjects for your sink, the resulting file still has a BOM. {"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Source 'SinkJSON': org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, 10.139.64.11, executor 0): org.apache.spark.SparkException: Malformed records are detected in schema inference. Parse Mode: FAILFAST.\n\tat
@rocque57 You are right that JSON files with UTF-8 encoding generated by copy will always have a BOM. However, if you set the pattern to "SetOfObjects" in the copy sink and set the document form to "Document per line" in the data flow source, you won't get the error mentioned above, and this is also the recommended setting from a performance perspective.
@rocque57 As discussed, we have backlog items to track this; no clear ETA at the moment, thanks!
@rocque57 This is still in the backlog as a new ask, thanks for your patience!
@rocque57 Any Update on this??
FWIW: I had the same error with Azure Data Factory reading a path full of JSON files written by PowerShell's usual approach, like so:
$res = Invoke-RestMethod ...
$res | ConvertTo-Json | Out-File .../some.json
As per above, trying to open those JSON files in Azure Data Factory Data Flow (using either JSON reading option, Document per line or Array of documents, which it is) produced the FAILFAST malformed-record BOM error.
To Out-File, I added -Encoding utf8NoBOM. Azure Data Factory now reads the JSON files as expected.
Latest update: this is included in our action plan now; the ETA for the fix is 1~2 months from now, thanks!
Most of the work is done; it is pending deployment now.
@shawnxzq what will be the end result? Must we update our datasets, pipelines, etc.? Or will the new default be without BOM? Thanks
@shawnxzq is there a tentative deployment date?
@mguard11 It will be fully available by end of next month.
For now, you can use it by manually editing the encoding of the json dataset to "UTF-8 without BOM".
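In case it helps anyone find the right spot: a sketch of how that manual edit would look in the dataset JSON from the top of this thread, assuming the encodingName property sits in typeProperties next to location (the value string is exactly the one quoted above):

"typeProperties": {
    "location": {
        "type": "AzureBlobFSLocation",
        "fileName": "deals.json",
        "folderPath": "staged",
        "fileSystem": "int"
    },
    "encodingName": "UTF-8 without BOM"
}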
Manually setting the encoding to "UTF-8 without BOM" did solve this issue for me.
Thanks @Eniemela for confirming. We will have full UX support for this as well; it will come next month.
Hi @shawnxzq, is this being deployed next week with full UX support?
Thanks Danesh
I'm setting the default file encoding to UTF-8 in my sink settings as shown in the screenshot, but the output file still has UTF-8 BOM encoding. Is there a way to get a quick fix for this?
@mguard11, currently the feature is not supported in our UI; please follow @shawnxzq's suggestion and manually edit the encoding of the JSON dataset to "UTF-8 without BOM". When the feature is available in our UI, you will see it as an encoding option in the dataset settings.
@yew I set it manually and it's throwing an error.
@mguard11 did you use an Azure IR or a SHIR? If the latter, please make sure the version is above 5.10.7918.2. You can download the latest version here.
@mguard11, in our telemetry I can see your SHIR version is 4.15.7611.1; please upgrade it to use the new feature.
Closing it since the issue is already fixed.
This "fix" apparently breaks the functionality again in Synapse! Synapse UI now enables selecting UTF-8 without BOM -but it fails to validate!
@Eniemela the feature is only supported in the copy activity. @shawnxzq, is there a way to write a JSON file without a BOM in data flow?
We use a JSON dataset as the sink for a Copy activity (which stores JSON with a BOM if Default (UTF-8) is selected) and use the same dataset as a source in a mapping data flow, which breaks if Default (UTF-8) is selected since it expects no BOM. This worked for us when we could manually enter "UTF-8 without BOM" in the dataset, but it broke again when this UI support came out, since the UI now prevents using this encoding in a dataset that is also used in a mapping data flow. We currently use ISO-8859-1 as a workaround for the dataset, but it is not an optimal solution.
@Eniemela, oh, I see. Please kindly create two datasets to avoid the issue: in the copy sink, select "UTF-8 without BOM", and in the data flow source just use "Default" or select "UTF-8".
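To make the two-dataset workaround concrete, here is a sketch of the only part that would need to differ between the two datasets, assuming both otherwise copy the definition at the top of the thread (the dataset names here are illustrative):

{
    "name": "Deals_json_copy_sink",
    "properties": {
        "type": "Json",
        "typeProperties": {
            "location": { "type": "AzureBlobFSLocation", "fileName": "deals.json", "folderPath": "staged", "fileSystem": "int" },
            "encodingName": "UTF-8 without BOM"
        }
    }
}

{
    "name": "Deals_json_dataflow_source",
    "properties": {
        "type": "Json",
        "typeProperties": {
            "location": { "type": "AzureBlobFSLocation", "fileName": "deals.json", "folderPath": "staged", "fileSystem": "int" },
            "encodingName": "UTF-8"
        }
    }
}

The copy activity sink points at the first dataset, the data flow source at the second; leaving encodingName out of the second entirely is equivalent to "Default".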
No offence, but that is a ridiculous solution. My ADF is config driven from the database, I'm expected to create a whole new dataset just to use UTF-8 without BOM? What's the point of being able to make the encoding dynamic if it doesn't work? Even worse, it used to work and now it doesn't. Not sure how we can rely on this crap in a production environment.
Is there a fix for this? It shouldn't be this hard to generate a UTF-8 CSV without a BOM. I'm using the ADF Copy Data activity; I select "UTF-8 without BOM" from the encoding type dropdown, and when I run it I get an error saying that it isn't a valid encoding type. Ridiculous!
Why has this been closed when it clearly hasn't been fixed?
@chr0m "UTF-8 without BOM" is supported in ADF copy data activity. Are you using self hosted IR? If yes, please making sure its version is above 5.12.7976.1.
@chr0m "UTF-8 without BOM" is supported in ADF copy data activity. Are you using self hosted IR? If yes, please making sure its version is above 5.12.7976.1.
I have since discovered that it's the Get Metadata activity throwing the error when the dataset is set to UTF-8 without BOM.
@chr0m Many thanks for your feedback. I can repro the error; we will fix it and update here.
Please kindly create 2 datasets to avoid the issue: in copy sink, select "UTF-8 without BOM", and in dataflow source just use "Default" or select "UTF-8".
@yew is this going to be fixed? It's a bit strange that we must create two duplicate datasets... We work a lot with JSON datasets, and for us this is a real issue.
Since this issue is closed, will there be a new one, or will this one be reopened?
@yew any update on this (closed) issue? Has a new issue been created?
I'm facing similar issues when using the copy activity with 'UTF-8 without BOM' and trying to pick up that file in a Data Flow source using the same dataset.
It makes no sense to double up this dataset.
EDIT: I tried doubling up the datasets: it did not work. Still receiving the "Malformed records are detected in schema inference. Parse Mode: FAILFAST" error. Sink dataset: UTF-8 without BOM. Source dataset: UTF-8.
Thanks!
Hello @yew @shawnxzq @fhljys
I think this should be re-opened as well. The dual-dataset workaround is not ideal. Like many above me, I am trying to use a Data Flow after a Copy activity that contains a JSON sink dataset. At the time of writing, using an Azure IR, I tried various settings, and here are their outcomes:
one dataset using ISO-8859-1 encoding: Job failed due to reason: at Source 'source1': requirement failed: The lineSep option must be specified for the ISO-8859-1 encoding
one dataset using US-ASCII encoding: Job failed due to reason: at Source 'source1': requirement failed: The lineSep option must be specified for the US-ASCII encoding
Set of objects / Document per line and two datasets (UTF-8 without BOM, UTF-8 default): Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST
Set of objects / Document per line and one dataset (UTF-8 default): Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST
Array of objects / Array of documents and two datasets (UTF-8 without BOM, UTF-8 default) --> WORKS
Array of objects / Array of documents and one dataset (UTF-8 default): Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST
So, as you can see, the only combination that worked for me was using the two JSON datasets (the copy activity sink dataset set to UTF-8 without BOM and the Data Flow source dataset set to Default (UTF-8)), setting the copy sink file pattern to Array of objects, and setting Data Flow source options --> JSON settings --> Document form: Array of documents.
Thank you.
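For anyone reproducing the combination that worked above, a rough sketch of the copy sink side, assuming the same JsonSink / JsonWriteSettings names as in the earlier sketch (the data flow side is just Source options --> JSON settings --> Document form: Array of documents, plus the two datasets described above):

"sink": {
    "type": "JsonSink",
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "arrayOfObjects"
    }
}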
Confirmed that @zsponaugle's approach also worked here (note that it is now November 2022 and the issue still exists).
The US-ASCII encoding workaround works for me. (July 2023: the issue is very much alive.)
This is also pretty frustrating in data flows. I suspect we have some older files that had the BOM and some new ones that don't, so the data flow just fails with FAILFAST, but it doesn't even point you to the file or the error message.
It's a little bit surprising that Data Factory can't really process the files it creates by default.
Confirmed that @zsponaugle's approach also worked here (note that it is now November 2022 and the issue still exists).
Can you please explain this method more in depth?
I can see it is still an issue.
When using Data Factory V2 with an output dataset that is a JSON on an Azure Storage Blob V2 and a blank encodingName, the blob is encoded in UTF-8 with a BOM at the beginning, which is not conventional for UTF-8 and is not consistent with the output of other Azure services. For instance, when an Azure Function output binding writes to a blob without specifying the encoding, it generates a UTF-8 blob without a BOM.
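For comparison, a sketch of the dataset shape that reproduces this, assuming encodingName is the relevant typeProperties field and using illustrative location values for a Blob Storage dataset:

"typeProperties": {
    "location": {
        "type": "AzureBlobStorageLocation",
        "container": "output",
        "folderPath": "raw",
        "fileName": "export.json"
    },
    "encodingName": ""
}

Leaving encodingName blank (or omitting it) gives the Default (UTF-8) behaviour described above, i.e. UTF-8 with a BOM; per the earlier comments, setting it explicitly to "UTF-8 without BOM" in the copy sink dataset avoids the BOM.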