MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure

Copy JSON Array data from REST data factory to Azure Blob as is #36219

Closed: tonyStann closed this issue 5 years ago

tonyStann commented 5 years ago

I used the REST connector to get data from an API; the JSON output contains arrays. When I try to copy the JSON as-is to Blob with a copy activity, only the first object's data comes through and the rest is ignored.

The documentation says we can copy JSON as-is by skipping the schema section on both the dataset and the copy activity. I did exactly that, but I get the output below.

Sample REST output:

    {
      "totalPages": 500,
      "firstPage": true,
      "lastPage": false,
      "numberOfElements": 50,
      "number": 0,
      "totalElements": 636,
      "columns": {
        "dimension": { "id": "variables/page", "type": "string" },
        "columnIds": [ "0" ]
      },
      "rows": [
        { "itemId": "1234", "value": "home", "data": [ 65 ] },
        { "itemId": "1235", "value": "category", "data": [ 92 ] }
      ],
      "summaryData": {
        "totals": [ 157 ],
        "col-max": [ 123 ],
        "col-min": [ 1 ]
      }
    }

Blob output (as text), which contains only the first object's data:

    totalPages,firstPage,lastPage,numberOfElements,number,totalElements
    500,True,False,50,0,636
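
To be explicit about what skipping the schema means here: my source dataset has no schema/structure section at all. A minimal sketch of such a dataset (the linked service reference and relative URL are placeholders, not my real values):

    {
      "name": "ZZ_CopySource",
      "properties": {
        "type": "RestResource",
        "linkedServiceName": {
          "referenceName": "<REST linked service>",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "relativeUrl": "<relative URL of the report endpoint>"
        }
      }
    }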



KranthiPakala-MSFT commented 5 years ago

@tonyStann Thank you for reaching out and bringing this to our notice. We are reviewing the feedback and will provide an update as appropriate.

KranthiPakala-MSFT commented 5 years ago

@tonyStann Please have a look at the Stack Overflow thread below, which might be helpful.

Meanwhile we are following up with the internal team on this issue.

tonyStann commented 5 years ago

@KranthiPakala-MSFT Thanks for working on the issue.

  1. The answer in that Stack Overflow thread used a JSON file as the input, so the poster had JSON file-format settings to choose from. I am using REST as the source and Data Lake Storage Gen2 as the sink.
  2. The original question in that thread asked about schema mapping to the sink. This article, however, says to skip the structure (schema) in the dataset and the copy activity to export the JSON response as-is. I am following that, but the output does not match the input. Please let me know if you hear back from your team.

KranthiPakala-MSFT commented 5 years ago

@tonyStann Thanks for providing the additional detail. I will keep you posted once we hear back from the internal team.

tonyStann commented 5 years ago

The output I provided earlier is the actual REST API output. Below is the data from another POST request.

Output from the REST connector:

    [
      {
        "totalPages": 1,
        "firstPage": true,
        "lastPage": true,
        "numberOfElements": 12,
        "number": 0,
        "totalElements": 12,
        "columns": {
          "dimension": { "id": "variables/lasttouchchannel", "type": "string" },
          "columnIds": [ "0", "1", "2", "3" ]
        },
        "rows": [
          { "itemId": "4", "value": "Direct", "data": [ 1089061, 56798, 1350518.68, 27377 ] },
          { "itemId": "3", "value": "Organic Search", "data": [ 80182, 5331, 118122.95, 2268 ] },
          { "itemId": "1", "value": "Email", "data": [ 79867, 5071, 120105.02, 2416 ] },
          { "itemId": "2", "value": "Paid Search", "data": [ 69265, 5059, 106342.61, 2263 ] },
          { "itemId": "10", "value": "Comparison Shopping", "data": [ 64380, 3029, 76853.3, 1636 ] },
          { "itemId": "14", "value": "McKinney", "data": [ 51444, 522, 14353.07, 298 ] },
          { "itemId": "8", "value": "Affiliate", "data": [ 42952, 2643, 57588.46, 1093 ] },
          { "itemId": "5", "value": "Social Networks", "data": [ 35901, 1452, 38608.72, 874 ] },
          { "itemId": "13", "value": "Session Refresh", "data": [ 29848, 2494, 55323.89, 1078 ] },
          { "itemId": "9", "value": "Misc Paid", "data": [ 9731, 1099, 23938.600000000002, 387 ] },
          { "itemId": "7", "value": "Referring Domains", "data": [ 7465, 563, 15566.08, 175 ] },
          { "itemId": "6", "value": "Banner Ads", "data": [ 3603, 82, 2249.55, 50 ] }
        ],
        "summaryData": {
          "totals": [ 1558570, 84143, 1979570.93, 39915 ],
          "col-max": [ 1089061, 56798, 1350518.68, 27377 ],
          "col-min": [ 3603, 82, 2249.55, 50 ]
        }
      }
    ]

Output in Data Lake Storage Gen2 without a schema ("First row as header" not selected):

    1,True,True,12,0,12

With "First row as header" selected:

    totalPages,firstPage,lastPage,numberOfElements,number,totalElements
    1,True,True,12,0,12

I am not using a schema in either the dataset or the copy activity. I am trying to store the response as-is in the data lake. Your internal team is welcome to contact me for additional information or to look at my activity in Data Factory. Thanks for working on it.

-tony

On Wed, Jul 31, 2019 at 5:45 PM KranthiPakala-MSFT notifications@github.com wrote:

@tonyStann https://github.com/tonyStann Just wanted to confirm: is the sample REST output you provided the actual REST API output, or the output stored in Blob? Would it be possible to share both, i.e. the actual API output with the complete array and the incomplete output stored in Blob?


KranthiPakala-MSFT commented 5 years ago

@tonyStann Thanks for sharing the output. Would it be possible to share the pipeline JSON payload?

tonyStann commented 5 years ago

@KranthiPakala-MSFT Below is my pipeline JSON:

    {
      "name": "ZZ_Copy",
      "properties": {
        "description": "test for REST",
        "activities": [
          {
            "name": "ZZ_CopySource",
            "type": "Copy",
            "dependsOn": [],
            "policy": {
              "timeout": "7.00:00:00",
              "retry": 0,
              "retryIntervalInSeconds": 30,
              "secureOutput": false,
              "secureInput": false
            },
            "userProperties": [
              { "name": "Source", "value": "segments" },
              { "name": "Destination", "value": "tony/files" }
            ],
            "typeProperties": {
              "source": {
                "type": "RestSource",
                "httpRequestTimeout": "00:01:40",
                "requestInterval": "00.00:00:00.010"
              },
              "sink": {
                "type": "DelimitedTextSink",
                "storeSettings": { "type": "AzureBlobFSWriteSetting" },
                "formatSettings": {
                  "type": "DelimitedTextWriteSetting",
                  "quoteAllText": true,
                  "fileExtension": ""
                }
              },
              "enableStaging": false
            },
            "inputs": [
              { "referenceName": "ZZ_CopySource", "type": "DatasetReference" }
            ],
            "outputs": [
              { "referenceName": "ZZ_CopySink", "type": "DatasetReference" }
            ]
          }
        ],
        "annotations": []
      },
      "type": "Microsoft.DataFactory/factories/pipelines"
    }

KranthiPakala-MSFT commented 5 years ago

@tonyStann Thanks for sharing the details. We are looking into this and will get back as soon as we have an update.

KranthiPakala-MSFT commented 5 years ago

@tonyStann According to the pipeline payload you provided, you are trying to copy the JSON array response to Blob through a "DelimitedTextSink", presumably as a .csv file. When the sink is defined as a .csv file, the JSON is not copied as-is into the blob; the sink must be defined with the JSON file format in order to copy the JSON array response to the blob unchanged.
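
A rough sketch of that change (the linked service reference and file name are placeholders, and the exact type strings can vary by service version). The sink dataset uses the JSON format:

    {
      "name": "ZZ_CopySink",
      "properties": {
        "type": "Json",
        "linkedServiceName": {
          "referenceName": "<ADLS Gen2 linked service>",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "location": {
            "type": "AzureBlobFSLocation",
            "fileSystem": "tony",
            "folderPath": "files",
            "fileName": "<output file>.json"
          }
        }
      }
    }

And the copy activity sink replaces the DelimitedTextSink block:

    "sink": {
      "type": "JsonSink",
      "storeSettings": { "type": "AzureBlobFSWriteSettings" },
      "formatSettings": { "type": "JsonWriteSettings" }
    }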

Since CSV is a tabular format, converting to CSV in the sink requires schema mapping for JSON arrays.
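
If you do want CSV in the sink, the copy activity needs an explicit translator that says which collection to iterate and which properties become columns. A sketch against your payload, based on the documented TabularTranslator hierarchical-to-tabular mapping (the column names here are illustrative):

    "translator": {
      "type": "TabularTranslator",
      "collectionReference": "$['rows']",
      "mappings": [
        { "source": { "path": "['itemId']" }, "sink": { "name": "itemId" } },
        { "source": { "path": "['value']" }, "sink": { "name": "value" } },
        { "source": { "path": "$['totalPages']" }, "sink": { "name": "totalPages" } }
      ]
    }

Paths that start with $ are resolved from the document root and repeated on every output row, while paths without $ are resolved relative to the collectionReference, so this emits one CSV row per element of "rows".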

Hope this helps.

KranthiPakala-MSFT commented 5 years ago

@tonyStann It looks like the same question was posted on the MSDN and Stack Overflow forums, and another engineer is working with you on the issue.

In order to avoid duplicate work, we will now proceed to close this thread. If there are further questions regarding this matter, please feel free to tag me in your reply and reopen the issue. We will gladly continue the discussion.