Cinchoo / ChoETL

ETL framework for .NET (Parser / Writer for CSV, Flat, Xml, JSON, Key-Value, Parquet, Yaml, Avro formatted files)
MIT License
794 stars 134 forks source link

Flatten Complex Json structure #191

Open CreaterCalvin opened 2 years ago

CreaterCalvin commented 2 years ago

I have these line of code:

using (var r = new ChoJSONReader("data.json")
                         .Configure(c => c.ThrowAndStopOnMissingField = true)
                         .Configure(c => c.DefaultArrayHandling = true)
                         .Configure(c => c.FlattenNode = true)
                         .Configure(c => c.IgnoreArrayIndex = false)
                         .Configure(c => c.NestedKeySeparator = '.')
                         .Configure(c => c.NestedColumnSeparator = '.'))
            {
                var dt = r.AsDataTable();
                Console.WriteLine(dt.DumpAsJson());
            }

and my json:

{
  "BrandId": "998877665544332211",
  "Categories": [
    "112233445566778899"
  ],
  "Contact": {
    "Phone": [
      {
        "Value": "12346789",
        "Description": {
          "vi": "Phone"
        },
        "Type": 1
      },
      {
        "Value": "987654321",
        "Description": {
          "vi": "Phone"
        },
        "Type": 1
      }
    ]
  }
}

My expected output:

[
  {
    "BrandId": "998877665544332211",
    "Contact.Phone.Value": "12346789",
    "Contact.Phone.Description.vi": "Phone",
    "Contact.Phone.Type": 1,
    "Category": "112233445566778899"
  },
  {
    "BrandId": "998877665544332211",
    "Contact.Phone.Value": "987654321",
    "Contact.Phone.Description.vi": "Phone",
    "Contact.Phone.Type": 1,
    "Category": "112233445566778899"
  }
]

After running code, I got the output like this:

[
  {
    "BrandId": "998877665544332211",
    "Contact.Phone.0.Value": "12346789",
    "Contact.Phone.0.Description.vi": "Phone",
    "Contact.Phone.0.Type": 1,
    "Contact.Phone.1.Value": "987654321",
    "Contact.Phone.1.Description.vi": "Phone",
    "Contact.Phone.1.Type": 1,
    "Category0": "112233445566778899"
  }
]

Thanks for your interest in the question Looking forward to hearing from you @Cinchoo

Cinchoo commented 2 years ago

As your json is nested in nature, you need to unpack and flatten into multiple rows using linq.

Sample fiddle: https://dotnetfiddle.net/PHK8LO

CreaterCalvin commented 2 years ago

@Cinchoo Is there any more generic way instead of using Linq. I would prefer it could transform via .Configure() with .WithField(), is it possible? Because I would like to make it as simple as possible, in order to ease for my more than 200 json schemas. Because of that reason, linq is seem to be not suitable for my problem image

Cinchoo commented 2 years ago

Add attribute to handle your req. Use FlattenByNodeName to control the flatten node. Hope it helps.

Lib version: ChoETL.JSON.Core v1.2.1.45-beta1

Sample fiddle: https://dotnetfiddle.net/ZWRd1R

CreaterCalvin commented 2 years ago

@Cinchoo Thanks a lot for your help. This lib could handle my prob smoothly.

CreaterCalvin commented 2 years ago

@Cinchoo Looking forward to hearing from you if your lib has new update & Looking forward to your help and support if I have new issue, too. Peace!!!

CreaterCalvin commented 2 years ago

@Cinchoo I have just discovered a new issue from your library When I'm testing some special case base on you package: https://dotnetfiddle.net/AdFCjU In this code, I'm flatten this json string:

@"{
  ""BrandId"": ""998877665544332211"",
  ""Categories"": [ ""112233445566778899"" ],
  ""Contact"": {
    ""Phone"": [
      {
        ""Value"": ""12346789"",
        ""Description"": { ""vi"": ""Phone"" },
        ""Type"": 1
      },
      {
        ""Value"": ""987654321"",
        ""Description"": { ""vi"": ""Phone"" },
        ""Type"": 1
      }
    ],
     ""Phone2"": [
      {
        ""Value"": ""999888777"",
        ""Description"": { ""vi"": ""Phone"" },
        ""Type"": 1
      }
    ]
  }
}";

It's return output:


[
  {
    "BrandId": "998877665544332211",
    "Categories": "112233445566778899",
    "Contact.Phone2.Value": "999888777",
    "Contact.Phone2.Description.vi": "Phone",
    "Contact.Phone2.Type": 1,
    "Contact.Phone.Value": "12346789",
    "Contact.Phone.Description.vi": "Phone",
    "Contact.Phone.Type": 1
  },
  {
    "BrandId": "998877665544332211",
    "Categories": "112233445566778899",
    "Contact.Phone2.Value": "999888777",
    "Contact.Phone2.Description.vi": "Phone",
    "Contact.Phone2.Type": 1,
    "Contact.Phone.Value": "987654321",
    "Contact.Phone.Description.vi": "Phone",
    "Contact.Phone.Type": 1
  }
]

Result above is exactly what I wanted

But if I change name of Phone2 into BlaBlaBlaPhone

@"{
  ""BrandId"": ""998877665544332211"",
  ""Categories"": [ ""112233445566778899"" ],
  ""Contact"": {
    ""Phone"": [
      {
        ""Value"": ""12346789"",
        ""Description"": { ""vi"": ""Phone"" },
        ""Type"": 1
      },
      {
        ""Value"": ""987654321"",
        ""Description"": { ""vi"": ""Phone"" },
        ""Type"": 1
      }
    ],
     ""BlablablaPhone"": [ 
      {
        ""Value"": ""999888777"",
        ""Description"": { ""vi"": ""Phone"" },
        ""Type"": 1
      }
    ]
  }
}"

It's return output without value in BlablablaPhone node:

[
  {
    "BrandId": "998877665544332211",
    "Categories": "112233445566778899",
    "Contact.Phone.Value": "12346789",
    "Contact.Phone.Description.vi": "Phone",
    "Contact.Phone.Type": 1
  },
  {
    "BrandId": "998877665544332211",
    "Categories": "112233445566778899",
    "Contact.Phone.Value": "987654321",
    "Contact.Phone.Description.vi": "Phone",
    "Contact.Phone.Type": 1
  },
  {
    "BrandId": "998877665544332211",
    "Categories": "112233445566778899",
    "Contact.Phone.Value": null,
    "Contact.Phone.Description.vi": null,
    "Contact.Phone.Type": null
  }
]

@Cinchoo

Cinchoo commented 2 years ago

I’m on vacation, will not be get back to you on time. Will update once back from it.

Cinchoo commented 2 years ago

It is a bug, putting a fix. Will let you know.

Cinchoo commented 2 years ago

Applied fix, released ChoETL.JSON.Core v1.2.1.45-beta2.

zhenyuan0502 commented 2 years ago

@Cinchoo Hi, I have just got the issue with NestedKeySeparator from 1.2.1.45.Beta-2 onward, Configure(c => c.NestedKeySeparator = '.') is works fine but Configure(c => c.NestedKeySeparator = '~') cannot be flattened as expected.

This bug still exists in 1.2.1.47

https://dotnetfiddle.net/327X8k

Cinchoo commented 2 years ago

found the issue, applied fix. released v1.2.1.48 (beta1)

Sample fiddle: https://dotnetfiddle.net/SDY3ql

zhenyuan0502 commented 2 years ago

@Cinchoo Hi, today I see the issue about the flatten feature, which they will be lost root information if the flatten node whom array

@"{
  ""BrandId"": ""998877665544332211"",
  ""Categories"": [ ""112233445566778899"" ],
  ""Contact"": [{
        ""Value"": ""12346789"",
        ""PostCode"":
        [
        {
        ""Area"":""SA"",
        ""Value"": ""+84""
        },
        {
        ""Area"":""SA"",
        ""Value"": ""+841111""
        }
        ],
        ""Description"": { ""vi"": ""Phone"" },
        ""Type"": 1
      },
      {
        ""Value"": ""987654321"",
        ""PostCode"":
        [
        {
        ""Area"":""SA"",
        ""Value"": ""+8412""
        }
        ],
        ""Description"": { ""vi"": ""Phone"" },
        ""Type"": 1
    }]
}";

I try both FlattenByNodeName and FlattenByJsonPath, they got different results but no one could fit the expected result like below json:

[
  {
    "BrandId": "998877665544332211",
    "Value": "12346789",
    "Description.vi": "Phone",
    "Type": 1,
    "PostCode~Area": "SA",
    "PostCode~Value": "+84"
  },
  {
    "BrandId": "998877665544332211",
    "Value": "12346789",
    "Description.vi": "Phone",
    "Type": 1,
    "PostCode~Area": "SA",
    "PostCode~Value": "+841111"
  },
  {
    "BrandId": "998877665544332211",
    "Value": "987654321",
    "Description.vi": "Phone",
    "Type": 1,
    "PostCode~Area": "SA",
    "PostCode~Value": "+8412"
  }
]

If I remove the PostCode, it worked fine normally. https://dotnetfiddle.net/ZqyiI2

I am very interesting in this feature and I could help to improve it because I am building a new and dynamic ETL project based on your library, could you push your latest code then I can take a look.