bazaarvoice / jolt

JSON to JSON transformation library written in Java.
Apache License 2.0

Remove duplicates or distinct Json files based on a field using Jolt transformation #286

Closed srinvias closed 5 years ago

srinvias commented 8 years ago

I am trying to remove duplicate JSON records from a JSON array using a Jolt transformation. Here is an example I tried. Input:

[
    {
        "id": 1,
        "name": "jeorge",
        "age": 25
    },
    {
        "id": 2,
        "name": "manhan",
        "age": 25
    },
    {
        "id": 1,
        "name": "george",
        "age": 225
    }
]

Jolt script :

[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "id": "[&1].id"
      }
    }
  }
]

Output :

[ {
  "id" : 1
}, {
  "id" : 2
}, {
  "id" : 1
} ]

This returns only the selected field from every record. In addition, I would like to remove the duplicates. Desired output:

[ {
  "id" : 1
}, {
  "id" : 2
} ] 

Please provide a script that does this. Thanks in advance.

milosimpson commented 8 years ago

Spec

[
  {
    "operation": "shift",
    "spec": {
      "*": { // top level array 
        "id": { // use the id as a key in map
          "*": "ids.&[]"
        }
      }
    }
  },
  {
    "operation": "shift",
    "spec": {
      "ids": {
        "*": {
          "$": "[#2].id"  // grab all the keys, which are now unique
        }
      }
    }
  }
]

This takes two shifts. The first one uses the input "id" values as keys in a map, so repeated ids collapse onto the same key. The second one iterates over those keys and puts them into a list.
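To make the two-pass idea concrete, here is a minimal plain-Java sketch of the same mechanics, without Jolt itself. The class and method names (`TwoPassDedup`, `groupById`, `keysToList`, `record`) are made up for illustration and are not part of the Jolt API. One difference to be aware of: in the Jolt spec the ids travel through map keys, so as far as I understand they may come back as strings, whereas this sketch keeps them as integers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; it only mirrors the idea behind the two shifts.
class TwoPassDedup {

    // Pass 1: use each record's "id" as a map key, so repeated ids share one entry.
    static Map<Object, List<Map<String, Object>>> groupById(List<Map<String, Object>> records) {
        Map<Object, List<Map<String, Object>>> ids = new LinkedHashMap<>();
        for (Map<String, Object> r : records) {
            ids.computeIfAbsent(r.get("id"), k -> new ArrayList<>()).add(r);
        }
        return ids;
    }

    // Pass 2: walk the (now unique) keys and emit one {"id": key} object per key.
    static List<Map<String, Object>> keysToList(Map<Object, ?> ids) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Object id : ids.keySet()) {
            Map<String, Object> m = new LinkedHashMap<>();
            m.put("id", id);
            out.add(m);
        }
        return out;
    }

    // Builds one record shaped like the input in this thread.
    static Map<String, Object> record(int id, String name, int age) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("id", id);
        m.put("name", name);
        m.put("age", age);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> input = List.of(
                record(1, "jeorge", 25),
                record(2, "manhan", 25),
                record(1, "george", 225));
        System.out.println(keysToList(groupById(input))); // prints [{id=1}, {id=2}]
    }
}
```

A `LinkedHashMap` is used so the ids come out in first-seen order.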

srinvias commented 8 years ago

First of all, thank you @milosimpson for the quick response.

Correct me if I am wrong, but the above solution only eliminates duplicate JSON records based on one field. We have scenarios where duplicates must be eliminated based on multiple fields; in the example below, those are domain, location, time, function, and unit. Please provide a Jolt spec for this. Thanks.

Or, put simply: eliminate duplicate JSON records from an array of JSON objects.

Input :

[{ "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.yahoo.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "AOI_S1", "unit": "AOI_L31" },

{ "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "ALIGN", "unit": "ALIGN2" },

{ "domain": "www.yahoo.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.google.com", "location": "texas", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.hortonworks.com", "location": "newyork", "time": "CDT UTC-0500", "function": "ALIGN", "unit": "ALIGN2" } ]

Desired output (put simply, the array with duplicate JSON records eliminated):

[{ "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.yahoo.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.google.com", "location": "texas", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.hortonworks.com", "location": "newyork", "time": "CDT UTC-0500", "function": "ALIGN", "unit": "ALIGN2" } ]

milosimpson commented 8 years ago

At best, Jolt can dedup a single field. Deduping a whole "sub-document" is not in its scope.
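Since whole-record deduplication is out of scope for Jolt, one option is to do it in plain Java before or after the Jolt step. The following is a minimal sketch under that assumption; `RecordDedup`, `distinct`, `distinctBy`, `rec`, and `sampleInput` are hypothetical names, not Jolt API. It relies on `java.util.Map` equality being structural, so two records with the same fields and values count as equal.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for deduplicating parsed JSON records outside Jolt.
class RecordDedup {

    // Drops records that are equal in every field, keeping first-seen order.
    static List<Map<String, String>> distinct(List<Map<String, String>> records) {
        return new ArrayList<>(new LinkedHashSet<>(records));
    }

    // Keeps the first record seen for each combination of the given fields.
    static List<Map<String, String>> distinctBy(List<Map<String, String>> records, String... fields) {
        Map<List<String>, Map<String, String>> firstSeen = new LinkedHashMap<>();
        for (Map<String, String> r : records) {
            List<String> key = new ArrayList<>();
            for (String f : fields) {
                key.add(r.get(f));
            }
            firstSeen.putIfAbsent(key, r);
        }
        return new ArrayList<>(firstSeen.values());
    }

    // Builds one record shaped like the multi-field input in this thread.
    static Map<String, String> rec(String domain, String location, String function, String unit) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("domain", domain);
        m.put("location", location);
        m.put("time", "CDT UTC-0500");
        m.put("function", function);
        m.put("unit", unit);
        return m;
    }

    // The seven input records from the thread, in the same order.
    static List<Map<String, String>> sampleInput() {
        return List.of(
                rec("www.google.com", "newyork", "PACK", "PACK_ESR"),
                rec("www.yahoo.com", "newyork", "PACK", "PACK_ESR"),
                rec("www.google.com", "newyork", "AOI_S1", "AOI_L31"),
                rec("www.google.com", "newyork", "ALIGN", "ALIGN2"),
                rec("www.yahoo.com", "newyork", "PACK", "PACK_ESR"),
                rec("www.google.com", "texas", "PACK", "PACK_ESR"),
                rec("www.hortonworks.com", "newyork", "ALIGN", "ALIGN2"));
    }

    public static void main(String[] args) {
        List<Map<String, String>> input = sampleInput();
        System.out.println(distinct(input).size());                         // prints 6
        System.out.println(distinctBy(input, "domain", "location").size()); // prints 4
    }
}
```

With the seven-record input above, only the repeated www.yahoo.com record is an exact duplicate, so `distinct` leaves six records. Keying on domain and location with `distinctBy` keeps four records, which are the same four listed in the desired output.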

srinvias commented 8 years ago

Thank you @milosimpson. Can Jolt delete duplicate JSON records in an array of JSON objects? If yes, may I know what the script would be?

ravi21588 commented 7 years ago

Hi Milo, I have a similar scenario. What if I need to dedup using id and also keep the other elements in the target JSON, like below?

Input :

[
    {
        "id": 1,
        "name": "jeorge",
        "age": 25
    },
    {
        "id": 2,
        "name": "manhan",
        "age": 25
    },
    {
        "id": 1,
        "name": "george",
        "age": 225
    }
]

Spec:

[
  {
    "operation": "shift",
    "spec": {
      "*": { // top level array 
        "id": { // use the id as a key in map
          "*": "ids.&[]"
        }
      }
    }
  },
  {
    "operation": "shift",
    "spec": {
      "ids": {
        "*": {
          "$": "[#2].id"  // grab all the keys, which are now unique
        }
      }
    }
  }
]

Desired output:

[
    {
        "id": 1,
        "name": "jeorge",
        "age": 25
    },
    {
        "id": 2,
        "name": "manhan",
        "age": 25
    }
]

hashcodemaster commented 5 years ago

Hi @milosimpson, I have a similar scenario. What if I need to dedup using id and also keep the other elements in the target JSON, like below?

Input :

[
    {
        "id": 1,
        "name": "jeorge",
        "age": 25
    },
    {
        "id": 2,
        "name": "manhan",
        "age": 25
    },
    {
        "id": 1,
        "name": "george",
        "age": 225
    }
]

Spec:

[
  {
    "operation": "shift",
    "spec": {
      "*": { // top level array 
        "id": { // use the id as a key in map
          "*": "ids.&[]"
        }
      }
    }
  },
  {
    "operation": "shift",
    "spec": {
      "ids": {
        "*": {
          "$": "[#2].id"  // grab all the keys, which are now unique
        }
      }
    }
  }
]

Desired output:

[
    {
        "id": 1,
        "name": "jeorge",
        "age": 25
    },
    {
        "id": 2,
        "name": "manhan",
        "age": 25
    }
]
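The thread does not give a Jolt spec for this case. As a fallback, here is a minimal plain-Java sketch that keeps the first full record for each id; it assumes the JSON has already been parsed into a list of maps, and `KeepFirstById`, `keepFirst`, and `record` are hypothetical names rather than Jolt API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class: dedup by "id" while keeping the other fields.
class KeepFirstById {

    // Collects records into an id-keyed map; on a repeated id the first record wins.
    static List<Map<String, Object>> keepFirst(List<Map<String, Object>> records) {
        Map<Object, Map<String, Object>> firstById = records.stream().collect(Collectors.toMap(
                r -> r.get("id"),       // dedup key
                r -> r,                 // keep the whole record as the value
                (first, later) -> first, // merge rule for repeated ids
                LinkedHashMap::new));   // preserve first-seen order
        return new ArrayList<>(firstById.values());
    }

    // Builds one record shaped like the input in this thread.
    static Map<String, Object> record(int id, String name, int age) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("id", id);
        m.put("name", name);
        m.put("age", age);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> input = List.of(
                record(1, "jeorge", 25),
                record(2, "manhan", 25),
                record(1, "george", 225));
        // prints [{id=1, name=jeorge, age=25}, {id=2, name=manhan, age=25}]
        System.out.println(keepFirst(input));
    }
}
```

Swapping the merge rule to `(first, later) -> later` would keep the last record per id instead.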