Improved event data format JSON

9inpachi commented 4 years ago

Hi all,

So, about the event data JSON: Edward and I discussed a new format that is about 40% smaller.

Previous Data Format:

{
  "event number": 000,
  "run number": 000,
  "other event info": 000,
  "ObjectType": {
    "ObjectTypeCollection": [
      {
        "PhysicsObjectParameter1": "PhysicsObjectParameter1_VALUE",
        "PhysicsObjectParameter2": "PhysicsObjectParameter2_VALUE"
      }
    ]
  }
}

New Data Format (inspired by CMS ".ig"):

{
  "event number": 000,
  "run number": 000,
  "other event info": 000,
  "ObjectType": {
    "ObjectTypeCollection": {
      "types": [
        "PhysicsObjectParameter1", "PhysicsObjectParameter2"
      ],
      "data": [
        [
          "PhysicsObjectParameter1_VALUE", "PhysicsObjectParameter2_VALUE"
        ]
      ]
    }
  }
}

The advantage of the new data format is that we no longer need object keys to identify each parameter of a physics object in a collection. Using keys as parameter identifiers takes a lot of space because the keys are duplicated for every physics object. Instead, a single array named "types" lists the parameter names once per collection, and each row in "data" stores the values in the same order, so a parameter is identified by its index.
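As a rough illustration of the idea, here is a minimal Python sketch that turns one collection from the previous format into the "types"/"data" layout. The function name `to_compact` and the sample parameters `pT` and `eta` are made up for this example and are not part of Phoenix.

```python
import json

def to_compact(collection):
    """Convert a list of per-object dicts into {"types": [...], "data": [[...], ...]}."""
    # Collect parameter names in first-seen order so every row shares one column layout.
    types = []
    for obj in collection:
        for key in obj:
            if key not in types:
                types.append(key)
    # One row per physics object; a parameter an object lacks becomes None (JSON null).
    data = [[obj.get(key) for key in types] for obj in collection]
    return {"types": types, "data": data}

# Hypothetical collection in the previous format.
tracks = [
    {"pT": 12.5, "eta": 0.3},
    {"pT": 48.1, "eta": -1.2},
]
compact = to_compact(tracks)
print(json.dumps(compact))
```

The parameter names appear once in "types" instead of once per object, which is where the saving would come from.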

There are a few open questions I would like to discuss:

1. Should we put the types array directly inside ObjectType or inside each ObjectTypeCollection? Can object parameters differ between collections?
2. How should we integrate it with Phoenix?
   - Do we completely change the framework to support this format for generating PhoenixObjects and processing each collection?
   - Do we keep the framework as is and convert the new data format to the old one so the framework can process it as usual?

That's all. (I know it's a lot, but since we are about to change an integral part of Phoenix, it would be better to clarify everything first.)

Peace. :)

9inpachi commented 4 years ago

So, adding to what I said earlier. The size difference is actually remarkable! The more data we have, the more space we save.

Take a look at the same JSON data converted to the new format:

EdwardMoyse commented 4 years ago

Hi @9inpachi, thanks for this! Truly impressive size reductions!

So, after some reflection I think:

1) We should put types into ObjectType: I know for ATLAS that e.g. Tracks can have different content depending on the collection. It would be okay to do it the other way around (we would just make types the superset) but since I don't think this will have a significant impact on the size, I think the extra flexibility and clarity is worth it. We could actually do both - if types are defined in the collection then they aren't needed in the ObjectType, but otherwise they are?

2) Probably we should just completely change the framework. The only pain from my side is I will need to rewrite the dumping functions in Athena. I guess we could have two versions of the format - and call this compact phoenix format JSON? But we shouldn't overcomplicate the code.
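The "do both" option could look something like the sketch below: a collection-level "types" array wins, and a shared one on the ObjectType is the fallback. This is only an assumption about the layout; in particular, storing the shared array under a "types" key next to the collections, and the names `resolve_types`, `CombinedTracks`, and `ExtendedTracks`, are hypothetical.

```python
def resolve_types(object_type, collection_name):
    """Return the parameter names that apply to one collection."""
    collection = object_type[collection_name]
    # A "types" array on the collection itself takes precedence.
    if "types" in collection:
        return collection["types"]
    # Otherwise fall back to the array shared by the whole object type.
    return object_type["types"]

# Hypothetical object type: one collection uses the shared layout, one overrides it.
tracks = {
    "types": ["pT", "eta"],
    "CombinedTracks": {"data": [[12.5, 0.3]]},
    "ExtendedTracks": {"types": ["pT", "eta", "chi2"], "data": [[12.5, 0.3, 1.1]]},
}

print(resolve_types(tracks, "CombinedTracks"))
print(resolve_types(tracks, "ExtendedTracks"))
```

One drawback of this layout is that "types" sits at the same level as the collection names, so a reader has to skip that key when iterating over collections.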

Of course, compressing the files would make them REALLY tiny!

9inpachi commented 4 years ago

> if types are defined in the collection then they aren't needed in the ObjectType, but otherwise they are?

This is certainly possible, but it would be up to whoever generates the files to specify all this information, which I think might be troublesome. Putting types inside each collection separately is more flexible, I think, and it would add no more than 2-3 KB of data.

> The only pain from my side is I will need to rewrite the dumping functions in Athena.

You don't actually need to. Just keep using the functions you currently have. There is a function in Phoenix that converts the older JSON format to the newer one and downloads the file.

> I guess we could have two versions of the format - and call this compact phoenix format JSON? But we shouldn't overcomplicate the code.

Yes, this has been my approach so far, which is why I didn't change the framework at all. I think it's better if we support both formats. The current Phoenix format is easier to handle in the code, so we should just convert the compact Phoenix format to the current one and that should be it.
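A minimal Python sketch of that conversion direction (compact back to the current format), assuming "types" lives inside each collection. The names `from_compact` and `expand_event` and the sample event are hypothetical; this is not the actual Phoenix function.

```python
def from_compact(compact):
    """Expand {"types": [...], "data": [[...], ...]} into a list of per-object dicts."""
    types = compact["types"]
    # Pair each value with the parameter name at the same index.
    return [dict(zip(types, row)) for row in compact["data"]]

def expand_event(event):
    """Expand every compact collection in an event; scalar event info is copied as is."""
    expanded = {}
    for key, value in event.items():
        if isinstance(value, dict):
            # value is an object type: a mapping of collection name -> compact collection.
            expanded[key] = {name: from_compact(coll) for name, coll in value.items()}
        else:
            expanded[key] = value
    return expanded

# Hypothetical compact event.
compact_event = {
    "event number": 1,
    "run number": 2,
    "Tracks": {
        "CombinedTracks": {"types": ["pT", "eta"], "data": [[12.5, 0.3], [48.1, -1.2]]},
    },
}
current_event = expand_event(compact_event)
print(current_event["Tracks"]["CombinedTracks"][0])
```

With a step like this at load time, the rest of the framework would keep seeing the current format only.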

9inpachi commented 4 years ago

Hi @EdwardMoyse,

So I was finalizing this and I have some really bad news that renders all the conversion functions useless. The sizes I compared were for the current format with spacing (formatted JSON) and the new format as minified JSON (no spacing), which naturally decreases the size. In actuality, the size difference when both formats are minified is only 9 KB (229 KB for the current format and 220 KB for the new format). :(
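For anyone repeating the comparison, a small Python sketch of measuring both formats under the same serialization settings. The data here is made up, so the numbers will not match the 229 KB and 220 KB above; the helper name `sizes` is hypothetical.

```python
import json

def sizes(obj):
    """Return (formatted_bytes, minified_bytes) for the same JSON value."""
    formatted = len(json.dumps(obj, indent=2).encode("utf-8"))
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    minified = len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))
    return formatted, minified

# Made-up collection in the current format, and the same data in the compact format.
verbose = [{"pT": float(i), "eta": 0.1 * i} for i in range(100)]
compact = {"types": ["pT", "eta"], "data": [[o["pT"], o["eta"]] for o in verbose]}

verbose_formatted, verbose_minified = sizes(verbose)
compact_formatted, compact_minified = sizes(compact)

# Compare like with like: minified against minified.
print("current format, minified:", verbose_minified, "bytes")
print("compact format, minified:", compact_minified, "bytes")
```

Comparing formatted output of one format against minified output of the other mixes two effects, whitespace removal and key deduplication, which is what inflated the original estimate.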

EdwardMoyse commented 1 year ago

I guess we can close this one? (Re-open if you disagree!)

HSF / phoenix

Improved event data format JSON #108