RalphTro / epcis-event-hash-generator

ALGORITHM and SOFTWARE PROTOTYPE to uniquely identify/validate the integrity of any EPCIS event through a common, syntax-agnostic approach based on hashing. Takes an EPCIS Document (formatted in either XML or JSON-LD) and returns the corresponding hash value(s).
MIT License
8 stars 4 forks source link

Support for ignoring repo specific fields through configurable parameters #109

Closed dakbhavesh closed 4 months ago

dakbhavesh commented 1 year ago

While adding support for generating hash from EPCIS Query Document we observed that query document may be enriched with repository-specific fields like "hash", "transactionId", "captureID", etc. To ensure the consistency of generated hash must ignore such fields. Such fields are hard to know in advance as each EPCIS repository may behave uniquely and have its own nomenclature. Thus, it would be nice to add a parameter that can be used to provide fields to ignore externally.

The parameter may look like below:

Name: ignoreFields Values: comma separated name of the fields to ignore

--ignoreFields hash,captureID

sboeckelmann commented 1 year ago

If we to support this kind of feature officially, ignoredFields should be added to hash url i.e. ni:///sha-256;b0c8cf9e9a733fb18a0134bfb0f64e701432c6bebda06124a1e62f087d71a333?ver=CBV2.0&ignoreFields=captureID,hash

Otherwise it's not possible to regenerate hash for verification purposes. This request doesn't seem to be completely thought through. Maybe it's better to not allow any additional fields in events returned witth EPCISQueryDocument Only fields that are excluded from hash generation by default should be allowed to be added in query (eventID, recordTime and error declaration related one).

@RalphTro what's your take on this?

RalphTro commented 1 year ago

Hi @dakbhavesh and @sboeckelmann, When Bhavesh initially suggested to tackle this issue through an additional parameter (which I think is a an appropriate way!), I was thinking of a command line parameter (similar to e.g. "-j '\n'" if you want our module to return all key-vale pairs as part of the pre-hash string on a new line for better readability). Note that an eventIDfield can also be populated by a UUID, or even be empty. So, with that in mind, what is your take on this alternative solution variant?

sboeckelmann commented 1 year ago

Having this parameter is not necessarily wrong. But the root cause, which led @dakbhavesh to making this feature request (repo specific fields) must be taken care of by the repo itself and not resolved by adding a feature to the hash generator. ;-) Because the repo in this case, simply doesn't return the exact event that was captured and used for generating the hash id. I will pick this up in a separate discussion with @dakbhavesh

dakbhavesh commented 1 year ago

Dear @RalphTro & @sboeckelmann ,

Posting here the proposed idea we brainstormed for reference:

{
  "@context": [
    "https://ref.gs1.org/standards/epcis/2.0.0/epcis-context.jsonld",
    {
      "example": "http://openepcis.io/schema.jsonld", // vendor-specific schema
      "hashGen": "http://some.well.known.domain/schema.jsonld"
    }
  ],
    "hashGen:ignoreFields": ["hashGen:hash", "example:captureID"],
    "example:captureID": "",
    "hashGen:hash": "",
    .....
}

The above structure ensures that EPCISQueryDocument remains JSON-LD compliant.

Incorporating the above will ensure consistency of hash between captured and query document which is the goal for raising this issue.

RalphTro commented 1 year ago

Dear @dakbhavesh ,

Many thanks! Based on your ideas, I just compiled a concrete example we could use for illustration purposes. Voilà:

{
    "@context": [
      "https://ref.gs1.org/standards/epcis/2.0.0/epcis-context.jsonld",
      {
        "example": "https://ns.example.com/epcis/",
        "hashGen": "https://some.well.known.domain/schema.jsonld"
      }
    ],
    "type": "EPCISQueryDocument",
    "schemaVersion": "2.0",
    "creationDate": "2005-07-11T11:30:47.0Z",
    "epcisBody": {
      "queryResults": {
        "subscriptionID": "32d2aec1-a6d2-46d9-900a-24124288cce1",
        "queryName": "SimpleEventQuery",
        "hashGen:ignoreFields": ["hashGen:hash", "example:captureID"],
        "resultsBody": {
          "eventList": [
            {
                "type": "ObjectEvent",
                "eventTime": "2019-10-21T14:45:00Z",
                "recordTime": "2023-06-02T07:03:16.073Z",
                "eventTimeZoneOffset": "+01:00",
                "eventID": "ni:///sha-256;b3d481c5590757ab8361a6cb0e924820feb3b61eb568e7d7703f21a96f76f51d?ver=CBV2.0",
                "epcList": [
                    "https://id.gs1.org/00/040123451111111127"
                ],
                "action": "OBSERVE",
                "bizStep": "inspecting",
                "disposition": "in_progress",
                "readPoint": {
                    "id": "https://id.gs1.org/414/4012345000245/254/1"
                },
                "hashGen:hash": "ni:///sha-256;b3d481c5590757ab8361a6cb0e924820feb3b61eb568e7d7703f21a96f76f51d?ver=CBV2.0",
                "example:captureID": "cc90bbe4-7f3a-4abd-b150-d4371c72a64f"
            }
          ]
        }
      }
    }
  }

Does this resonate with you?

dakbhavesh commented 1 year ago

Dear @RalphTro , In your example above, shall we keep the values of "eventID" & "hashGen:hash" the same to avoid confusion?

sboeckelmann commented 1 year ago

Just a few thoughts:

And yes, @dakbhavesh - I also think, if hash is used for eventID, values should match. Unless we want to add some really funky edge-case example with eventTime, recordTime being ignored, which can be a valid use case for identifying certain duplication issues. But that's fetching too far and would cause a lot of confusion 😄

dakbhavesh commented 1 year ago

Hi @sboeckelmann,

I agree with your idea to ignore hashGen:* fields by default.

For your second point, I am curious, what benefit we will gain by providing an array of event hashes? Wouldn't businesses agree beforehand on supported algorithms?

RalphTro commented 1 year ago

@dakbhavesh : I think I now finally understand the purpose of the field hashGen:hash: I am correct in assuming that this conveys the event's hash value e.g. to cater for cases where an event does not have an eventIDfield or the latter is populated with a UUID? (Yesterday, I thouht that this hash also takes fields into account that should be ignored; that's why the values were different :-)) Happy to adjust it.

@sboeckelmann: As to "let's have the hash generator(s) always ignore hashGen related attributes. Which means there would be no need to include them in hashGen:ignoreFields, they shall always be ignored by default.": how should an accessing application know about that? As it is not officially specified anywhere, wouldn't you agree that it was better to make this more explicit?

As to "what do you think about being able to provide an array of event hashes": I agree with Bhavesh: what is the benefit of this? If there is a need for going for another sha, a company/industry consortium can choose to use it in the eventIDfield today.

sboeckelmann commented 1 year ago

@RalphTro : always ignore hashGen related attributes The reason why I was suggesting not to add hashGen extension related fields explicitly is, because hashGen will only be active if

  1. Server supports this extension
  2. Client explicitly asks for it by adding the extension to the GS1-Extensions HTTP Request Header

which means these fields don't show up automagically. A client will always be aware of hashGen related implementation details. I do understand that calling them out explicitly may be helpful and I wouldn't be opposing against it. However, there are already a couple of fields being implicitly irgnored by the hash-generator implementation(s) (eventId, recordTime and error declaration). If we want to be 100% explicit about ignored fields, these fields should also be taken into consideration. My personal opinion is, that fields that are anyway always or implicitly ignored by the hash generator may not have to be included in ignoreFields, because clients will be aware of the hashGen extention's implementation details.

@RalphTro , @dakbhavesh array of event hashes For the initial vanilla use cases it may seem correct that static setting of hash algorithm may be sufficient. Especially when having in mind, that most commonly we will see usage of sha256. I think we are safe to support single hash value only if a client can request a specific algorithm by declaring it in a specific HTTP Request Header. I am not convinced that EPCIS query request being made will only be targeting for a single company, industry or consortium. Just think of a transport agency that wants to provide EPCIS services to their customers. Their customers may be coming from a wide variety of industries, having different hash algorithm requirements. In the future we may have even more EPCIS services, that are not bound to a single industry. If EPCIS 2.0 Semantic Web and Linked Data is being picked up and adopted by the industries in such a way that I am hoping for, there shouldn't be a built-in limitation.

RalphTro commented 1 year ago

As to hashGen fields: I got your point. If the client specifically requests these fields, I tend to agree with you. (+ considering to add "automagically" to my vocabulary ;-))

As to event hashes array. TBD/need more time to think this through. How do you like the idea to wait at least for a first implementation in industry where multiple hash algorithms are required?

sboeckelmann commented 1 year ago

Re: multi hash array We can keep it as a single hash for now. But we have to add some functionality for the client to set desired algorithm.

RalphTro commented 7 months ago

Dear @dakbhavesh and @sboeckelmann , Suggestion for tomorrow's call: let's try and implement support for this feature. :-) Here is how it should work from my POV:

Report back to the Visibility SMG about this matter once we have implemented it.

WDYT?

sboeckelmann commented 7 months ago

Report back to the Visibility SMG about this matter once we have implemented it.

WDYT?

I like the idea of reporting back to the SMG, let's discuss with Craig tomorrow. The EPCIS Event Hash get more attention and awareness from the SMG.

dakbhavesh commented 7 months ago

I just got back in the context of the topic :) As we agree with the approach, I believe I can pick this up for adjusting logic to ignore the fields specified in ignoreFields. Working logic may not be present by tomorrow however I will try to get it working as soon as possible, most likely in a week's time. Would that be okay?

RalphTro commented 7 months ago

Dear @dakbhavesh , This would be fantastic! Many thanks for your kind offer to advance this. I am happy to support through testing. Looking forward to speaking with you tomorrow!

dakbhavesh commented 7 months ago

Dear @RalphTro , @sboeckelmann ,

Summarising our discussion here:

Please feel free to add any missing points on top of the above. Also, feel free to ask in case of any queries :)

RalphTro commented 7 months ago

Dear @dakbhavesh , This sounds like a great summary. Agreed + thanks!
To enable better understanding ('hashGen' may be confusing to some people ;-)), we may could use a different URI prefix than 'hashGen', e.g. 'repository-x:ignoreFields' (tied to e.g. 'repository-x.example.com'). This might convey of what we have in mind in a better way, as we could explain that fields such as 'captureID' were added by a fictitious EPCIS repository solution named 'repository-x'.
What is your take on this? Kind regards, Ralph

dakbhavesh commented 7 months ago

Hi @RalphTro, I agree with your suggestion to keep the prefix and URL as you suggested. It is a better alternative than some technical name like hashGen.

dakbhavesh commented 6 months ago

Dear @RalphTro, I am almost there with adjusting logic. Thought to keep you posted on progress. I am going to create a PR soon once sufficient testing and test cases are in place. Thanks for your patience :)

dakbhavesh commented 5 months ago

Dear @RalphTro, I have adjusted the logic to incorporate support for ignoring fields from EPCIS XML document and query documents. Please have a look at PR and feel free to give it a try.

Regarding @Echsecutor proposal to add the ignored fields into the NI. I think we should take it up separately as it may make the generated hash look complex.

For example, we need to use a fully qualified name in NI for the ignored fields <example:testField1/> & <example:testField1/> ni:///sha-256;b3d481c5590757ab8361a6cb0e924820feb3b61eb568e7d7703f21a96f76f51d?ver=CBV2.0&ignoredFields={https://ns.example.com/epcis/}testField1,{https://ns.example.com/epcis/}testField2

Please share your thoughts.

FYI: @sboeckelmann , @Echsecutor

RalphTro commented 5 months ago

I suggest that we keep this issue open until we have reached consensus in the course of our next alignment call.

RalphTro commented 4 months ago

For now, reached consensus. Additional thought: if EPCIS events need to be processed stand-alone afterwards, the repository-specific fields (indicated in ignore-fields) need to be redacted/stripped off before passing them further.

Notwithstanding the latter, in an EPCIS query doc, it makes more sense to actually convey that information once, i.e. in the document (no in each and every event).