building-envelope-data / api

API specification to exchange data about building envelopes
MIT License

Questions about DataApproval in database.graphql #187

Open StephenCzarnecki opened 3 years ago

StephenCzarnecki commented 3 years ago

We have some questions about the DataApproval type that seem to deserve their own issue. The discussion focuses on DataApproval in particular, but it should extend to the Approval type as well, since DataApproval is an implementation of Approval. We focus on DataApproval for now because the IGSDB has the use case of offering NFRC and AERC approvals.

These initial questions revolve around the query and response fields. First to quote from the database.graphql document about the process for institutions to approve data:

Steps to approve data:
1. An institution adds data to a database.
2. Some institution (may be the same) queries the data with a GraphQL query.
3. The latter institution reviews the data and, if correct, signs it with one
   of its GnuPG signing keys.
4. The institution adds its approval of the data to the database.
Storing the GraphQL query with the signature is necessary because it needs to be
known exactly which JSON data was signed.

Next a concrete example from the IGSDB. Please correct any of the following if it is mistaken in terms of the DataApproval process.

Currently the IGSDB has two potential institutions that can be sources of approval for data in the IGSDB: NFRC and AERC. Each institution is free to potentially define which pieces of data are used as part of their approval. So if some record is approved by both the NFRC and the AERC the fields used by each institution may not be the same.

And, more granularly, each is potentially free to define which data are used for different product types. So AERC may potentially require some data for say woven shades and different data for perforated screens.

These specific fields that each institution requires for each product are stored in the query field for that specific approval for that specific record. So even if an institution changes the data used in its approval process it can still be verified that previous approvals were valid because the fields that were used at the time are recorded in the query.

Next some questions, assuming that the above example is at least mostly correct.

Question 1: It seems the query mentioned in step 2 of the approval process will include data that is not used by the ICON metabase itself, because step 3 is the institution’s review process and that may involve other data. For example, the NFRC approval process involves the measured wavelength data. That data is present in the IGSDB and can be retrieved by using the locator field in the resources section of the OpticalData response type, but it is not actually used by the ICON metabase. Is this correct?

Question 2 (if question 1 is correct): Since the query that is stored for the review process is a GraphQL query, and that may contain data that is not used by the ICON metabase, does this mean that the client databases must implement a GraphQL API for at least the data required by the approving institutions?

If the previous example is correct, this seems to be the case: even though, as @simon-wacker noted in issue #186, the ICON metabase will never actually execute the query itself, the IGSDB still needs to implement it for this DataApproval functionality to work. For example, if the AERC approval process involves the BSDF data, then it seems the IGSDB will need to provide GraphQL access to the BSDF data.

Question 3 (if question 1 is correct): Is the data stored in the response field the JSON returned by executing the query? If the query does involve measured data, then this response is potentially quite large and seems to be the sort of data that the ICON metabase did not want to deal with. To continue the above example, if the AERC approval process involves the BSDF data, would that be contained in the response field as well?

Finally, an attempt to create an example approvals field in an OpticalData object returned by the IGSDB for the most general case currently possible: a record that is approved by both the NFRC and the AERC (even if such a case does not currently exist). Depending on how much of the above is mistaken, please feel free to disregard it.

"approvals": [
  {
    "timestamp": "2020-10-02T15:54:55-04:00", # Timestamp for when the data was approved by NFRC
    "signature": "t1dqz6IwCfQ7wP6...", # Some cryptographic string created by using the NFRC signing key and the result of executing the query
    "keyFingerprint": "0D69 E11F 12BD...", # The fingerprint for the NFRC signing key
    "query": "some_graphql_query_including_measured_data", # A query whose response contains all of, and only, the data used for the NFRC approval process for this record
    "response": "some_json_including_measured_data", # The result of executing the query
    "approverID": "e41f62f3-b9bc-4f8e-b684-62ae9e84d338" # The uuid for NFRC in the ICON metabase
  },
  {
    "timestamp": "2019-05-02T12:54:55-04:00", # Timestamp for when the data was approved by AERC
    "signature": "YUtgiTGgyt...", # Some cryptographic string created by using the AERC signing key and the result of executing the query
    "keyFingerprint": "1FD3 2114 FFE4...", # The fingerprint for the AERC signing key
    "query": "some_other_graphql_query_including_measured_data", # A query whose response contains all of, and only, the data used for the AERC approval process for this record
    "response": "some_other_json_including_measured_data", # The result of executing the query
    "approverID": "6a176dab-0490-4af6-a12e-839254e9ea16" # The uuid for AERC in the ICON metabase
  }
]

simon-wacker commented 3 years ago

I believe I can answer all questions with one clarification. Right now, the way the GraphQL API is designed, with a GraphQL query, you cannot ask for specific values within the resource holding the actual data (other than the values that are "mirrored" like nearnormalHemisphericalTransmittances; but these values are not meant to make approving institutions select specific values to approve). In other words, an approval is for all data and could merely exclude some meta information about the data like the description field (you could exclude that from the GraphQL query). The resource itself is indirectly included in an approval through the resource's SHA256 hash value (this one should be part of every query used in an approval).
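To illustrate the last point, here is a minimal Python sketch of checking a resource against the hashValue recorded in an approval. It assumes the hash is taken over the raw bytes served at the resource's locator; the function names and the sample bytes are hypothetical and not part of the API.

```python
import hashlib


def sha256_hex(resource_body: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the raw resource bytes."""
    return hashlib.sha256(resource_body).hexdigest()


def resource_matches(resource_body: bytes, expected_hash_value: str) -> bool:
    """Compare the digest of the resource with the hashValue from an approval."""
    return sha256_hex(resource_body) == expected_hash_value.lower()


# Made-up stand-in for the bytes served at a resource's locator.
body = b'{"example": "measured data"}'
stored_hash = sha256_hex(body)

print(resource_matches(body, stored_hash))          # unchanged resource
print(resource_matches(body + b" ", stored_hash))   # one extra byte
```

Because the digest covers every byte, the approval implicitly covers the whole resource, which is why approving only selected values inside the resource does not fit this approach.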

If we need to be able to approve only some values within the resource, then the current approach to approvals does not work and we need to think of something else.

Also what is missing from approvals right now is a statement. So what does an approval by, for example, NFRC mean? This could be implicit but maybe it is better to make such a statement explicit.

If you feel that one of the questions is still open, then please let me know and I'll give it another try.

StephenCzarnecki commented 3 years ago

@simon-wacker Thank you, I think your clarification helps immensely. Let me try to create an example running through the data approval process to make sure. Again please correct any of the following.

Say that the NFRC wishes to approve the data returned by this url:

https://igsdb.lbl.gov/api/v1/products/363

Side note: currently that url returns product data in the existing IGSDB json format. But in the future the IGSDB may add something like a format option in the query string to allow for returning optical data that conforms with the ICON opticalData.json schema like

https://igsdb.lbl.gov/api/v1/products/363?format=ICON

Or potentially implement an entire graphQL api for the data. Since this is uncertain and does not exist yet for this example I would like to just use the existing url as the resource.

End side note.

In any case for this example assume that there is data including measured wavelength data behind https://igsdb.lbl.gov/api/v1/products/363

Data approval step 1 (An institution adds data to a database.):

Data approval step 2 (Some institution (may be the same) queries the data with a GraphQL query.):

Data approval step 3 (The latter institution reviews the data and, if correct, signs it with one of its GnuPG signing keys.):

Data approval step 4 (The institution adds its approval of the data to the database.):

After step 4 is complete the IGSDB is able to generate the following DataApproval response for UUID 12ba0b75-a6a5-424b-a01f-7aae665482ac when requested:

{
  "timestamp": "2021-02-21T12:00:00-08:00",
  "signature": "iHUEABEIAB0WIQQg8IURMSKBL/r7WqqDBHuHYt2u2AUCYDlItAAKCRCDBHuHYt2u2H3PAP9D+JCzwHdCfKqRX9n0zm1qwiqWNwfTEE5xVJz2aJff2gEAtpSU0YBrSXmRwWuAhwb9iSxzGkacFac4D7hy7q2PQ0E==fDo4",
  "keyFingerprint": "15E4544A88EEB81EAF65229038CEC5E499AE24A9", # Assume that this is the fingerprint for the NFRC signing key.
  "query": "query{data(id:\"12ba0b75-a6a5-424b-a01f-7aae665482ac\",timestamp:\"2021-02-21T12:00:00-08:00\",locale:\"en-GB\"){name,warnings,resources{hashValue,locator}}}",
  "response": "{\"data\":{\"name\":\"Generic Clear Glass\",\"warnings\":[],\"resources\":[{\"hashValue\":\"bca45733b010c3b0b8f940dd7f878ce9a679210d449768fa4dd55692684b64db\",\"locator\":\"https://igsdb.lbl.gov/api/v1/products/363/\"}]}}"
}

simon-wacker commented 3 years ago

Regarding the question

Is this the same timestamp as the timestamp in the query in step 2?

in step 4: Yes, it is just for convenience, so that it need not be extracted from the query property of an approval. The timestamp is needed, for example, when a person wants to query the metabase (IKDB) for the data format as it was at the time the approval process took place.

I would require data id and timestamp to be included in the query in step 2. Otherwise, the GnuPG signature does not associate resource data with the unique data identifier and a specific time (note that the meta information about the approval, that is, the JSON you posted in step 4, is itself not signed). So, not including id and timestamp could be problematic because in that case the signature of the approval could also be used for another data record with a different id and another timestamp.
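As a rough Python sketch of such a check, a database could verify that the stored query names the same data id and timestamp as the approval before accepting it. The helper names are hypothetical, and the regular expression is a naive stand-in for a real GraphQL parser that only handles string arguments written like `id:"..."`.

```python
import re
from typing import Optional


def extract_argument(query: str, name: str) -> Optional[str]:
    """Return the string value of an argument such as id:"..." or None if absent."""
    match = re.search(rf'\b{re.escape(name)}\s*:\s*"([^"]*)"', query)
    return match.group(1) if match else None


def query_binds_approval(query: str, data_id: str, timestamp: str) -> bool:
    """True if the signed query pins down both the data id and the timestamp."""
    return (
        extract_argument(query, "id") == data_id
        and extract_argument(query, "timestamp") == timestamp
    )


# Query taken from the example approval earlier in this thread.
example_query = (
    'query{data(id:"12ba0b75-a6a5-424b-a01f-7aae665482ac",'
    'timestamp:"2021-02-21T12:00:00-08:00",locale:"en-GB")'
    "{name,warnings,resources{hashValue,locator}}}"
)

print(query_binds_approval(
    example_query,
    "12ba0b75-a6a5-424b-a01f-7aae665482ac",
    "2021-02-21T12:00:00-08:00",
))
```

A query without these two arguments would fail the check, which mirrors the concern that its signature could otherwise be reused for another record or another point in time.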

In the comment Approval#query it actually says

It does neither include other data approvals by third parties nor the response approval by the database. All other fields and sub-fields of this GraphQL schema at the time given by timestamp are included. Despite these restrictions specifying the query explicitely is necessary because approvals shall not become invalid when the GraphQL schema changes.

By changes here I meant non-breaking changes, like adding fields or renaming a field by marking the old one as deprecated and adding a new one with the new name. Similar to what I said above, the rationale behind that requirement was that not only the data itself is signed but also how it came about (by including appliedMethod), who measured or simulated it (by including creatorId), which exact format it is in (by including formatId), and so forth. This makes sure that the same signature cannot be used maliciously as approval of the resource data with reference to a different formatId than in the original approval, which could change the whole meaning of the signed data.

Again, I hope these somewhat confused explanations make sense (my mind is rather unfocused these days).

simon-wacker commented 3 years ago

Oh, and what I forgot: The example is exactly how I envisioned approvals to be created. There are also some explanations of other aspects in the comment Approval, in particular, some best practices for databases on checking approvals before adding them. The fields of the interface Approval contain further explanations, for example, on how to compare responses.

simon-wacker commented 3 years ago

And I'm happy if you point out any shortcomings, vulnerabilities, and whatnot. No other computer scientist has taken a thorough look at these ideas so far.

StephenCzarnecki commented 3 years ago

@simon-wacker Thank you for your additional clarifications, they are again very helpful. I will not have time to put together an updated example before the meeting tomorrow but have one (hopefully quick) question that came up during a discussion of your posts. And on reflection maybe this deserves its own issue, not sure.

How might a change of version of the optical data impact the approvals? For example, let's assume that the IGSDB resource used by ICON returns data based on the ICON opticalData.json schema, and that this response is what is used in the DataApproval.

Then at some point in the future the opticalData.json schema changes. Presumably IGSDB would implement those changes to remain compliant. But would that then invalidate the existing approvals and require new ones to be created?

To put this in some context, we recently ran into a similar issue that caused some extra work. We have some THERM files that are used by Radiance to create some BSDF results. Calculating those BSDF results takes hours per file. To avoid recalculating them, the results are signed with a hash of the THERM files used to create them. Without going into too much detail, there was a recent change in THERM that did not affect any results but ended up requiring the recalculation of all of the BSDF results. While we fixed it by rolling back part of the change so that existing files no longer needed recalculating, it did leave us with a renewed appreciation for issues regarding signing and versioning data.

Another reason for asking this is that the existing IGSDB REST API at least nominally includes a version in the url. It is the "v1" in the example https://igsdb.lbl.gov/api/v1/products/363

We have briefly discussed a couple of different approaches for serving optical data based on the ICON opticalData.json schema. Either adding a format flag to the existing API like

https://igsdb.lbl.gov/api/v1/products/363?format=ICON

Or creating a new API path like

https://igsdb.lbl.gov/api/ICON/products/363 or (with version) https://igsdb.lbl.gov/api/ICON/v1/products/363

But if any of those are the locator in the resource returned by the DataApproval query created in step 2 like

"resources": [
  {
    "hashValue": "bca45733b010c3b0b8f940dd7f878ce9a679210d449768fa4dd55692684b64db", # SHA256 hash of the body of the response from the locator url
    "locator": "https://igsdb.lbl.gov/api/v1/products/363?format=ICON"
  }
]

How will those approvals be handled if the opticalData.json schema changes? Because if the opticalData.json schema changes then the response from the locator query may no longer match the stored hashValue.

In terms of the two approaches we have discussed so far for returning data in the ICON format (format query string parameter vs. different url), the aim was to provide data conforming to the ICON schemas without having to implement a completely parallel GraphQL API alongside the existing REST API. The thought was that it might be easier to meet the timeframe if the IGSDB could "simply" have an additional serializer that conformed to the ICON schema.

Then the IGSDB GraphQL implementation could be limited to what is described in database.graphql. It may be the case in the future that the IGSDB moves to a full GraphQL implementation. But the thinking is that it may be easier in the present to have a limited GraphQL implementation plus the ability to serialize data in the ICON format than to maintain two complete parallel APIs.

However it initially seems that, regardless of the approach, the DataApproval process depends on the version used at the time of approval which may become deprecated and potentially removed in the future. And we are wondering how that may be handled.

simon-wacker commented 3 years ago

Yes, if the schema changes in a non-backwards compatible way and the data is transformed to conform to the new schema, then the hash value changes and all approvals would need to be recreated to still be valid.
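A small Python sketch shows why: once a property is renamed, the serialized bytes differ and so does the SHA-256 value. The property names and the compact JSON serialization are assumptions for illustration only; the real hash is over whatever bytes the locator returns.

```python
import hashlib
import json


def hash_value(document: dict) -> str:
    """SHA-256 over a deterministic JSON serialization of the document."""
    serialized = json.dumps(document, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Hypothetical record under the old schema.
old_record = {"name": "Generic Clear Glass", "thickness": 3.0}

# Same record after a non-backwards compatible rename of one property.
new_record = {"name": "Generic Clear Glass", "thicknessInMillimetres": 3.0}

print(hash_value(old_record) == hash_value(dict(old_record)))  # same data, same hash
print(hash_value(old_record) == hash_value(new_record))        # renamed property
```

The values did not change at all in the second record, yet the stored hashValue would no longer match, so the old signature no longer covers what the locator serves.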

Scenario 1: The schema changes in a backwards compatible way, for example, some new non-required property is added or some property is deprecated but still left in the schema. In that case nothing needs to be done, and only the minor or patch version of the schema would change. I would version the schema according to semantic versioning.

Scenario 2: The schema changes in a non-backwards compatible way, for example, a property is removed, renamed, or made required. In that case, I would give the schema a new major version by changing its $id. Note to myself: I need to add the major version to the $id. Something like https://www.buildingenvelopedata.org/schemas/v1/opticalData.json.

This requires the format query parameter to be versioned (or an additional parameter version) and for the backend to be able to return data conforming to an old major schema version. For example, https://igsdb.lbl.gov/api/v1/products/363?format=ICON&version=1.
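As a minimal Python sketch, a locator pinned to a major schema version could be built like this. The `format` and `version` parameter names follow the example URL above; the helper names and the idea of deriving the major version from a semantic version string are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def major_version(semantic_version: str) -> int:
    """Return the major component of a version string such as '1.4.2'."""
    return int(semantic_version.split(".")[0])


def versioned_locator(base_url: str, schema_format: str, schema_version: str) -> str:
    """Add format and major-version query parameters to a resource URL."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))  # keep any existing parameters
    query["format"] = schema_format
    query["version"] = str(major_version(schema_version))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(versioned_locator("https://igsdb.lbl.gov/api/v1/products/363", "ICON", "1.4.2"))
# https://igsdb.lbl.gov/api/v1/products/363?format=ICON&version=1
```

Only the major version goes into the URL, so minor and patch releases of the schema leave existing locators, and the approvals that reference them, untouched.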