Use JSON-LD schemas from schema.org

ferrisoxide commented 1 year ago

The lack of any formal structure in the data has always bothered me, as we basically store everything in one blob of JSON data. In the general case this is fine, but when products/items have very definite properties (e.g. a Book has an author, food items have calories, etc) it gets harder to ensure that the data makes sense - or is consumable in a repeatable way.

Proposal

Brocade.io should start using JSON-LD as the base model for all data presented via the API, using schemas published by third parties such as https://schema.org.

For instance, if we adopt the "Product" type from schema.org, we can structure product information that is both human-readable and easily processed by applications:

{
  "@context": "https://schema.org/",
  "@type": "Product",
  "name": "Lite Italian Dry Salami",
  "gtin": "00073007107096",
  "countryOfAssembly": "USA",
  "brand": {
    "@type": "Brand",
    "name": "Columbus"
  },
  "material": "processed meat"
}
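As a rough sketch of how a stored product might be serialised into this shape, something like the following Python could work. The function name `to_product_jsonld` and the flat input keys (`name`, `gtin`, `brand`) are assumptions made for illustration, not the existing Brocade.io code.

```python
import json


def to_product_jsonld(record):
    """Build a schema.org Product document from a flat record (hypothetical input shape)."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Product",
        "name": record["name"],
        "gtin": record["gtin"],
    }
    # Brand is a nested node with its own @type; only emit it when a brand is known.
    if record.get("brand"):
        doc["brand"] = {"@type": "Brand", "name": record["brand"]}
    return doc


example = to_product_jsonld(
    {"name": "Lite Italian Dry Salami", "gtin": "00073007107096", "brand": "Columbus"}
)
print(json.dumps(example, indent=2))
```

Keeping the nested Brand node optional means products without a known brand still produce a valid Product document.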

JSON-LD also enables us to add a graph for extended attributes, e.g. if nutritional information is available for a product we can use the NutritionInformation type to present these attributes in a structured manner:

{
  "@context": "https://schema.org/",
  "@type": "Product",
  "name": "Lite Italian Dry Salami",
  "@graph": [
    {
      "@type": "NutritionInformation",
      "calories": "214 kcal",
      "servingSize": "28 g",
      ...
    },
    ...
  ]
}

Other types from schema.org (e.g. Book, Movie) can be used as applicable. We can also make use of schemas published by other third parties - or our own custom types - as required, which potentially "future proofs" the underlying data model.

We can also use the type information in the frontend, letting the schema type determine the best way to present data, e.g. showing nutritional information in a table-like format (see #11). We can also use it to insert Microdata into the HTML, embedding metadata that search engines, web scrapers and the like can consume.
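One way the frontend could pick a presentation from the type is a lookup keyed on `@type`, sketched here in Python. The renderer names are hypothetical, the output is plain text rather than real HTML, and the sketch assumes `@type` is a single string (JSON-LD also allows a list).

```python
def render_nutrition(node):
    """Render a NutritionInformation node as one 'name: value' row per property."""
    rows = [(key, value) for key, value in node.items() if not key.startswith("@")]
    return "\n".join(f"{name}: {value}" for name, value in rows)


def render_generic(node):
    """Fallback for types without a dedicated renderer."""
    return "; ".join(f"{key}={value}" for key, value in node.items() if not key.startswith("@"))


# Hypothetical registry: schema.org type name -> renderer.
RENDERERS = {"NutritionInformation": render_nutrition}


def render(node):
    renderer = RENDERERS.get(node.get("@type"), render_generic)
    return renderer(node)


print(render({"@type": "NutritionInformation", "calories": "214 kcal", "servingSize": "28 g"}))
```

New types only need an entry in the registry; anything unregistered still renders through the fallback.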

Benefits

Leverage existing data structures and tools
Consistent results from the API
Helps inform the presentation in the UI

Risks / Possible Problems

Harder to parse incoming data
Large amount of existing data that needs to be processed

We can mitigate the second problem by processing individual products on demand and progressively updating the data. The problem of parsing incoming data requires a bit more thought and investigation, but it looks solvable.
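The on-demand approach could look roughly like the Python sketch below: a product is converted the first time it is requested and the result is kept for later reads. The `store` dict stands in for the database, and `fetch_product`, `convert` and the `jsonld` key are all assumed names, not part of the current codebase.

```python
def fetch_product(store, gtin, convert):
    """Return the JSON-LD form of a product, converting legacy data on first read."""
    row = store[gtin]
    if "jsonld" not in row:
        # Legacy row: convert once and keep the result next to the original data.
        row["jsonld"] = convert(row)
    return row["jsonld"]


def convert(row):
    """Placeholder conversion; a real one would also map the properties blob."""
    return {"@context": "https://schema.org/", "@type": "Product", "name": row["name"]}


# In-memory stand-in for the products table.
store = {"00073007107096": {"name": "Lite Italian Dry Salami"}}

first = fetch_product(store, "00073007107096", convert)
second = fetch_product(store, "00073007107096", convert)  # served from the stored copy
```

Products that are never requested are never converted, so a background job would still be needed if the whole dataset has to be migrated eventually.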

ferrisoxide commented 1 year ago

Notes

Checking unique keys in the products.properties field:

select distinct(json_data.key)
from products, jsonb_each(products.properties) as json_data

returns

 potassium
 weight_g
 fiber
 carbohydrate
 sugars
 unit_count
 size
 author
 monounsaturated_fat
 volume_ml
 ingredients
 polyunsaturated_fat
 fat_calories
 alcohol_by_volume
 calories
 weight_ounce
 servings_per_container
 trans_fat
 format
 saturated_fat
 pages
 fat
 volume_fluid_ounce
 sodium
 serving_size
 protein
 publisher
 cholesterol

There are only a small number of keys to worry about - and most will map to schema.org types. I think I'll go ahead and start introducing JSON-LD to the API.
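A first pass at the mapping for the nutrition-related keys might look like this Python sketch. The targets are property names from schema.org's NutritionInformation type, but the pairing with our keys is an assumption that needs checking, and values are passed through unchanged because the units stored in `properties` are not confirmed here. Keys with no obvious counterpart (e.g. `potassium`, `fat_calories`) are deliberately left out.

```python
# Assumed mapping: products.properties key -> schema.org NutritionInformation property.
NUTRITION_KEYS = {
    "calories": "calories",
    "carbohydrate": "carbohydrateContent",
    "cholesterol": "cholesterolContent",
    "fat": "fatContent",
    "fiber": "fiberContent",
    "protein": "proteinContent",
    "saturated_fat": "saturatedFatContent",
    "serving_size": "servingSize",
    "sodium": "sodiumContent",
    "sugars": "sugarContent",
    "trans_fat": "transFatContent",
}


def nutrition_node(properties):
    """Return a NutritionInformation node, or None if no nutrition keys are present."""
    node = {"@type": "NutritionInformation"}
    for key, target in NUTRITION_KEYS.items():
        if key in properties:
            node[target] = properties[key]
    return node if len(node) > 1 else None


print(nutrition_node({"calories": "214 kcal", "serving_size": "28 g"}))
```

Returning None for products without nutrition data keeps empty nodes out of the API response.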

ferrisoxide commented 1 year ago

Notes

Per discussions on Schema.org, repeated values are fine. This is valid Product data:

{
  "@context": "https://schema.org",
  "@type": "Product",
  "additionalProperty": [
    {
      "@type": "PropertyValue",
      "name": "myCustomProperty",
      "value": "my custom value"
    },
    {
      "@type": "PropertyValue",
      "name": "myOtherCustomProperty",
      "value": "my other custom value"
    }
  ],
  ...
}

So we can use an additionalProperty array to record anything that doesn't fit elsewhere.
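A small Python sketch of that fallback: every key without a mapping is wrapped as a PropertyValue. The function name `additional_properties`, the `mapped_keys` argument and the sample values are made up for illustration.

```python
def additional_properties(properties, mapped_keys):
    """Wrap each key that has no schema.org mapping as a PropertyValue node."""
    return [
        {"@type": "PropertyValue", "name": key, "value": value}
        for key, value in sorted(properties.items())  # sorted for stable output
        if key not in mapped_keys
    ]


# Sample data only; "calories" is treated as already mapped elsewhere.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "additionalProperty": additional_properties(
        {"calories": "214 kcal", "unit_count": "example value", "size": "another value"},
        mapped_keys={"calories"},
    ),
}
```

Sorting the keys keeps the array order stable between requests, which makes API responses easier to diff and cache.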

NB: https://validator.schema.org/ can be used to validate data.

ferrisoxide commented 1 year ago

DEV NOTE

Might also be worth getting my head around SHACL

https://www.w3.org/TR/shacl/

Also, JSON-LD Best Practices

https://w3c.github.io/json-ld-bp/#bp-summary
