dlcs / iiif-presentation

Allows for the creation and management of IIIF manifests
MIT License

investigate: what is stored in S3 #28

Open JackLewis-digirati opened 1 month ago

JackLewis-digirati commented 1 month ago

For IIIF Collections and Manifests, we store a blob of JSON in S3. This JSON is the public IIIF Presentation API version of the resource, but without the id property. It does not contain any of our "extras" like slug, parent, tags etc.

For IIIF Collections, the stored JSON includes the immediate child collections and manifests in the items property (ignore for now, we'll come back to the details of items and its relationship with the database in part 2). (note - what about the id properties of the contained items?)

When a request is made for the public hierarchical version of a resource, the platform resolves the request path to a particular IIIF resource's flat identifier using the hierarchical query developed earlier. The platform loads that JSON from S3, and inserts the id field (which will be the same as the public request URI).

When a request is made for the flat version of the URI but without the X-IIIF-CS-Show-Extras header, the response is the same as above, except that the id value inserted into the response JSON is the flat version, and we didn't need to run the hierarchical query - we know its S3 location from the request URI alone.

When a request is made for the flat version with the X-IIIF-CS-Show-Extras header, the platform loads the JSON from S3, adds in the id as above, but also adds in ALL of the additional fields defined in the documentation - all of which are derived from information in the database tables.
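The three read paths described above differ only in which id is inserted and whether the extras are merged in. A minimal Python sketch of that decoration step, assuming the S3 JSON has already been loaded into a dict; `decorate`, its parameters and the example URIs are hypothetical names for illustration, not the platform's actual code:

```python
from typing import Optional


def decorate(stored: dict, request_uri: str, extras: Optional[dict] = None) -> dict:
    """Build the response body for a stored IIIF resource.

    stored      -- JSON loaded from S3 (normally has no "id" property)
    request_uri -- the hierarchical or flat URI that was requested
    extras      -- DB-derived fields (slug, parent, tags ...); assumed to be
                   passed only when X-IIIF-CS-Show-Extras was sent
    """
    # Put id first and ignore any stray stored id, so the response id
    # always matches the URI the client asked for.
    body = {"id": request_uri}
    body.update({k: v for k, v in stored.items() if k != "id"})
    if extras:
        body.update(extras)
    return body


stored = {"type": "Collection", "label": {"en": ["Example"]}}
public = decorate(stored, "https://example.org/alpha/beta")
flat = decorate(stored, "https://example.org/collections/123")
flat_extras = decorate(stored, "https://example.org/collections/123",
                       {"slug": "beta", "tags": "demo"})
```

The stored dict is never mutated, so the same blob can serve all three variants.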

donaldgray commented 1 month ago

Protagonist implementation comparison. For info.json we store 'id' properties defaulted to a value and rewrite the values on the way out. E.g.

tomcrane commented 1 month ago

what about the id properties of the contained items?

For a IIIF Collection, information about the contained items (Manifests and Collections) and their order is stored in the database. So a very simple implementation could just store the IIIF Collection JSON in S3 with an empty "items": [] property, and populate the references entirely from the database as the JSON is decorated on the way out, using the label values and the item order, so we'd end up with:

"items": [
   { 
      "id": "dlcs.io/iiif/99/coll1/mf1",
      "type": "Manifest",
      "label": { "en": [ "label for mf1 from mf1's DB row"] }
   },
   { 
      "id": "dlcs.io/iiif/99/coll1/mf2",
      "type": "Manifest",
      "label": { "en": [ "label for mf2 from mf2's DB row"] }
   },
   { 
      "id": "dlcs.io/iiif/99/coll7",
      "type": "Collection",
      "label": { "en": [ "label for coll7 from coll7's DB row"] }
   }
]

(and maybe thumbnails too)
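A rough sketch of that "very simple implementation", where the items list is built purely from database rows. The `Row` class and its fields are hypothetical stand-ins for whatever the real table holds, and the label is simplified to a single English string:

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical DB row for a child Manifest or Collection."""
    public_id: str
    type: str
    label: str
    item_order: int


def items_from_rows(rows: list[Row]) -> list[dict]:
    """Build the IIIF "items" references from DB rows, sorted by item_order."""
    ordered = sorted(rows, key=lambda r: r.item_order)
    return [
        {"id": r.public_id, "type": r.type, "label": {"en": [r.label]}}
        for r in ordered
    ]
```

As the following comments point out, this loses anything extra a user put on the references, which is why it is not sufficient on its own.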

However...

The reason this is insufficient is that people might have gone to great lengths adding a lot more stuff to these references.

So we need to store the JSON that was submitted to preserve all this - and in fact it is this JSON that drives the ordering and containment relationships in the DB (where the referred-to items are Manifests and Collections managed by the DLCS). The server reads the received JSON and updates things like item_order in the tables (which may have gaps in the values if some things are external).
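To illustrate how the submitted JSON could drive item_order, including the gaps left by external items, here is a small sketch. The `is_managed` predicate and the host check are assumptions made for the example; how the server really recognises DLCS-managed items is not specified in this thread:

```python
from typing import Callable


def item_orders(items: list[dict], is_managed: Callable[[str], bool]) -> dict[str, int]:
    """Map each managed item id to its position in the submitted items list.

    External items keep their slot in the JSON but get no DB row,
    so the resulting order values can have gaps.
    """
    return {
        item["id"]: index
        for index, item in enumerate(items)
        if is_managed(item["id"])
    }


submitted = [
    {"id": "https://dlcs.io/iiif/99/coll1/mf1", "type": "Manifest"},
    {"id": "https://bl.uk/manuscripts/27", "type": "Collection"},
    {"id": "https://dlcs.io/iiif/99/coll1/mf2", "type": "Manifest"},
]
orders = item_orders(submitted, lambda i: i.startswith("https://dlcs.io/"))
# mf1 gets 0 and mf2 gets 2; position 1 belongs to the external item
```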

tomcrane commented 1 month ago

Possibly what is stored for nested id refs:

"id": "managed:c1",
"items": [
   { 
      "id": "managed:m1",
      "type": "Manifest",
      "label": { "en": [ ""] }
   },   
   { 
      "id": "managed:c2",
      "type": "Collection",
      "label": { "en": [ ""] }
   },
   { 
      "id": "managed:m2",
      "type": "Manifest",
      "label": { "en": [ ""] }
   },
   { 
      "id": "https://bl.uk/manuscripts/27",
      "type": "Collection",
      "label": { "en": [ ""] }
   }
]
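If ids were stored in this `managed:` form, the rewrite on the way out might look like the sketch below. `public_uri_for` is a hypothetical lookup (in practice it would presumably consult the database), and the prefix handling is only one possible reading of the proposal:

```python
from typing import Callable

MANAGED_PREFIX = "managed:"


def resolve_ids(resource: dict, public_uri_for: Callable[[str], str]) -> dict:
    """Return a copy with managed: ids replaced by public URIs.

    Ids without the prefix (external resources) are left untouched.
    """
    def resolve(value: str) -> str:
        if value.startswith(MANAGED_PREFIX):
            return public_uri_for(value[len(MANAGED_PREFIX):])
        return value

    out = dict(resource)
    if "id" in out:
        out["id"] = resolve(out["id"])
    if "items" in out:
        out["items"] = [{**item, "id": resolve(item["id"])} for item in out["items"]]
    return out
```

Only the top-level id and the immediate items are touched; anything else the user added to a reference passes through unchanged.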

But... what happens to the items property in the JSON on disk for The Lord of the Rings when I add a new volume (Manifest) to it via the UI, as a containment relationship?

stephenwf commented 1 month ago

Could this be brute-forced with an ID mapping table? So when we ingest something with an identifier we might want to map (e.g. Manifest IDs or Collection IDs), we chuck it in a simple table (idx, from_id, managed_id, customer).

That would allow you to:

Possibly relevant: https://github.com/digirati-co-uk/headless-static-site/blob/main/src/commands/build/4-emit.ts#L49-L61

We faced some similar problems on the headless static site, which takes a folder of IIIF (or IIIF urls), writes them to disk as a static IIIF repository - and joins them together with IIIF Collections.
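For illustration, the mapping table suggested earlier in this comment could be sketched with SQLite as below. The column names come from the comment; the table name, the uniqueness constraint and the helper functions are assumptions added for the example:

```python
import sqlite3
from typing import Optional


def create_mapping_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS id_map (
            idx        INTEGER PRIMARY KEY,
            from_id    TEXT    NOT NULL,
            managed_id TEXT    NOT NULL,
            customer   INTEGER NOT NULL,
            UNIQUE (customer, from_id)  -- assumption: one mapping per customer
        )
        """
    )


def remember(conn: sqlite3.Connection, customer: int, from_id: str, managed_id: str) -> None:
    """Record (or overwrite) the managed id for an ingested identifier."""
    conn.execute(
        "INSERT OR REPLACE INTO id_map (from_id, managed_id, customer) VALUES (?, ?, ?)",
        (from_id, managed_id, customer),
    )


def lookup(conn: sqlite3.Connection, customer: int, from_id: str) -> Optional[str]:
    """Return the managed id for an original identifier, or None if unmapped."""
    row = conn.execute(
        "SELECT managed_id FROM id_map WHERE customer = ? AND from_id = ?",
        (customer, from_id),
    ).fetchone()
    return row[0] if row else None
```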

p-kaczynski commented 1 month ago

For IIIF Collections and Manifests, we store a blob of JSON in S3. [My assumption: IIIF Collections and Manifests are not "Storage Collection"]

What are possible ways of this "happening"?

The latter seems to be more complex?

we need to store the JSON that was submitted to preserve all this

So if presented with a JSON as "hey, this is my collection/manifest, ingest it", we save it "as is" to S3.

server reads the received JSON and updates things like item_order in the tables

This says "updates". What is the "create" process? What do we create in DB? How much of it is already in code (i.e. is there already a process of ingestion?)


Now, the ticket specifies 3 "read" scenarios:

  1. [GET] public hierarchical version of a resource

Platform resolves the request path to a particular IIIF resource's flat identifier using the hierarchical query developed earlier

So, we get /alpha/beta/gamma and we resolve it to e.g. collection id=123.

The platform loads that JSON from S3, and inserts the id field (which will be the same as the public request URI).

id for the resource requested, so 1 (one) property is added.

[!IMPORTANT]
Can there be a property name collision? Can the initially provided and stored "as is" JSON already have an id property? What then?

  2. [GET] flat version of the URI but without the X-IIIF-CS-Show-Extras header

    Response is the same as above, except that the id value inserted into the response JSON is the flat version, and we didn't need to run the hierarchical query - we know its S3 location from the request URI alone.

Here the request is something like ???/123 -> collection 123 -> S3 -> insert ???/123 as id -> user?

  3. [GET] flat version with the X-IIIF-CS-Show-Extras header

    Platform loads the JSON from S3, adds in the id as above, but also adds in ALL of the additional fields defined in the documentation - all of which are derived from information in the database tables.

[!IMPORTANT]
Is this the call that also requires Authentication header?

[!IMPORTANT]
Same as previously: can there be a collision? How is this resolved?
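As a way of picturing the first scenario (/alpha/beta/gamma resolving to collection 123), here is an in-memory sketch of walking the path one slug at a time. The real platform uses a hierarchical database query; the `Hierarchy` dict, `resolve_path` and the ids used here are hypothetical stand-ins for that:

```python
from typing import Optional

# (parent_id, slug) -> child id; a stand-in for the hierarchy table
Hierarchy = dict[tuple[str, str], str]


def resolve_path(path: str, hierarchy: Hierarchy, root_id: str) -> Optional[str]:
    """Walk a hierarchical path from the root and return the flat id, if any."""
    current = root_id
    for slug in (segment for segment in path.split("/") if segment):
        child = hierarchy.get((current, slug))
        if child is None:
            return None  # no such child: the caller would answer 404
        current = child
    return current


hierarchy = {("root", "alpha"): "7", ("7", "beta"): "42", ("42", "gamma"): "123"}
flat_id = resolve_path("/alpha/beta/gamma", hierarchy, "root")  # "123"
```

Once the flat id is known, the remaining steps are the same as for a flat request: load the JSON from S3 and insert the id.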

JackLewis-digirati commented 1 month ago

For this, the ID is not saved into S3: it comes from the URL in PUT, and is generated (with collision tests) when using POST. We also only allow valid IIIF Collection fields to be saved (based on the iiif-net package). This means that if somebody did try to store an ID property, it would essentially be ignored and not saved in S3.

For now I've set all calls to add data to require the Authentication header, but the retrieval by flat does not (so 3 would require Auth)

Might need some confirmation on whether this behaviour is correct, but this is how it currently works.
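A sketch of the write-side behaviour described in this comment, under stated assumptions: `ALLOWED_FIELDS` is only an illustrative subset (the real list is whatever iiif-net accepts for a Collection), and `generate_id`, its retry count and the id format are hypothetical:

```python
import secrets
from typing import Callable

# Illustrative subset only; the actual allowed fields come from iiif-net.
ALLOWED_FIELDS = {"type", "label", "summary", "metadata", "thumbnail", "items"}


def strip_for_storage(submitted: dict) -> dict:
    """Keep only recognised IIIF fields; "id" is never written to S3."""
    return {k: v for k, v in submitted.items() if k in ALLOWED_FIELDS}


def generate_id(exists: Callable[[str], bool], attempts: int = 5) -> str:
    """Generate an id for POST, retrying if the candidate is already taken."""
    for _ in range(attempts):
        candidate = secrets.token_hex(6)  # 12 hex characters
        if not exists(candidate):
            return candidate
    raise RuntimeError("could not generate a unique id")
```

With PUT the id would come from the request URL instead, so only `strip_for_storage` applies there.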

p-kaczynski commented 2 weeks ago

What was done

p-kaczynski commented 2 weeks ago

Implementation details

S3StoredJsonProcessor overrides default methods: OnPropertyName, OnString and OnEndObject.

Performance


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2033)
Intel Core i7-10875H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.403
  [Host]             : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  .NET 8.0           : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  NativeAOT 8.0      : .NET 8.0.10, X64 NativeAOT AVX2

InvocationCount=1  UnrollFactor=1  
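| Method          | Job           | Runtime       | Mean      | Error     | StdDev    |
|-----------------|---------------|---------------|----------:|----------:|----------:|
| BenchmarkStj    | .NET 8.0      | .NET 8.0      |  84.78 ms |  1.672 ms |  3.705 ms |
| BenchmarkNewton | .NET 8.0      | .NET 8.0      | 357.93 ms |  6.791 ms |  7.548 ms |
| BenchmarkStj    | NativeAOT 8.0 | NativeAOT 8.0 | 115.86 ms |  2.630 ms |  7.461 ms |
| BenchmarkNewton | NativeAOT 8.0 | NativeAOT 8.0 | 550.64 ms | 10.972 ms | 21.139 ms |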

Compared the implementation above (based on System.Text.Json) to the (admittedly simpler) Newtonsoft token processing on a ~10.5MB Manifest: STJ performed 4-5x faster (about 4.2x on .NET 8.0 and 4.8x on NativeAOT, going by the means above).