dlcs / iiif-presentation

Allows for the creation and management of IIIF manifests

MIT License

investigate: what is stored in S3 #28

Open JackLewis-digirati opened 1 month ago

JackLewis-digirati commented 1 month ago

For IIIF Collections and Manifests, we store a blob of JSON in S3. This JSON is the public IIIF Presentation API version of the resource, but without the id property. It does not contain any of our "extras" like slug, parent, tags etc.

For IIIF Collections, the stored JSON includes the immediate child collections and manifests in the items property (ignore this for now; we'll come back to the details of items and its relationship with the database in part 2). (Note: what about the id properties of the contained items?)

When a request is made for the public hierarchical version of a resource, the platform resolves the request path to a particular IIIF resource's flat identifier using the hierarchical query developed earlier. The platform loads that JSON from S3, and inserts the id field (which will be the same as the public request URI).

When a request is made for the flat version of the URI but without the X-IIIF-CS-Show-Extras header, the response is the same as above, except that the id value inserted into the response JSON is the flat version, and we didn't need to run the hierarchical query - we know its S3 location from the request URI alone.

When a request is made for the flat version with the X-IIIF-CS-Show-Extras header, the platform loads the JSON from S3, adds in the id as above, but also adds in ALL of the additional fields defined in the documentation - all of which are derived from information in the database tables.
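
The three read paths above differ only in which id is inserted and whether extras are added. A minimal sketch of that decoration step follows, assuming the stored blob has already been read from S3 into a string. The helper name `Decorate`, the example.org URIs and the extras values are all made up for illustration and are not taken from the codebase.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;

// Hypothetical helper: decorates JSON already loaded from S3.
// `id` is the public (hierarchical) URI or the flat URI, depending on the request.
// `extras` is only supplied for flat requests carrying X-IIIF-CS-Show-Extras.
static string Decorate(string storedJson, string id, IReadOnlyDictionary<string, string>? extras = null)
{
    JsonObject root = JsonNode.Parse(storedJson)!.AsObject();
    root["id"] = id; // the stored blob has no id, so it is always set from the request
    if (extras != null)
    {
        foreach (var (name, value) in extras)
            root[name] = value; // in the real system these would come from database rows
    }
    return root.ToJsonString();
}

var stored = "{\"type\":\"Collection\",\"label\":{\"en\":[\"Example\"]}}";

// 1. Public hierarchical request: id is the request URI.
Console.WriteLine(Decorate(stored, "https://example.org/99/alpha/beta"));

// 2. Flat request without the header: id is the flat URI.
Console.WriteLine(Decorate(stored, "https://example.org/99/collections/abc123"));

// 3. Flat request with X-IIIF-CS-Show-Extras: id plus the database-derived fields.
var extras = new Dictionary<string, string> { ["slug"] = "beta", ["parent"] = "alpha" };
Console.WriteLine(Decorate(stored, "https://example.org/99/collections/abc123", extras));
```

This sketch parses the whole document into memory, which is simple but may not suit very large Manifests.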

The key in S3 is {customer id}/collections/{flat id} (and later the same pattern for /manifests/). Later still we will complicate this simple rule to avoid having millions of objects at the same level.
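
As a small illustration of the key rule, here is a sketch of a key builder. `BuildS3Key` is a hypothetical name, and the customer id and flat id in the usage line are made-up values.

```csharp
using System;

// Hypothetical helper: builds the S3 key "{customer id}/{collections|manifests}/{flat id}".
static string BuildS3Key(int customerId, string resourceType, string flatId)
{
    if (resourceType != "collections" && resourceType != "manifests")
        throw new ArgumentException("Expected 'collections' or 'manifests'", nameof(resourceType));
    return $"{customerId}/{resourceType}/{flatId}";
}

Console.WriteLine(BuildS3Key(99, "collections", "abc123")); // prints 99/collections/abc123
```

Keeping the rule in one function like this would make the later change (spreading objects across more levels) a single-point edit.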

donaldgray commented 1 month ago

Protagonist implementation comparison. For info.json we store 'id' properties defaulted to a value and rewrite the values on the way out. E.g.

Setting default value when saving to s3: https://github.com/dlcs/protagonist/blob/91633e0a8b1d23c34f9ebeb3b1970682c8ef115b/src/protagonist/Orchestrator/Features/Images/ImageServer/InfoJson3Constructor.cs#L53
Writing correct value on way out: https://github.com/dlcs/protagonist/blob/910edabba073406761b1e9e699758d4dd1f369a8/src/protagonist/Orchestrator/Features/Images/Requests/GetImageInfoJson.cs#L143

tomcrane commented 1 month ago

what about the id properties of the contained items?

For a IIIF Collection, information about the contained items (Manifests and Collections) and their order is stored in the database. So a very simple implementation could just store the IIIF Collection JSON in S3 with an empty "items": [] property, and populate the references entirely from the database as the JSON is decorated on the way out, using the label values and the item order, so we'd end up with:

"items": [
   { 
      "id": "dlcs.io/iiif/99/coll1/mf1",
      "type": "Manifest",
      "label": { "en": [ "label for mf1 from mf1's DB row"] }
   },
   { 
      "id": "dlcs.io/iiif/99/coll1/mf2",
      "type": "Manifest",
      "label": { "en": [ "label for mf2 from mf2's DB row"] }
   },
   { 
      "id": "dlcs.io/iiif/99/coll7",
      "type": "Collection",
      "label": { "en": [ "label for coll7 from coll7's DB row"] }
   }
]

(and maybe thumbnails too)

However...

The reason this is insufficient is that people might have gone to great lengths adding a lot more stuff to these references.

The references may have different label values in the Collection JSON than they do in their referent Manifest or Collections (e.g., to help drive navigation)
They may have seeAlso and other linking properties, or annotations, or indeed anything - things that might also be on the full Manifests or Collections, but might only be on these references
The references might be a mixture of DLCS-managed Manifests or Collections, and external ones

So we need to store the JSON that was submitted to preserve all this - and in fact it is this JSON that drives the ordering and containment relationships in the DB (where the referred-to items are Manifests and Collections managed by the DLCS). The server reads the received JSON and updates things like item_order in the tables (which may have gaps in the values if some things are external).
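
To make the "gaps" point concrete, here is a minimal sketch of deriving item_order values from submitted JSON. It assumes that managed items can be recognised by a URL prefix and that item_order is simply the zero-based position in items; both are assumptions for illustration, as are the name `ManagedItemOrders` and the dlcs.example host.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: returns (id, item_order) for managed items only.
// External items still consume a position, which is what leaves gaps.
static List<(string Id, int ItemOrder)> ManagedItemOrders(string collectionJson, string managedPrefix)
{
    var result = new List<(string Id, int ItemOrder)>();
    using var doc = JsonDocument.Parse(collectionJson);
    if (!doc.RootElement.TryGetProperty("items", out var items)) return result;

    var position = 0;
    foreach (var item in items.EnumerateArray())
    {
        if (item.TryGetProperty("id", out var idElement) && idElement.ValueKind == JsonValueKind.String)
        {
            var id = idElement.GetString()!;
            if (id.StartsWith(managedPrefix, StringComparison.Ordinal))
                result.Add((id, position));
        }
        position++; // advance for every item, managed or external
    }
    return result;
}

var submitted = "{\"type\":\"Collection\",\"items\":["
    + "{\"id\":\"https://dlcs.example/99/manifests/m1\",\"type\":\"Manifest\"},"
    + "{\"id\":\"https://bl.uk/manuscripts/27\",\"type\":\"Collection\"},"
    + "{\"id\":\"https://dlcs.example/99/manifests/m2\",\"type\":\"Manifest\"}]}";

foreach (var (id, order) in ManagedItemOrders(submitted, "https://dlcs.example/99/"))
    Console.WriteLine($"{id} -> item_order {order}");
```

With this input, m1 gets item_order 0 and m2 gets item_order 2; position 1 belongs to the external bl.uk Collection and has no database row.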

tomcrane commented 1 month ago

Possibly what is stored for nested id refs:

"id": "managed:c1",
"items": [
   { 
      "id": "managed:m1",
      "type": "Manifest",
      "label": { "en": [ ""] }
   },   
   { 
      "id": "managed:c2",
      "type": "Collection",
      "label": { "en": [ ""] }
   },
   { 
      "id": "managed:m2",
      "type": "Manifest",
      "label": { "en": [ ""] }
   },
   { 
      "id": "https://bl.uk/manuscripts/27",
      "type": "Collection",
      "label": { "en": [ ""] }
   }
]

But... what happens to the JSON on disk items property of The Lord of the Rings when I add a new volume (Manifest) to it via the UI - as a containment relationship?

stephenwf commented 1 month ago

Could this be brute-forced with an ID mapping table? When we ingest something with an identifier we might want to map (e.g. Manifest IDs or Collection IDs), we chuck it in a simple table (idx, from_id, managed_id, customer).

That would allow you to:

Let the users potentially resolve conflicts
Rewrite on disk to "managed ids" without losing the original ID
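
A rough sketch of the mapping-table idea, using an in-memory list as a stand-in for a real database table. The column names follow the comment above; the function names and the "managed:N" id scheme are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-in for the proposed table (idx, from_id, managed_id, customer).
var table = new List<(int Idx, string FromId, string ManagedId, int Customer)>();

// Returns the managed id for an incoming id, adding a row the first time it is seen.
string ToManagedId(int customer, string fromId)
{
    foreach (var row in table)
        if (row.Customer == customer && row.FromId == fromId)
            return row.ManagedId;

    var managedId = $"managed:{table.Count + 1}"; // placeholder id scheme
    table.Add((table.Count + 1, fromId, managedId, customer));
    return managedId;
}

// Recovers the id as originally submitted, so rewriting on disk loses nothing.
string? ToOriginalId(int customer, string managedId)
{
    foreach (var row in table)
        if (row.Customer == customer && row.ManagedId == managedId)
            return row.FromId;
    return null;
}

var managed = ToManagedId(99, "https://example.org/iiif/manifest-1");
Console.WriteLine(managed);
Console.WriteLine(ToOriginalId(99, managed));
```

Ingesting the same from_id twice for the same customer returns the existing row, which is also the natural place to surface a conflict to the user.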

Possibly relevant: https://github.com/digirati-co-uk/headless-static-site/blob/main/src/commands/build/4-emit.ts#L49-L61

We faced some similar problems on the headless static site, which takes a folder of IIIF (or IIIF URLs), writes them to disk as a static IIIF repository, and joins them together with IIIF Collections.

p-kaczynski commented 1 month ago

For IIIF Collections and Manifests, we store a blob of JSON in S3. [My assumption: IIIF Collections and Manifests are not "Storage Collection"]

What are possible ways of this "happening"?

Create via "create new" button [or other interaction "from nothing"]
Import of existing JSON that represents some collection or manifest?

The latter seems to be more complex?

we need to store the JSON that was submitted to preserve all this

So if presented with a JSON as "hey, this is my collection/manifest, ingest it", we save it "as is" to S3.

server reads the received JSON and updates things like item_order in the tables

This says "updates". What is the "create" process? What do we create in DB? How much of it is already in code (i.e. is there already a process of ingestion?)

Now, the ticket specifies 3 "read" scenarios:

[GET] public hierarchical version of a resource

Platform resolves the request path to a particular IIIF resource's flat identifier using the hierarchical query developed earlier

So, we get /alpha/beta/gamma and we resolve it to e.g. collection id=123.

The platform loads that JSON from S3, and inserts the id field (which will be the same as the public request URI).

id for the resource requested, so 1 (one) property is added.

[!IMPORTANT]
Can there be a property name collision? Can the initially provided and stored "as is" JSON already have an id property? What then?

[GET] flat version of the URI but without the X-IIIF-CS-Show-Extras header

Response is the same as above, except that the id value inserted into the response JSON is the flat version, and we didn't need to run the hierarchical query - we know its S3 location from the request URI alone.

Here the request is something like ???/123 -> collection 123 -> S3 -> insert ???/123 as id -> User?

[GET] flat version with the X-IIIF-CS-Show-Extras header

Platform loads the JSON from S3, adds in the id as above, but also adds in ALL of the additional fields defined in the documentation - all of which are derived from information in the database tables.

[!IMPORTANT]
Is this the call that also requires Authentication header?

[!IMPORTANT]
Same as previously: can there be a collision? How is this resolved?

JackLewis-digirati commented 1 month ago

For this, the ID is not saved into S3: it comes from the URL in PUT, and is generated (with collision tests) when using POST. We also only allow valid IIIF Collection fields to be saved (based on the iiif-net package), which means that if somebody did try to store an ID property, it would essentially be ignored and not saved in S3.

For now I've set all calls to add data to require the Authentication header, but the retrieval by flat does not (so 3 would require Auth)

Might need some confirmation on whether this behaviour is correct, but this is how it currently works.

p-kaczynski commented 2 weeks ago

What was done

New static class API.Converters.Streaming.StreamingJsonProcessor
- Exposes static void ProcessJson(Stream input, Stream output, long? inputLength, IProcessJson implementation)
- Provides reusable high-performance UTF-8 JSON stream processing
- DOES NOT validate on write, however
- DOES, by necessity, validate on read (Utf8JsonReader).
- ergo, any changes by IProcessJson MUST be valid JSON
New abstract class StreamingProcessorImplBase<T> : IProcessJson
- Implements IProcessJson.OnToken and calls default virtual implementations that, if not overridden, will rewrite the JSON to the output stream
New class S3StoredJsonProcessor(string requestSlug) : StreamingProcessorImplBase<S3StoredJsonProcessor.S3ProcessorCustomState>
- Uses the aforementioned "generic" processing ability with a specific goal: add/change the top-level "id" property to the "requestSlug" provided

Hooked this new code in the S3 JSON read:

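if (!objectFromS3.Stream.IsNull())
{
  using var memoryStream = new MemoryStream();
  // Rewrite the stored JSON into the buffer, setting the top-level "id" to the request slug
  StreamingJsonProcessor.ProcessJson(objectFromS3.Stream, memoryStream,
      objectFromS3.Headers.ContentLength, new S3StoredJsonProcessor(request.Slug));
  // Rewind before reading (harmless if ProcessJson already rewinds the output stream)
  memoryStream.Position = 0;
  using var reader = new StreamReader(memoryStream);
  collectionFromS3 = await reader.ReadToEndAsync(cancellationToken);
}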

p-kaczynski commented 2 weeks ago

Implementation details

S3StoredJsonProcessor overrides default methods: OnPropertyName, OnString and OnEndObject.

OnPropertyName keeps track of the "current" property being processed.
OnString is called when a string property value is being processed. If reader.CurrentDepth is 1, meaning we are in the top-level object, and the current property is "id", we rewrite it as per requirements.
OnEndObject: if reader.CurrentDepth is 0, we have just finished reading the top-level object. We check a flag to see if the "id" was REWRITTEN. If not, we add the "id" property with the requestSlug value BEFORE writing the end of the object.
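
The three overrides can be approximated in one self-contained loop. The following is a simplified sketch and not the StreamingJsonProcessor code: it buffers the whole document in memory rather than streaming it in chunks, and the name `SetTopLevelId` is made up. It uses only Utf8JsonReader and Utf8JsonWriter from System.Text.Json.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;

// Sketch: copy JSON token by token, rewriting or appending the top-level "id".
static string SetTopLevelId(string json, string newId)
{
    var reader = new Utf8JsonReader(Encoding.UTF8.GetBytes(json));
    using var output = new MemoryStream();
    using (var writer = new Utf8JsonWriter(output))
    {
        string? currentProperty = null; // last property name seen (the OnPropertyName role)
        var idRewritten = false;

        while (reader.Read())
        {
            switch (reader.TokenType)
            {
                case JsonTokenType.StartObject: writer.WriteStartObject(); break;
                case JsonTokenType.StartArray: writer.WriteStartArray(); break;
                case JsonTokenType.EndArray: writer.WriteEndArray(); break;
                case JsonTokenType.PropertyName:
                    currentProperty = reader.GetString();
                    writer.WritePropertyName(currentProperty!);
                    break;
                // Depth 1 means a direct member of the top-level object (the OnString role).
                case JsonTokenType.String when reader.CurrentDepth == 1 && currentProperty == "id":
                    writer.WriteStringValue(newId);
                    idRewritten = true;
                    break;
                case JsonTokenType.String:
                    writer.WriteStringValue(reader.GetString());
                    break;
                case JsonTokenType.EndObject:
                    // Depth 0 means the top-level object is closing (the OnEndObject role).
                    if (reader.CurrentDepth == 0 && !idRewritten)
                        writer.WriteString("id", newId);
                    writer.WriteEndObject();
                    break;
                case JsonTokenType.True:
                case JsonTokenType.False:
                    writer.WriteBooleanValue(reader.GetBoolean());
                    break;
                case JsonTokenType.Null: writer.WriteNullValue(); break;
                case JsonTokenType.Number:
                    // Copy the number's original bytes so its formatting is preserved.
                    writer.WriteRawValue(reader.ValueSpan, skipInputValidation: true);
                    break;
            }
        }
    }
    return Encoding.UTF8.GetString(output.ToArray());
}

Console.WriteLine(SetTopLevelId("{\"id\":\"old\",\"items\":[{\"id\":\"nested\"}]}", "new-id"));
Console.WriteLine(SetTopLevelId("{\"type\":\"Manifest\"}", "new-id"));
```

In the first call only the top-level id changes and the nested id is left alone; in the second, the id is appended just before the closing brace. Note that a non-string top-level id would not be rewritten by this sketch, so a second id would be appended.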

Performance


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2033)
Intel Core i7-10875H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.403
  [Host]             : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  .NET 8.0           : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  NativeAOT 8.0      : .NET 8.0.10, X64 NativeAOT AVX2

InvocationCount=1  UnrollFactor=1

| Method | Job | Runtime | Mean | Error | StdDev |
|---|---|---|---|---|---|
| BenchmarkStj | .NET 8.0 | .NET 8.0 | 84.78 ms | 1.672 ms | 3.705 ms |
| BenchmarkNewton | .NET 8.0 | .NET 8.0 | 357.93 ms | 6.791 ms | 7.548 ms |
| BenchmarkStj | NativeAOT 8.0 | NativeAOT 8.0 | 115.86 ms | 2.630 ms | 7.461 ms |
| BenchmarkNewton | NativeAOT 8.0 | NativeAOT 8.0 | 550.64 ms | 10.972 ms | 21.139 ms |

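Compared the implementation above (based on System.Text.Json) to the (admittedly simpler) Newtonsoft token processing on a ~10.5MB Manifest. By mean time, STJ was about 4.2x faster on .NET 8.0 (357.93 ms vs 84.78 ms) and about 4.8x faster on NativeAOT (550.64 ms vs 115.86 ms).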