Open JackLewis-digirati opened 1 month ago
Protagonist implementation comparison. For info.json we store 'id' properties defaulted to a value and rewrite the values on the way out. E.g.
what about the id properties of the contained items?
For a IIIF Collection, information about the contained items (Manifests and Collections) and their order is stored in the database.
So a very simple implementation could just store the IIIF Collection JSON in S3 with an empty `"items": []` property, and populate the references entirely from the database as the JSON is decorated on the way out, using the label values and the item order, so we'd end up with:
"items": [
  {
    "id": "dlcs.io/iiif/99/coll1/mf1",
    "type": "Manifest",
    "label": { "en": [ "label for mf1 from mf1's DB row"] }
  },
  {
    "id": "dlcs.io/iiif/99/coll1/mf2",
    "type": "Manifest",
    "label": { "en": [ "label for mf2 from mf2's DB row"] }
  },
  {
    "id": "dlcs.io/iiif/99/coll7",
    "type": "Collection",
    "label": { "en": [ "label for coll7 from coll7's DB row"] }
  }
]
(and maybe thumbnails too)
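A minimal sketch of that "decorate on the way out" idea, using `System.Text.Json.Nodes`. The row shape (`Id`, `Type`, `Label`, `ItemOrder`) and the `Decorate` helper are assumptions made up for illustration, not the actual schema or code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json.Nodes;

// Hypothetical child rows as they might come back from the database.
var rows = new List<(string Id, string Type, string Label, int ItemOrder)>
{
    ("dlcs.io/iiif/99/coll7", "Collection", "label for coll7 from coll7's DB row", 2),
    ("dlcs.io/iiif/99/coll1/mf1", "Manifest", "label for mf1 from mf1's DB row", 0),
    ("dlcs.io/iiif/99/coll1/mf2", "Manifest", "label for mf2 from mf2's DB row", 1),
};

// Stored JSON with an empty "items" array, as in the very simple approach.
var stored = JsonNode.Parse("""{ "type": "Collection", "items": [] }""")!.AsObject();

var decorated = Decorate(stored, rows);
Console.WriteLine(decorated.ToJsonString());

// Rebuilds "items" purely from the rows, ordered by ItemOrder.
static JsonObject Decorate(
    JsonObject collection,
    IEnumerable<(string Id, string Type, string Label, int ItemOrder)> children)
{
    var items = new JsonArray();
    foreach (var child in children.OrderBy(c => c.ItemOrder))
    {
        items.Add(new JsonObject
        {
            ["id"] = child.Id,
            ["type"] = child.Type,
            ["label"] = new JsonObject { ["en"] = new JsonArray(JsonValue.Create(child.Label)) },
        });
    }
    collection["items"] = items; // replaces the empty array
    return collection;
}
```

Anything the submitter put on an item reference beyond these three fields is lost here, since only the database columns reach the output.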
However...
The reason this is insufficient is that people might have gone to great lengths adding a lot more stuff to these references. They might have different `label` values in the Collection JSON than they do in their referent Manifest or Collections (e.g., to help drive navigation).
So we need to store the JSON that was submitted to preserve all this - and in fact it is this JSON that drives the ordering and containment relationships in the DB (where the referred-to items are Manifests and Collections managed by the DLCS). The server reads the received JSON and updates things like `item_order` in the tables (which may have gaps in the values if some things are external).
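To illustrate how gaps in `item_order` could arise, here is a rough sketch that walks the submitted `items` array and keeps the array position only for managed items. How a managed item is recognised is an assumption (a hypothetical `https://dlcs.io/` prefix), as is the `ManagedItemOrder` helper:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;

var submitted = JsonNode.Parse("""
{
  "items": [
    { "id": "https://dlcs.io/iiif/99/coll1/mf1", "type": "Manifest" },
    { "id": "https://bl.uk/manuscripts/27", "type": "Collection" },
    { "id": "https://dlcs.io/iiif/99/coll7", "type": "Collection" }
  ]
}
""")!.AsObject();

foreach (var (itemId, position) in ManagedItemOrder(submitted, "https://dlcs.io/"))
{
    Console.WriteLine($"{position}: {itemId}");
}

// Returns (id, position) for managed items only; external items are skipped,
// so the positions that would be written to item_order can have gaps.
static List<(string Id, int ItemOrder)> ManagedItemOrder(JsonObject collection, string managedPrefix)
{
    var result = new List<(string Id, int ItemOrder)>();
    if (collection["items"] is not JsonArray items) return result;

    for (var i = 0; i < items.Count; i++)
    {
        if (items[i] is JsonObject item
            && item["id"] is JsonValue idValue
            && idValue.TryGetValue<string>(out var id)
            && id.StartsWith(managedPrefix, StringComparison.Ordinal))
        {
            result.Add((id, i)); // keep the position from the submitted array
        }
    }
    return result;
}
```

With this input the managed items get positions 0 and 2; position 1 belongs to the external `bl.uk` Collection and never reaches the table.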
Possibly what is stored for nested `id` refs:
"id": "managed:c1",
"items": [
  {
    "id": "managed:m1",
    "type": "Manifest",
    "label": { "en": [ ""] }
  },
  {
    "id": "managed:c2",
    "type": "Collection",
    "label": { "en": [ ""] }
  },
  {
    "id": "managed:m2",
    "type": "Manifest",
    "label": { "en": [ ""] }
  },
  {
    "id": "https://bl.uk/manuscripts/27",
    "type": "Collection",
    "label": { "en": [ ""] }
  }
]
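If `managed:` placeholders like these were stored, the response side would need to swap them for public URIs while leaving external ids alone. A sketch under that assumption; the `RewriteManagedIds` helper, the lookup dictionary, and the `https://dlcs.io/...` URIs are all hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;

var stored = JsonNode.Parse("""
{
  "id": "managed:c1",
  "items": [
    { "id": "managed:m1", "type": "Manifest" },
    { "id": "https://bl.uk/manuscripts/27", "type": "Collection" }
  ]
}
""")!.AsObject();

// Hypothetical lookup from managed key to public URI (would come from the DB).
var publicUris = new Dictionary<string, string>
{
    ["c1"] = "https://dlcs.io/iiif/99/coll1",
    ["m1"] = "https://dlcs.io/iiif/99/coll1/mf1",
};

RewriteManagedIds(stored, publicUris);
Console.WriteLine(stored.ToJsonString());

// Rewrites the top-level id and the id of each direct item in place.
static void RewriteManagedIds(JsonObject resource, IReadOnlyDictionary<string, string> uris)
{
    const string prefix = "managed:";

    // External ids, and managed ids with no mapping, are returned unchanged.
    string Resolve(string id) =>
        id.StartsWith(prefix, StringComparison.Ordinal)
        && uris.TryGetValue(id[prefix.Length..], out var uri) ? uri : id;

    if (resource["id"] is JsonValue topId && topId.TryGetValue<string>(out var top))
    {
        resource["id"] = Resolve(top);
    }

    if (resource["items"] is JsonArray items)
    {
        foreach (var node in items)
        {
            if (node is JsonObject item
                && item["id"] is JsonValue itemId
                && itemId.TryGetValue<string>(out var current))
            {
                item["id"] = Resolve(current);
            }
        }
    }
}
```

This only touches the direct `items`; labels and any other properties on the references pass through as submitted.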
But... what happens to the `items` property of the JSON on disk for The Lord of the Rings when I add a new volume (Manifest) to it via the UI - as a containment relationship?
Could this be brute-forced with an ID mapping table? So when we ingest something with an identifier we might want to map (e.g. Manifest IDs or Collection IDs) we chuck it in a simple table (idx, from_id, managed_id, customer).
That would allow you to:
Possibly relevant: https://github.com/digirati-co-uk/headless-static-site/blob/main/src/commands/build/4-emit.ts#L49-L61
We faced some similar problems on the headless static site, which takes a folder of IIIF (or IIIF URLs), writes them to disk as a static IIIF repository - and joins them together with IIIF Collections.
For IIIF Collections and Manifests, we store a blob of JSON in S3. [My assumption: IIIF Collections and Manifests are not "Storage Collection"]
What are possible ways of this "happening"?
The latter seems to be more complex?
we need to store the JSON that was submitted to preserve all this
So if presented with a JSON as "hey, this is my collection/manifest, ingest it", we save it "as is" to S3.
server reads the received JSON and updates things like item_order in the tables
This says "updates". What is the "create" process? What do we create in DB? How much of it is already in code (i.e. is there already a process of ingestion?)
Now, the ticket specifies 3 "read" scenarios:
Platform resolves the request path to a particular IIIF resource's flat identifier using the hierarchical query developed earlier
So, we get `/alpha/beta/gamma` and we resolve it to e.g. collection `id=123`.
The platform loads that JSON from S3, and inserts the id field (which will be the same as the public request URI).
Only the `id` for the resource requested is inserted, so 1 (one) property is added.
[!IMPORTANT]
Can there be a property name collision? Can the initially provided and stored "as is" JSON already have an `id` property? What then?
Response is the same as above, except that the id value inserted into the response JSON is the flat version, and we didn't need to run the hierarchical query - we know its S3 location from the request URI alone.
Here the request is something like `???/123` -> collection `123` -> S3 -> insert the `???/123` as `id` -> User?
Platform loads the JSON from S3, adds in the id as above, but also adds in ALL of the additional fields defined in the documentation - all of which are derived from information in the database tables.
[!IMPORTANT]
Is this the call that also requires the `Authentication` header?
[!IMPORTANT]
Same as previously: can there be a collision? How is this resolved?
For this, the ID is not saved into S3: it comes from the URL in PUT, and is generated (with collision tests) when using POST. We also only allow valid IIIF Collection fields to be saved (based on the iiif-net package). This means that if somebody did try to store an ID property, it would essentially be ignored and not saved in S3.
For now I've set all calls that add data to require the `Authentication` header, but the retrieval by flat does not (so 3 would require Auth).
Might need some confirmation on whether this behaviour is correct, but this is how it currently works.
static class `API.Converters.Streaming.StreamingJsonProcessor`
`static void ProcessJson(Stream input, Stream output, long? inputLength, IProcessJson implementation)` reads the input (with `Utf8JsonReader`) and hands it to the `IProcessJson` implementation. The input MUST be valid JSON.
abstract class `StreamingProcessorImplBase<T> : IProcessJson` implements `IProcessJson.OnToken` and calls default virtual implementations that, if not overridden, will rewrite the JSON to the output stream.
class `S3StoredJsonProcessor(string requestSlug) : StreamingProcessorImplBase<S3StoredJsonProcessor.S3ProcessorCustomState>` adds/changes the top-level "id" property to the `requestSlug` provided.
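if (!objectFromS3.Stream.IsNull())
{
    using var memoryStream = new MemoryStream();
    using var reader = new StreamReader(memoryStream);
    // Rewrite the stored JSON into memoryStream, setting the top-level "id" to the request slug.
    StreamingJsonProcessor.ProcessJson(objectFromS3.Stream, memoryStream,
        objectFromS3.Headers.ContentLength, new S3StoredJsonProcessor(request.Slug));
    // Rewind before reading back; harmless if ProcessJson already resets the position.
    memoryStream.Position = 0;
    collectionFromS3 = await reader.ReadToEndAsync(cancellationToken);
}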
`S3StoredJsonProcessor` overrides default methods: `OnPropertyName`, `OnString` and `OnEndObject`.
`OnPropertyName` keeps track of the "current" property being processed.
`OnString` is called when a string property value is being processed. If `reader.CurrentDepth` is `1`, meaning we are in the top-level object, and the current property is "id", we rewrite it as per requirements.
`OnEndObject`: if `reader.CurrentDepth` is `0`, we have just finished reading in the top-level object. We check a flag to see if the "id" was REWRITTEN. If not, then we add the "id" property with the `requestSlug` value BEFORE writing the end of the object.
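The same three rules can be shown in a standalone sketch that uses `Utf8JsonReader` and `Utf8JsonWriter` directly. This is not the `S3StoredJsonProcessor` code: it works on an in-memory buffer rather than streams, and `SetTopLevelId` and `CopyToken` are names invented for the example:

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;

var input = Encoding.UTF8.GetBytes("""{"type":"Collection","items":[{"id":"managed:m1"}]}""");
var result = Encoding.UTF8.GetString(SetTopLevelId(input, "https://dlcs.io/iiif/99/coll1"));
Console.WriteLine(result);

// Copies the JSON token by token, replacing or appending the top-level "id".
static byte[] SetTopLevelId(ReadOnlySpan<byte> utf8Json, string requestSlug)
{
    var reader = new Utf8JsonReader(utf8Json);
    using var output = new MemoryStream();
    using (var writer = new Utf8JsonWriter(output))
    {
        string? currentProperty = null;
        var idRewritten = false;

        while (reader.Read())
        {
            switch (reader.TokenType)
            {
                case JsonTokenType.PropertyName:
                    currentProperty = reader.GetString();
                    writer.WritePropertyName(currentProperty!);
                    break;
                // Depth 1 means a direct member of the top-level object.
                case JsonTokenType.String when reader.CurrentDepth == 1 && currentProperty == "id":
                    writer.WriteStringValue(requestSlug);
                    idRewritten = true;
                    break;
                // Depth 0 on EndObject means the top-level object is closing.
                case JsonTokenType.EndObject when reader.CurrentDepth == 0:
                    if (!idRewritten) writer.WriteString("id", requestSlug);
                    writer.WriteEndObject();
                    break;
                default:
                    CopyToken(ref reader, writer);
                    break;
            }
        }
    } // disposing the writer flushes it to the stream
    return output.ToArray();
}

// Writes the current token to the output unchanged.
static void CopyToken(ref Utf8JsonReader reader, Utf8JsonWriter writer)
{
    switch (reader.TokenType)
    {
        case JsonTokenType.StartObject: writer.WriteStartObject(); break;
        case JsonTokenType.EndObject: writer.WriteEndObject(); break;
        case JsonTokenType.StartArray: writer.WriteStartArray(); break;
        case JsonTokenType.EndArray: writer.WriteEndArray(); break;
        case JsonTokenType.String: writer.WriteStringValue(reader.GetString()); break;
        case JsonTokenType.Number: writer.WriteRawValue(reader.ValueSpan); break;
        case JsonTokenType.True: writer.WriteBooleanValue(true); break;
        case JsonTokenType.False: writer.WriteBooleanValue(false); break;
        case JsonTokenType.Null: writer.WriteNullValue(); break;
    }
}
```

Nested `id` values (for example inside `items`) sit at depth 2 or deeper, so they are copied as-is. A top-level `id` that is not a string would not match the string rule, and a second `id` would then be appended at the end.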
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2033)
Intel Core i7-10875H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.403
[Host] : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
.NET 8.0 : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
NativeAOT 8.0 : .NET 8.0.10, X64 NativeAOT AVX2
InvocationCount=1 UnrollFactor=1
| Method | Job | Runtime | Mean | Error | StdDev |
|---|---|---|---|---|---|
| BenchmarkStj | .NET 8.0 | .NET 8.0 | 84.78 ms | 1.672 ms | 3.705 ms |
| BenchmarkNewton | .NET 8.0 | .NET 8.0 | 357.93 ms | 6.791 ms | 7.548 ms |
| BenchmarkStj | NativeAOT 8.0 | NativeAOT 8.0 | 115.86 ms | 2.630 ms | 7.461 ms |
| BenchmarkNewton | NativeAOT 8.0 | NativeAOT 8.0 | 550.64 ms | 10.972 ms | 21.139 ms |
Compared the implementation above (based on `System.Text.Json`) to the (admittedly simpler) Newtonsoft token processing on a ~10.5MB Manifest, and STJ performed 4-5x faster.
For IIIF Collections and Manifests, we store a blob of JSON in S3. This JSON is the public IIIF Presentation API version of the resource, but without the `id` property. It does not contain any of our "extras" like slug, parent, tags etc.
For IIIF Collections, the stored JSON includes the immediate child collections and manifests in the `items` property (ignore for now, we'll come back to the details of `items` and its relationship with the database in part 2). (note - what about the `id` properties of the contained `items`?)
When a request is made for the public hierarchical version of a resource, the platform resolves the request path to a particular IIIF resource's flat identifier using the hierarchical query developed earlier. The platform loads that JSON from S3, and inserts the `id` field (which will be the same as the public request URI).
When a request is made for the flat version of the URI but without the `X-IIIF-CS-Show-Extras` header, the response is the same as above, except that the `id` value inserted into the response JSON is the flat version, and we didn't need to run the hierarchical query - we know its S3 location from the request URI alone.
When a request is made for the flat version with the `X-IIIF-CS-Show-Extras` header, the platform loads the JSON from S3, adds in the `id` as above, but also adds in ALL of the additional fields defined in the documentation - all of which are derived from information in the database tables.
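A compact sketch of how the three read scenarios differ once the S3 JSON has been located. The `BuildResponse` helper, the example URIs, and the extras values are assumptions for illustration; the real extras are whatever the documentation defines:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;

// Stored blob: public IIIF JSON without "id" and without extras.
const string storedJson = """{ "type": "Collection", "items": [] }""";

// Hypothetical extras derived from database tables.
var extras = new Dictionary<string, string> { ["slug"] = "gamma", ["parent"] = "beta" };

// 1. Hierarchical request: id is the public request URI.
var hierarchical = BuildResponse(storedJson, "https://example.org/alpha/beta/gamma", false, false, extras);
// 2. Flat request without the header: id is the flat URI, nothing else added.
var flat = BuildResponse(storedJson, "https://example.org/collections/123", true, false, extras);
// 3. Flat request with X-IIIF-CS-Show-Extras: id plus the extra fields.
var flatWithExtras = BuildResponse(storedJson, "https://example.org/collections/123", true, true, extras);

Console.WriteLine(hierarchical.ToJsonString());
Console.WriteLine(flat.ToJsonString());
Console.WriteLine(flatWithExtras.ToJsonString());

static JsonObject BuildResponse(
    string json,
    string requestUri,
    bool isFlatRequest,
    bool showExtrasHeader,
    IReadOnlyDictionary<string, string> extrasFromDb)
{
    var resource = JsonNode.Parse(json)!.AsObject();
    resource["id"] = requestUri; // always the URI that was requested

    // Extras are only added on the flat URI and only when the header is present.
    if (isFlatRequest && showExtrasHeader)
    {
        foreach (var (name, value) in extrasFromDb)
        {
            resource[name] = value;
        }
    }
    return resource;
}
```

In all three cases the inserted `id` is just the requested URI; the hierarchical case differs only in how the S3 location is found, which this sketch leaves out.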