crkn-rcdr / Access-Platform

Documentation, specifications, and APIs for the Canadiana Access Platform.
https://www.canadiana.ca/

Database design discussion #9

Closed: RussellMcOrmond closed this issue 4 years ago.

RussellMcOrmond commented 4 years ago

Discussion towards documentation in https://github.com/crkn-rcdr/Access-Platform/tree/master/Databases

Overall discussion can start here, and then refinements can happen in PRs.

RussellMcOrmond commented 4 years ago

Currently we have `dipstaging`, and I created #10 to document what the existing tools and update documents already fill in.

We have a design decision to make right away, which is whether we will be supporting AIP updates.

For microservices I've been using.

If we decide to disallow processing of AIP updates, then we might not need to store the manifest date, although it is still useful information to store. Any noids created would only be created once.

If we want to handle updates we have some decisions:

This may be a question of timing. If the tools to allow staff to do all the things that normally require an AIP update are in place when we do the final non-test run of Smelter, then we have no reason to support AIP updates of either type.

If we want to support updates we need to store within dipstaging (whose namespace is AIP IDs) all the noids, such that any deduplication and/or noid copying can be done. This could be a hash keyed by the date or the MD5 of the METS file.
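To make that concrete, here is a minimal sketch of what such a dipstaging document could look like, with noids keyed by the MD5 of the METS file. Every field name and identifier below is an illustrative assumption, not the actual schema:

```javascript
// Hypothetical dipstaging document: one per AIP, recording the noids
// minted for each METS revision so reprocessing an update can reuse them.
// All field names and identifiers are illustrative assumptions.
const dipstagingDoc = {
  _id: "oocihm.8_06941", // AIP id (the dipstaging namespace)
  manifests: {
    // keyed by MD5 of the METS file for that revision
    "9e107d9d372bb6826bd81d3542a419d6": {
      manifestDate: "2020-07-01T12:00:00Z",
      manifestNoid: "69429/m0000000001",
      canvasNoids: ["69429/c0000000001", "69429/c0000000002"],
    },
  },
};

// Deduplication check: return the noid already minted for this METS
// revision, or null if this revision has never been processed.
function existingManifestNoid(doc, metsMd5) {
  const entry = doc.manifests[metsMd5];
  return entry ? entry.manifestNoid : null;
}
```

On an AIP update, a processing tool would call a check like `existingManifestNoid()` before minting anything new, so noids are copied rather than duplicated.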

RussellMcOrmond commented 4 years ago

While a 'noid' is globally unique, there has been some discussion about separating 'collections', 'manifests' and 'canvasses' into separate CouchDB databases.

There are many advantages as far as the speed of calculating views is concerned, if we expect never to want to create a view that would cross these types. Do we know this for certain at this stage?

This would effectively be creating 3 new databases to replace the functionality of 'internalmeta'.

The question about OCR data may have a short-term answer and a long-term one.

If we decide to split 'internalmeta' into 3, we then need to decide if we are doing the same with 'copresentation' and 'cosearch' (and the associated Solr cores -- the solrstream tool is dumb, so doesn't care as long as there is a matching CouchDB database for every Solr core).

SaschaAdler commented 4 years ago

> While a 'noid' is globally unique, there has been some discussion about separating 'collections', 'manifests' and 'canvasses' into separate CouchDB databases.
>
> There are many advantages as far as the speed of calculating views is concerned, if we expect never to want to create a view that would cross these types. Do we know this for certain at this stage?

I want to say yes. The only time we effectively merge collections and manifests is in Solr. I think we can manage with collections and manifests in separate databases so long as we do the slug uniqueness lookup in both.
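One way to do that lookup is to install the same slug view in both databases and query each on write. A sketch, assuming documents carry a `slug` field (the field name and view layout are assumptions, not the actual design):

```javascript
// Hypothetical map function, installed identically in both the
// 'collections' and 'manifests' databases, indexing documents by slug.
// The `slug` field name is an assumption about the document schema.
function slugMap(doc) {
  if (doc.slug) {
    emit(doc.slug, doc._id);
  }
}

// After querying the slug view in each database, a slug is free to use
// only if neither database returned a row for it.
function slugIsFree(collectionRows, manifestRows) {
  return collectionRows.length === 0 && manifestRows.length === 0;
}
```

The uniqueness check is then two cheap keyed view lookups per write, one against each database.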

> This would effectively be creating 3 new databases to replace the functionality of 'internalmeta'.

> • What 'hammer' does once 'smelter' exists will be much smaller. It will handle the various transformations from stored metadata (the 3 manifest dmdSec types, and the 1 or 2 OCR XML formats) to the schema we are using for search and presentation.

Are you thinking that Hammer is going to take input from, and then output into, these new databases? I know it's probably too big a task to take on at once, but I'm wondering if we can make this a two-step process: Smelter getting everything into position, and then Hammer-Press putting things into copresentation/cosearch. I imagine when we eventually sort out what the co* databases look like in an IIIF world, this is the approach we want to take.

> • Press currently does a GET of hammer.json, which is one big array containing things that would be in a manifest (currently index 0) and canvasses (index 1...). This becomes two GETs: one for the manifest information, which includes an array of canvasses, and another using the array of canvas IDs against the canvas database to get that data. This holds if the data is stored within the document and not as an attachment. If we store some data, such as OCR data, as an attachment, then we'll need a separate GET for each canvas, which might be a performance issue worth thinking about.
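For what it's worth, CouchDB can serve the second of those two GETs as a single bulk request (`POST /{db}/_all_docs?include_docs=true` with a `keys` body), so a per-canvas GET is only forced in the attachment case. A sketch; the `canvases` field name and the 'canvas' database name are assumptions:

```javascript
// Build the single bulk-fetch request for all canvases in a manifest,
// using CouchDB's _all_docs endpoint with a `keys` body.
// The `canvases` field name and the 'canvas' database name are assumptions.
function canvasBulkRequest(manifestDoc, couchUrl) {
  return {
    method: "POST",
    url: couchUrl + "/canvas/_all_docs?include_docs=true",
    headers: { "Content-Type": "application/json" },
    // One request fetches every canvas document listed in the manifest.
    body: JSON.stringify({ keys: manifestDoc.canvases }),
  };
}
```

The response rows come back in the same order as the `keys` array, which preserves canvas sequence without extra sorting.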

All this time I'd had a fairly utopian vision of OCR data sitting in Swift and only getting pulled up when we need to create a derivative of it for public search purposes. Seeing as that's not so likely to be a reality any time soon, we probably want OCR data living in canvas documents directly and not as an attachment for the reasons you've laid out.

> The question about OCR data may have a short-term answer and a long-term one.
>
> • In the short term we do the same thing we currently do, which is store a single blob of search text associated with a manifest into a search document for that manifest.
> • In the long term we determine whether there is a better way to handle search that is canvas-focused, rather than storing search data twice (once for the manifest, and again for each canvas).

I'm actually not sure if we'll ever be able to pull off not storing search data twice, although we certainly would only want to store positional OCR data in canvas annotations.

> If we decide to split 'internalmeta' into 3, we then need to decide if we are doing the same with 'copresentation' and 'cosearch' (and the associated Solr cores -- the solrstream tool is dumb, so doesn't care as long as there is a matching CouchDB database for every Solr core).

An easy split to pull off before we need to start making big changes in CAP is splitting canvases into their own databases/cores. All canvas lookups take place separately from collection/manifest ones.

Here's my own question: at what point are we going to move the parent/seq trackers that manifests keep for their parent series record into said series' collection document? If we're trying to create a many-to-many ownership situation for collections and manifests in the access platform, it seems we'd want to make that switch as soon as we can. I'm aware that series records can and will be Smelted after their issues (unless we keep that from happening?), and so there are lots of timing issues to parse here.

Theoretically we can do our first pass of the collection graph by having every kind of collection have a member list (which will have an order, which will have meaning for series and no meaning for "tag" collections), and then writing a view that... would have to be recursive and probably can't happen. Hmm. I'll keep thinking about this.

RussellMcOrmond commented 4 years ago

> Are you thinking that Hammer is going to take input from, and then output into, these new databases? I know it's probably too big a task to take on at once but I'm wondering if we can make this a two-step process: Smelter getting everything into position, and then Hammer-Press putting things into copresentation/cosearch.

The original tool had only one step: read from AIPs and some MySQL tables, and post to Solr.

The design decision for Hammer, Press and Solrstream was to separate steps in order to improve efficiency, using databases to hold the intermediary steps in data processing. Hammer would do all the thinking that was needed, and Press would only glue pre-processed documents together. That way a small update wouldn't require re-doing all the processing, only the part of the processing the change impacted, followed by a re-glue.

With Smelter this becomes 4 steps: a step which separates DIPs (AIPs now for CIHM, but DIPs with Archivematica) into the components we want to keep, a Hammer step which "smashes" the data into the format we want (any crosswalking, etc.), a Press step which glues the already processed components together, and a Solrstream step which posts cosearch documents into Solr.
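As a rough sketch of how those four stages chain together (stage names come from the comment above; every signature and data shape is invented for illustration):

```javascript
// Hypothetical sketch of the four-stage pipeline: Smelter -> Hammer ->
// Press -> Solrstream. Data shapes are invented; only the stage names
// and their responsibilities come from the discussion.
const stages = [
  // Smelter: separate the DIP/AIP into the components we want to keep.
  (dip) => ({ components: [dip.mets] }),
  // Hammer: crosswalk each component into our search/presentation schema.
  (smelted) => ({ processed: smelted.components.map((c) => c.toUpperCase()) }),
  // Press: glue the already-processed components together.
  (hammered) => ({ cosearchDoc: hammered.processed.join(" ") }),
  // Solrstream: post the cosearch document into Solr.
  (pressed) => ({ posted: pressed.cosearchDoc.length > 0 }),
];

function runPipeline(dip) {
  return stages.reduce((data, stage) => stage(data), dip);
}
```

The point of the intermediate databases is that any single stage can be re-run from its stored input without repeating the stages before it.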

> All this time I'd had a fairly utopian vision of OCR data sitting in Swift and only getting pulled up when we need to create a derivative of it for public search purposes.

This sounds like a great idea for a future post-P-A-split enhancement. To me this envisions a far more intelligent 'solrstream', one that reads references to OCR data and pastes the referenced text in directly.

I think we should look at this as part of a larger project, as this might also be the time to move away from the simple 'tx' field to actually storing positional data in Solr. This would require research into how positional data is handled, and it is a key component of the "search term highlighting" user request.

> I'm actually not sure if we'll ever be able to pull off not storing search data twice, although we certainly would only want to store positional OCR data in canvas annotations.

This is part of the Solr work I'm hoping we can take on as a future project. Rick was doing some work with having search on sub-components of a larger whole, but we never adopted it. If we were posting manifests with sub-components containing canvas data, then the duplication would happen when the same canvas occurs in multiple manifests rather than always being duplicated.
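For reference, Solr's JSON update format does support that shape today via nested child documents (`_childDocuments_`). A sketch of a manifest posted with canvas sub-components; all field names are assumptions about our schema:

```javascript
// Sketch: a manifest as a Solr parent document with canvas child
// documents, using Solr's nested-document JSON format (_childDocuments_).
// Field names (noid, label, tx) are assumptions about our schema.
function manifestToSolrDoc(manifest, canvases) {
  return {
    id: manifest.noid,
    type: "manifest",
    label: manifest.label,
    _childDocuments_: canvases.map((canvas) => ({
      id: canvas.noid,
      type: "canvas",
      tx: canvas.text, // OCR text stored once, on the canvas child
    })),
  };
}
```

With this shape, OCR text duplication only happens when the same canvas occurs in multiple manifests, rather than always being stored once per manifest and once per canvas.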

> At what point are we going to move the parent/seq trackers that manifests keep for their parent series record into said series' collection document?

We need to plan that.

One option is for us to bulk load series records first, and then not allow series records to be smeltered at all. Any new series manipulations would need to happen within the UI provided by the access platform.

As to "recursive" views, we also have upgrading CouchDB as a future project. There are many features of the 2.x+ series that allow much better views, based on the ability to have the output of one view be the input to a second view.

I know there are efficiencies in having fewer upgrade iterations, but I think getting P-A split done and getting Beth and Jason's teams actively using those tools and then on to practical use of Archivematica is worth postponing most other access platform enhancements.

SaschaAdler commented 4 years ago

> With Smelter this becomes 4 steps: a step which separates DIPs (AIPs now for CIHM, but DIPs with Archivematica) into the components we want to keep, a Hammer step which "smashes" the data into the format we want (any crosswalking, etc.), a Press step which glues the already processed components together, and a Solrstream step which posts cosearch documents into Solr.

This all sounds good. As we sort out how to consolidate our various dmd/OCR formats and run one-time operations on the files stored in Swift, Hammer can get simpler over time.

Needless to say I'm excited to get back into Solr research, when we decide we can spare the time.

> One option is for us to bulk load series records first, and then not allow series records to be smeltered at all. Any new series manipulations would need to happen within the UI provided by the access platform.

I like this. It helps that series records are small and relatively easy to parse.

> As to "recursive" views, we also have upgrading CouchDB as a future project. There are many features of the 2.x+ series that allow much better views, based on the ability to have the output of one view be the input to a second view.

So I've done a bit more thinking on this. This example from the CouchDB 2 documentation is something we can use to emulate the collection graph in our non-graph DB. I can't find this example in the CouchDB 1 documentation and so I have no idea if emitting a different _id can be done in CouchDB 1, or if we'd need to upgrade to CouchDB 2 as part of this process. Honestly, this seems like a good time to look into that, as we're creating new databases from scratch.
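For anyone reading along, the feature in question is CouchDB's "linked documents" behaviour: a map function emits a value of the form `{_id: otherDocId}`, and querying with `?include_docs=true` resolves the referenced document instead of the emitting one. A sketch against an assumed `members` field:

```javascript
// Sketch of a map function using CouchDB's "linked documents" feature:
// emitting {_id: <member id>} lets ?include_docs=true return the member
// document itself. The `type` and `members` field names are assumptions.
function membersMap(doc) {
  if (doc.type === "collection" && Array.isArray(doc.members)) {
    doc.members.forEach(function (memberId, position) {
      // Key [collection id, position] keeps members in order, which
      // matters for series and is ignorable for "tag" collections.
      emit([doc._id, position], { _id: memberId });
    });
  }
}
```

A range query on the key prefix `[collectionId]` then returns the ordered member documents in one request, though walking deeper ancestry still needs one query per level.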

There are two concerns with using this example as the basis for our collection graph:

  1. While lookups of a given collection/manifest's ancestry are cheap, updating the graph is expensive. If we, say, add a large collection into another large collection, we will need to update a fair number of documents.
  2. Implied in 1. is the idea that we're probably going to have to keep collections and manifests in the same database. The more I think about this, the less troubled I am by it. In the IIIF model, collections can comprise both manifests and other collections, and so it seems like the two concepts live in the same graph. We will also be performing searches on both at the same time, and so we'd have to conflate the two in Solr anyway.

If we don't like or can't adopt this idea, we could look into using a separate piece of software to handle the collection graph, and keep CouchDB for document storage. That gets complicated, of course.

> I know there are efficiencies in having fewer upgrade iterations, but I think getting P-A split done and getting Beth and Jason's teams actively using those tools and then on to practical use of Archivematica is worth postponing most other access platform enhancements.

I agree that the split has to happen as soon as possible.

SaschaAdler commented 4 years ago

I've tested the CouchDB 2 example in CouchDB 1.7 and it works.

RussellMcOrmond commented 4 years ago

I created #12 based on the original idea of having 3 separate databases, so I'm looking for comments.

I am wondering what feature you are looking for with the join. Is this future design work, or part of the P-A split prerequisite for the Archivematica project?

RussellMcOrmond commented 4 years ago

I believe design has been firmed up, and this issue can be closed.