NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

use closure table for version chain #1429

Open mbjones opened 4 years ago

mbjones commented 4 years ago

Our current approach to linking versions of objects in DataONE and Metacat is to provide a pointer in the SystemMetadata for each object that points at the objects that it obsoletes and those that are obsoletedBy it. This represents a doubly-linked list that can be traversed to reconstruct version information. The obsoletes and obsoletedBy fields are in the Metacat postgres database as columns in the system_metadata table, with the abbreviated structure:

guid, obsoletes, obsoleted_by

From a hierarchical data modeling perspective, this structure is called an adjacency list, and it allows one to trace the version chain through a series of queries, each of which walks the chain one step forwards or backwards. Without support for hierarchical queries (which are proprietary and perform poorly), this generally means issuing n-1 queries, where n is the length of the version chain.
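To make the cost concrete, here is a minimal sketch of that adjacency-list walk, run in SQLite with a hypothetical three-row `system_metadata` table (column names as above); note the one-query-per-hop pattern:

```python
import sqlite3

# Hypothetical minimal schema: guid, obsoletes, obsoleted_by (adjacency list).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE system_metadata (guid TEXT, obsoletes TEXT, obsoleted_by TEXT)")
con.executemany("INSERT INTO system_metadata VALUES (?, ?, ?)",
                [("P1", None, "P2"), ("P2", "P1", "P3"), ("P3", "P2", None)])

def walk_forward(pid):
    """Walk the chain forward, issuing one query per hop (plus one to detect the end)."""
    chain, queries = [pid], 0
    while True:
        row = con.execute(
            "SELECT obsoleted_by FROM system_metadata WHERE guid = ?",
            (chain[-1],)).fetchone()
        queries += 1
        if row is None or row[0] is None:
            return chain, queries
        chain.append(row[0])

print(walk_forward("P1"))  # (['P1', 'P2', 'P3'], 3)
```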

Much has been written about querying hierarchical data. Some good background reading is:

From the first set of slides, and other reading, I have concluded that closure tables are a much more efficient way to store and query this hierarchical data. The approach involves creating a new table for just the hierarchical information, with one row for each link in the version chain, including links at every level of the hierarchy. This table structure allows a single query to retrieve all parent and child information in the tree, so it is extremely fast and only marginally more expensive in terms of storage. The articles linked above explain this much more thoroughly than I will here, but the essence is that the closure table contains the following columns:

parent, child, depth

A simple query to get all of the versions for a given object would be:

```sql
SELECT * FROM closure_table WHERE parent = 'P1';
```

We'll need to develop a more mature design, and an API to go along with it, but I think closure tables will allow us to provide efficient version information in our services through a new API. Below I attach my notes on the design and use of closure tables for reference.

closure-tables-notes.pdf

Comments and expansion of ideas appreciated. @csjx, @amoeba, @gothub, @taojing2002 @laurenwalker

rushirajnenuji commented 4 years ago

Interesting! Thank you for introducing this concept. I can see this getting utilized for the D1 Metrics Service, where we traverse back in the version chain to calculate the metrics.

amoeba commented 4 years ago

Really cool, @mbjones, I hadn't heard of a closure table before. I think having one or more O(1) APIs around version chains would be very helpful across current and future projects.

Another approach I implemented recently and mostly like is recursive common table expressions (CTEs). My understanding is that they're O(n), but the O(n) happens inside the database engine, so queries over even large hierarchies return in <1 ms, as opposed to O(n) round trips over HTTPS, which is very expensive by comparison.

The possible benefit I see to CTEs over closure tables is that they don't require a second table and work on the existing implicit hierarchies in your tables. As for downsides, from a quick web search their performance seems to be of the same order of magnitude as closure tables, but a bit slower.

Here's an example:

Given a table like our systemmetadata table in Metacat's database representing a chain from PIDs a->b->c:

| guid | obsoleted_by | obsoletes |
|------|--------------|-----------|
| a    | b            | NULL      |
| b    | c            | a         |
| c    | NULL         | b         |

We can query for 'a's descendants (the literal string 'a' is embedded in the query below as the starting point of the walk):

```sql
WITH RECURSIVE
  children(x) AS (
    VALUES ('a')
  UNION
    SELECT guid FROM sysmeta, children
    WHERE sysmeta.obsoletes = children.x
  )
SELECT guid FROM sysmeta
WHERE sysmeta.guid IN (SELECT * FROM children);
```

which returns:

```
guid
----
a
b
c
```

A recursive CTE effectively asks the query engine to walk from row to row, forming unions with matched rows as it goes. Or at least that's my basic understanding. Food for thought.
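For anyone who wants to try this, the query above runs as-is in SQLite (the production table lives in Postgres, but the SQL:99 `WITH RECURSIVE` form is the same); a self-contained sketch with an assumed `sysmeta` table holding the a->b->c chain:

```python
import sqlite3

# Build the three-row chain a -> b -> c from the table above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sysmeta (guid TEXT, obsoleted_by TEXT, obsoletes TEXT)")
con.executemany("INSERT INTO sysmeta VALUES (?, ?, ?)",
                [("a", "b", None), ("b", "c", "a"), ("c", None, "b")])

# The recursive CTE walks the obsoletes pointers starting from 'a'.
rows = con.execute("""
    WITH RECURSIVE children(x) AS (
        VALUES ('a')
      UNION
        SELECT guid FROM sysmeta, children
        WHERE sysmeta.obsoletes = children.x
    )
    SELECT guid FROM sysmeta
    WHERE sysmeta.guid IN (SELECT * FROM children);
""").fetchall()
print([r[0] for r in rows])  # ['a', 'b', 'c']
```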

mbjones commented 4 years ago

Thanks @amoeba for the thoughts on CTEs. In my post, when I referred to "support for hierarchical queries", I was referring to CTEs, which were introduced in SQL:99 and are described on slide 17 of Karwin's slideset that I linked. Historically, each RDBMS implemented the syntax and algorithm for hierarchical queries in its own proprietary way, even though they are all very similar. Postgres has provided the SQL:99 recursive form of WITH since 8.4, whereas Oracle historically used CONNECT BY and now implements both. MySQL also implemented CTEs in its 8.0.11 release in 2018. So it seems CTEs are widespread enough now to be used across platforms.

The tradeoff is this: CTE-based queries are much more complex to write, while closure tables are more complicated to maintain (probably involving triggers). But with closure tables, we can use really fast, simple queries to get complex results. Here's a quick example of some queries we would probably want to support via an API.

```sql
CREATE TABLE version_closure (
    ancestor text NOT NULL,
    descendant text NOT NULL,
    depth integer,
    PRIMARY KEY (ancestor, descendant)
);

INSERT INTO version_closure VALUES
    ('P1', 'P1', 0),
    ('P2', 'P2', 0),
    ('P3', 'P3', 0),
    ('P1', 'P2', 1),
    ('P2', 'P3', 1),
    ('P1', 'P3', 2);
```
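A sketch of the kinds of single, index-friendly queries that table enables, run here in SQLite for convenience (the method names in the comments are only illustrative):

```python
import sqlite3

# Same DDL and rows as the example above.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE version_closure (
    ancestor TEXT NOT NULL,
    descendant TEXT NOT NULL,
    depth INTEGER,
    PRIMARY KEY (ancestor, descendant))""")
con.executemany("INSERT INTO version_closure VALUES (?, ?, ?)",
                [("P1", "P1", 0), ("P2", "P2", 0), ("P3", "P3", 0),
                 ("P1", "P2", 1), ("P2", "P3", 1), ("P1", "P3", 2)])

# getDescendants(pid)-style: every version at or after P1, one query.
descendants = [r[0] for r in con.execute(
    "SELECT descendant FROM version_closure WHERE ancestor = 'P1' ORDER BY depth")]

# getLastDescendant(pid)-style: the newest version reachable from P1.
last = con.execute(
    "SELECT descendant FROM version_closure WHERE ancestor = 'P1' "
    "ORDER BY depth DESC LIMIT 1").fetchone()[0]

# Walking backwards (ancestors of P3) uses the same table at the same cost.
ancestors = [r[0] for r in con.execute(
    "SELECT ancestor FROM version_closure WHERE descendant = 'P3' ORDER BY depth")]

print(descendants, last, ancestors)  # ['P1', 'P2', 'P3'] P3 ['P3', 'P2', 'P1']
```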

So it's that simplicity that I like. But it comes at the expense of having to maintain the closure table. Even a year or two ago I would have said that CTEs were not widely implemented enough to rely on, but as of 2018 I no longer think that is the case, so we could choose to go that route as well. It certainly would be nice to not have a separate table. And it might be that, despite their complexity, we simply need to implement the CTE queries once as part of a new API that might include methods like getDescendants(pid), getFirstDescendant(pid), and getLastDescendant(pid), for example. So it seems to me either approach could work.

Other pros and cons? Do we need new API methods like I described, or do we just need this version chain accessible from SOLR somehow and facetable? Let's discuss.

csjx commented 3 years ago

To move this discussion along a bit, since this functionality is becoming even more important, I'll suggest a modified version of Matt's proposal for a new API. I definitely think we could use an API independent of the Solr index, but we could also populate the index as well. Given the usability issues with names like ascendant, descendant, antecedent, ancestor, predecessor, successor, etc., I'm wondering about the following API (hopefully fairly intuitive):

Given an obsolescence chain with an ordered pid list of:

A, B, C, D, E, F, G

Listing versions

(for these examples, id = "D")

Getting a specific version

(for these examples, id = "D")

These methods could potentially be reduced to:

So, this warrants more thought. I'd love to hear what resonates or doesn't. I'm already wondering if subsequent should be next in these calls. 🤔

gothub commented 3 years ago

@chris the reductionist versions of these look good to me:

It's reasonable to follow the method-name convention of getXXX returning a single value and listXXX returning multiple values, as that is mostly the convention followed by the DataONE API, with MNCore.getLogRecords() being an exception.

If this naming convention isn't followed, then these could be reduced to a single method, with the default being to return all values if range and count are not specified.

mbjones commented 3 years ago

Thanks @csjx, this looks like a great start. I think what would be useful in evaluating it is to consider which use cases we want these APIs to solve. Can we develop those out? Here are a few to get started:

Version info on landing pages

Version info for metadata-check analysis

| PID | versionGroup | version | formatId | dateUpdated | ... |
|-----|--------------|---------|----------|-------------|-----|
| P1  | group1       | 1       | text/csv | 2020-01-01  |     |
| P2  | group1       | 2       | text/csv | 2020-02-01  |     |
| P3  | group1       | 3       | text/csv | 2020-03-01  |     |
| P4  | group2       | 1       | text/csv | 2019-06-01  |     |
| P5  | group2       | 2       | text/csv | 2019-07-01  |     |

Version info for metric analysis

csjx commented 3 years ago

Another use case:

Version info for cloning content programmatically

@rushirajnenuji - Will you add in the Metrics use cases that @mbjones mentioned above?

amoeba commented 3 years ago

Since a version chain can be infinitely long, it seems to me like we'd want full pagination support. This'd mean having a start or offset argument to pair with count on listVersions. Essentially the same thing listObjects already has.
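A sketch of what paginated listing could look like over a recursive CTE (the listVersions(pid, start, count) signature is hypothetical, and the sysmeta layout is assumed from the earlier example):

```python
import sqlite3

# Build a longer chain v1 -> v2 -> ... -> v7 in the assumed sysmeta layout.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sysmeta (guid TEXT, obsoleted_by TEXT, obsoletes TEXT)")
chain = ["v%d" % i for i in range(1, 8)]
con.executemany("INSERT INTO sysmeta VALUES (?, ?, ?)",
                [(g, chain[i + 1] if i + 1 < len(chain) else None,
                  chain[i - 1] if i > 0 else None) for i, g in enumerate(chain)])

def list_versions(pid, start=0, count=3):
    """Hypothetical paginated listVersions: LIMIT/OFFSET over the CTE result."""
    return [r[0] for r in con.execute("""
        WITH RECURSIVE children(x, depth) AS (
            VALUES (?, 0)
          UNION
            SELECT guid, depth + 1 FROM sysmeta, children
            WHERE sysmeta.obsoletes = children.x
        )
        SELECT x FROM children ORDER BY depth LIMIT ? OFFSET ?;
    """, (pid, count, start))]

print(list_versions("v1", start=0))  # ['v1', 'v2', 'v3']
print(list_versions("v1", start=3))  # ['v4', 'v5', 'v6']
```

Tracking depth in the CTE also gives a stable sort key for the pages.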

And here's a use case we currently have that could be improved by this API change:

Use case: Forwarding client requests to the latest version

Clients include MetacatUI, R, etc. Current implementations walk the version chain forward with repeated getSystemMetadata calls, and this API would reduce the number of requests the client needs to send from n to one.

Edit: I see this is basically https://github.com/NCEAS/metacatui/issues/1400.
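This forwarding case reduces to one recursive query: the latest version is the descendant whose obsoleted_by is NULL. A sketch under the same assumed sysmeta layout as the earlier a->b->c example:

```python
import sqlite3

# Same assumed three-row chain a -> b -> c.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sysmeta (guid TEXT, obsoleted_by TEXT, obsoletes TEXT)")
con.executemany("INSERT INTO sysmeta VALUES (?, ?, ?)",
                [("a", "b", None), ("b", "c", "a"), ("c", None, "b")])

def latest_version(pid):
    """Resolve the head of the version chain in a single round trip."""
    return con.execute("""
        WITH RECURSIVE children(x) AS (
            VALUES (?)
          UNION
            SELECT guid FROM sysmeta, children
            WHERE sysmeta.obsoletes = children.x
        )
        SELECT guid FROM sysmeta
        WHERE guid IN (SELECT x FROM children) AND obsoleted_by IS NULL;
    """, (pid,)).fetchone()[0]

print(latest_version("a"))  # c
```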

rushirajnenuji commented 3 years ago

Version info for generating metrics.

amoeba commented 3 years ago

We discussed this on our Sep 10, 2020 dev call and continued in a break out afterwards for a bit. I'm commenting here to bring some of that discussion back here for visibility.

From looking at our use cases, we found we really have three categories:

  1. Clients that need information about one or more versions in an object series. e.g., given an object's PID, tell me newest version
  2. Clients that need a virtual identifier to identify, resolve, and detect membership within the entire series of versions for an object. This is kinda like a series ID, but would be 1:1 with the entire series, whereas an object version series can have more than one series ID. e.g., Metadig calculates scores for an entire object version series and needs an identifier to group by
  3. Clients that need a virtual identifier to identify and resolve all objects and versions within a collection of objects. e.g., Metadig & metrics for portals

(1) is different from (2) and (3), while (2) and (3) are alike in that they're both about the idea of virtual identifiers. (3) is complicated enough that we might stick to tackling just (1), and possibly (2), for now.

Where is this information available to clients? None of the use cases are real-time enough that we strictly need to make the information available directly from the database, which means Solr could be where this information lives, and we don't necessarily need to expose any new APIs here.

Is version information always public or restricted to those who can read? If we exposed an API like listVersions above, would it include in its response versions the client can't read? If not, handling the return value gets tricky because of the gaps. If we store a virtual identifier as another Solr field, object visibility results would apply by default.

What can we solve?

Another way to describe (1) is that clients are currently walking sysmeta via the obsoletes and obsoletedBy properties. This is slow, foremost, but is also limiting (as seen in https://github.com/NCEAS/metacatui/issues/1400) because that walking process stops if the client doesn't have access to a version in the middle of the chain.

Another way to describe (2) and (3) is that external services such as Metadig & Metrics are doing their own version chain crawl (use case group 1) and creating virtual identifiers for object series (use case group 2).

Steps forward

Ignoring how for now, we have two things we can do:

A. Implement a versions API
B. Implement a virtual (or even a real) identifier for every object version series

(A) solves some of our problems and, IMO, is valuable in its own right whether it solves our problems or not. And it's the simpler/easier thing to do. (B) solves more/most of our problems but is a bit more involved though still tractable.

My vote at this point is for starting with a versions API (A) implemented with recursive CTEs. This avoids major changes to the database (which increases our time to implement and impacts all users when we/they upgrade), appears to be plenty performant, and can have its implementation swapped out at a later date if we implement some form of virtual identifier.

Please have a read and let me know if I've represented your case accurately, missed any glaring upsides/downsides, etc.

mbjones commented 1 month ago

@taojing2002 @artntek Let's please discuss this version chain API request wrt upcoming Metacat releases. If we take the CTE approach as @amoeba concludes, then this could be done with the existing data tables and may not take too much time. I'd like to use it for both the MetaDIG and Metrics services.