NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

use closure table for version chain #1429

Open mbjones opened 4 years ago

mbjones commented 4 years ago

Our current approach to linking versions of objects in DataONE and Metacat is to provide a pointer in the SystemMetadata for each object that points at the objects that it obsoletes and those that are obsoletedBy it. This represents a doubly-linked list that can be traversed to reconstruct version information. The obsoletes and obsoletedBy fields are in the Metacat postgres database as columns in the system_metadata table, with the abbreviated structure:

guid, obsoletes, obsoleted_by

From a hierarchical data modeling perspective, this structure is called an adjacency list, and it allows one to trace the version chain through a series of queries, each of which walks the chain one step forwards or backwards. Without support for hierarchical queries (which are proprietary and perform poorly), this generally means issuing n-1 queries, where n is the length of the version chain.
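To make the cost concrete, here is a minimal sketch of that adjacency-list walk, run in SQLite with a hypothetical three-row `system_metadata` table (column names as above); note the one-query-per-hop pattern:

```python
import sqlite3

# Hypothetical minimal schema: guid, obsoletes, obsoleted_by (adjacency list).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE system_metadata (guid TEXT, obsoletes TEXT, obsoleted_by TEXT)")
con.executemany("INSERT INTO system_metadata VALUES (?, ?, ?)",
                [("P1", None, "P2"), ("P2", "P1", "P3"), ("P3", "P2", None)])

def walk_forward(pid):
    """Walk the chain forward, issuing one query per hop (plus one to detect the end)."""
    chain, queries = [pid], 0
    while True:
        row = con.execute(
            "SELECT obsoleted_by FROM system_metadata WHERE guid = ?",
            (chain[-1],)).fetchone()
        queries += 1
        if row is None or row[0] is None:
            return chain, queries
        chain.append(row[0])

print(walk_forward("P1"))  # (['P1', 'P2', 'P3'], 3)
```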

Much has been written about querying hierarchical data. Some good background reading is:

From the first set of slides, and other reading, I have concluded that closure tables are a much more efficient way to store and query this hierarchical data. The approach involves creating a new table for just the hierarchical information, with one row for each link in the version chain, including links at every level of the hierarchy. This table structure allows a single query to retrieve all parent and child information in the tree, so it is extremely fast and only marginally more expensive in terms of storage. The articles linked above explain this much more thoroughly than I will here, but the essence is that the closure table contains the following columns:

parent, child, depth

A simple query to get all of the versions for a given object would be:

```sql
SELECT * FROM closure_table WHERE parent = 'P1';
```

We'll need to develop a more mature design, and an API to go along with it, but I think closure tables will allow us to provide efficient version information in our services through a new API. Below I attach my notes on the design and use of closure tables for reference.

closure-tables-notes.pdf

Comments and expansion of ideas appreciated. @csjx, @amoeba, @gothub, @taojing2002 @laurenwalker

rushirajnenuji commented 4 years ago

Interesting! Thank you for introducing this concept. I can see this getting utilized for the D1 Metrics Service, where we traverse back in the version chain to calculate the metrics.

amoeba commented 4 years ago

Really cool, @mbjones, I hadn't heard of a closure table before. I think having one or more O(1) APIs around version chains would be very helpful across current and future projects.

Another approach I implemented recently and mostly like is recursive common table expressions (CTEs). My understanding is that they're O(n), but the O(n) happens inside the database engine, so queries over even large hierarchies return in <1 ms, as opposed to O(n) round trips over HTTPS, which is very expensive by comparison.

The possible benefit I see to CTEs over closure tables is that they don't require a second table and work on the existing implicit hierarchies in your tables. As for downsides, from a quick web search their performance seems to be of the same order of magnitude as closure tables, but a bit slower.

Here's an example:

Given a table like our systemmetadata table in Metacat's database representing a chain from PIDs a->b->c:

| guid | obsoleted_by | obsoletes |
|------|--------------|-----------|
| a    | b            | NULL      |
| b    | c            | a         |
| c    | NULL         | b         |

We can query for 'a's descendants (the literal string 'a' is embedded in the query below as the starting point of the walk):

```sql
WITH RECURSIVE
  children(x) AS (
    VALUES ('a')
  UNION
    SELECT guid FROM sysmeta, children
    WHERE sysmeta.obsoletes = children.x
  )
SELECT guid FROM sysmeta
WHERE sysmeta.guid IN (SELECT * FROM children);
```

which returns:

```
guid
----
a
b
c
```

A recursive CTE effectively asks the query engine to walk from row to row, forming unions with matched rows as it goes. Or at least that's my basic understanding. Food for thought.
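For anyone who wants to try this, the query above runs as-is in SQLite (the production table lives in Postgres, but the SQL:99 `WITH RECURSIVE` form is the same); a self-contained sketch with an assumed `sysmeta` table holding the a->b->c chain:

```python
import sqlite3

# Build the three-row chain a -> b -> c from the table above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sysmeta (guid TEXT, obsoleted_by TEXT, obsoletes TEXT)")
con.executemany("INSERT INTO sysmeta VALUES (?, ?, ?)",
                [("a", "b", None), ("b", "c", "a"), ("c", None, "b")])

# The recursive CTE walks the obsoletes pointers starting from 'a'.
rows = con.execute("""
    WITH RECURSIVE children(x) AS (
        VALUES ('a')
      UNION
        SELECT guid FROM sysmeta, children
        WHERE sysmeta.obsoletes = children.x
    )
    SELECT guid FROM sysmeta
    WHERE sysmeta.guid IN (SELECT * FROM children);
""").fetchall()
print([r[0] for r in rows])  # ['a', 'b', 'c']
```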

mbjones commented 4 years ago

Thanks @amoeba for the thoughts on CTEs. In my post, when I referred to "support for hierarchical queries", I was referring to CTEs, which were introduced in SQL:99 and are described on slide 17 of Karwin's slideset that I linked. Historically, each RDBMS implemented the syntax and algorithm for hierarchical queries in its own proprietary way, even though they are all very similar. Postgres has provided the SQL:99 recursive form of WITH since 8.4, whereas Oracle historically used CONNECT BY and now implements both. MySQL also implemented CTEs in its 8.0.11 release in 2018. So it seems CTEs are widespread enough now to be used across platforms.

The tradeoff is this: CTE-based queries are much more complex to write, while closure tables are more complicated to maintain (probably involving triggers). But with closure tables, we can use really fast, simple queries to get complex results. Here's a quick example of some queries we would probably want to support via an API.

```sql
CREATE TABLE version_closure (
    ancestor text NOT NULL,
    descendant text NOT NULL,
    depth integer,
    PRIMARY KEY (ancestor, descendant)
);

INSERT INTO version_closure VALUES
    ('P1', 'P1', 0),
    ('P2', 'P2', 0),
    ('P3', 'P3', 0),
    ('P1', 'P2', 1),
    ('P2', 'P3', 1),
    ('P1', 'P3', 2);
```
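A sketch of the kinds of single, index-friendly queries that table enables, run here in SQLite for convenience (the method names in the comments are only illustrative):

```python
import sqlite3

# Same DDL and rows as the example above.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE version_closure (
    ancestor TEXT NOT NULL,
    descendant TEXT NOT NULL,
    depth INTEGER,
    PRIMARY KEY (ancestor, descendant))""")
con.executemany("INSERT INTO version_closure VALUES (?, ?, ?)",
                [("P1", "P1", 0), ("P2", "P2", 0), ("P3", "P3", 0),
                 ("P1", "P2", 1), ("P2", "P3", 1), ("P1", "P3", 2)])

# getDescendants(pid)-style: every version at or after P1, one query.
descendants = [r[0] for r in con.execute(
    "SELECT descendant FROM version_closure WHERE ancestor = 'P1' ORDER BY depth")]

# getLastDescendant(pid)-style: the newest version reachable from P1.
last = con.execute(
    "SELECT descendant FROM version_closure WHERE ancestor = 'P1' "
    "ORDER BY depth DESC LIMIT 1").fetchone()[0]

# Walking backwards (ancestors of P3) uses the same table at the same cost.
ancestors = [r[0] for r in con.execute(
    "SELECT ancestor FROM version_closure WHERE descendant = 'P3' ORDER BY depth")]

print(descendants, last, ancestors)  # ['P1', 'P2', 'P3'] P3 ['P3', 'P2', 'P1']
```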

So it's that simplicity that I like. But it comes at the expense of having to maintain the closure table. Even a year or two ago I would have said that CTEs were not widely implemented enough to rely on, but as of 2018 I no longer think that is the case, so we could choose to go that route as well. It certainly would be nice to not have a separate table. And it might be that, despite their complexity, we simply need to implement the CTE queries once as part of a new API that might include methods like getDescendants(pid), getFirstDescendant(pid), and getLastDescendant(pid), for example. So it seems to me either approach could work.

Other pros and cons? Do we need new API methods like I described, or do we just need this version chain accessible from SOLR somehow and facetable? Let's discuss.

csjx commented 3 years ago

To move this discussion along a bit, since this functionality is becoming even more important, I'll suggest a modified version of Matt's proposal for a new API. I definitely think we could use an API independent of the Solr index, but we could also populate the index as well. Given the usability issues with names like ascendant, descendant, antecedent, ancestor, predecessor, successor, etc., I'm wondering about the following API (hopefully fairly intuitive):

Given an obsolescence chain with an ordered pid list of:

A, B, C, D, E, F, G

Listing versions

(for these examples, id = "D")

Getting a specific version

(for these examples, id = "D")

These methods could potentially be reduced to:

So, this warrants more thought. I'd love to hear what resonates or doesn't. I'm already wondering if subsequent should be next in these calls. 🤔

gothub commented 3 years ago

@chris the reductionist versions of these look good to me:

It's reasonable to follow the method-name convention of getXXX returning a single value and listXXX returning multiple values, as that is mostly the convention followed by the DataONE API, with MNCore.getLogRecords() being an exception.

If this naming convention isn't followed, then these could be reduced to a single method, with the default being to return all values if range and count are not specified.

mbjones commented 3 years ago

Thanks @csjx, this looks like a great start. I think what would be useful in evaluating it is to consider which use cases we want these APIs to solve. Can we develop those out? Here are a few to get started:

Version info on landing pages

Version info for metadata-check analysis

| PID | versionGroup | version | formatId | dateUpdated | ... |
|-----|--------------|---------|----------|-------------|-----|
| P1  | group1       | 1       | text/csv | 2020-01-01  |     |
| P2  | group1       | 2       | text/csv | 2020-02-01  |     |
| P3  | group1       | 3       | text/csv | 2020-03-01  |     |
| P4  | group2       | 1       | text/csv | 2019-06-01  |     |
| P5  | group2       | 2       | text/csv | 2019-07-01  |     |

Version info for metric analysis

csjx commented 3 years ago

Another use case:

Version info for cloning content programmatically

@rushirajnenuji - Will you add in the Metrics use cases that @mbjones mentioned above?

amoeba commented 3 years ago

Since a version chain can be infinitely long, it seems to me like we'd want full pagination support. This'd mean having a start or offset argument to pair with count on listVersions. Essentially the same thing listObjects already has.
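A sketch of what paginated listing could look like over a recursive CTE (the listVersions(pid, start, count) signature is hypothetical, and the sysmeta layout is assumed from the earlier example):

```python
import sqlite3

# Build a longer chain v1 -> v2 -> ... -> v7 in the assumed sysmeta layout.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sysmeta (guid TEXT, obsoleted_by TEXT, obsoletes TEXT)")
chain = ["v%d" % i for i in range(1, 8)]
con.executemany("INSERT INTO sysmeta VALUES (?, ?, ?)",
                [(g, chain[i + 1] if i + 1 < len(chain) else None,
                  chain[i - 1] if i > 0 else None) for i, g in enumerate(chain)])

def list_versions(pid, start=0, count=3):
    """Hypothetical paginated listVersions: LIMIT/OFFSET over the CTE result."""
    return [r[0] for r in con.execute("""
        WITH RECURSIVE children(x, depth) AS (
            VALUES (?, 0)
          UNION
            SELECT guid, depth + 1 FROM sysmeta, children
            WHERE sysmeta.obsoletes = children.x
        )
        SELECT x FROM children ORDER BY depth LIMIT ? OFFSET ?;
    """, (pid, count, start))]

print(list_versions("v1", start=0))  # ['v1', 'v2', 'v3']
print(list_versions("v1", start=3))  # ['v4', 'v5', 'v6']
```

Tracking depth in the CTE also gives a stable sort key for the pages.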

And here's a use case we currently have that could be improved by this API change:

Use case: Forwarding client requests to the latest version

Clients include MetacatUI, R, etc. Current implementations walk the version chain forward with repeated getSystemMetadata calls, and this API would reduce the number of requests the client needs to send from n to one.

Edit: I see this is basically https://github.com/NCEAS/metacatui/issues/1400.
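This forwarding case reduces to one recursive query: the latest version is the descendant whose obsoleted_by is NULL. A sketch under the same assumed sysmeta layout as the earlier a->b->c example:

```python
import sqlite3

# Same assumed three-row chain a -> b -> c.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sysmeta (guid TEXT, obsoleted_by TEXT, obsoletes TEXT)")
con.executemany("INSERT INTO sysmeta VALUES (?, ?, ?)",
                [("a", "b", None), ("b", "c", "a"), ("c", None, "b")])

def latest_version(pid):
    """Resolve the head of the version chain in a single round trip."""
    return con.execute("""
        WITH RECURSIVE children(x) AS (
            VALUES (?)
          UNION
            SELECT guid FROM sysmeta, children
            WHERE sysmeta.obsoletes = children.x
        )
        SELECT guid FROM sysmeta
        WHERE guid IN (SELECT x FROM children) AND obsoleted_by IS NULL;
    """, (pid,)).fetchone()[0]

print(latest_version("a"))  # c
```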

rushirajnenuji commented 3 years ago

Version info for generating metrics.

amoeba commented 3 years ago

We discussed this on our Sep 10, 2020 dev call and continued in a break out afterwards for a bit. I'm commenting here to bring some of that discussion back here for visibility.

From looking at our use cases, we found we really have three categories:

  1. Clients that need information about one or more versions in an object series. e.g., given an object's PID, tell me newest version
  2. Clients that need a virtual identifier to identify, resolve, and detect membership within the entire series of versions for an object. This is kinda like a series ID, but would be 1:1 with the entire series, whereas an object version series can have more than one series ID. e.g., Metadig calculates scores for an entire object version series and needs an identifier to group by
  3. Clients that need a virtual identifier to identify and resolve all objects and versions within a collection of objects. e.g., Metadig & metrics for portals

(1) is different from (2) and (3), while (2) and (3) are alike in that they're both about the idea of virtual identifiers. (3) is complicated enough that we might stick to tackling just (1), and possibly (2), for now.

Where is this information available to clients? None of the use cases are real-time enough that we strictly need to make the information available directly from the database, which means Solr could be where this information lives, and we don't necessarily need to expose any new APIs here.

Is version information always public or restricted to those who can read? If we exposed an API like listVersions above, would it include in its response versions the client can't read? If not, handling the return value gets tricky because of the gaps. If we store a virtual identifier as another Solr field, object visibility results would apply by default.

What can we solve?

Another way to describe (1) is that clients are currently walking sysmeta via the obsoletes and obsoletedBy properties. This is slow, foremost, but is also limiting (as seen in https://github.com/NCEAS/metacatui/issues/1400) because that walking process stops if the client doesn't have access to a version in the middle of the chain.

Another way to describe (2) and (3) is that external services such as Metadig & Metrics are doing their own version chain crawl (use case group 1) and creating virtual identifiers for object series (use case group 2).

Steps forward

Ignoring how for now, we have two things we can do:

A. Implement a versions API
B. Implement a virtual (or even a real) identifier for every object version series

(A) solves some of our problems and, IMO, is valuable in its own right whether it solves our problems or not. And it's the simpler/easier thing to do. (B) solves more/most of our problems but is a bit more involved though still tractable.

My vote at this point is for starting with a versions API (A) implemented with recursive CTEs. This avoids major changes to the database (which increases our time to implement and impacts all users when we/they upgrade), appears to be plenty performant, and can have its implementation swapped out at a later date if we implement some form of virtual identifier.

Please have a read and let me know if I've represented your case accurately, missed any glaring upsides/downsides, etc.

mbjones commented 1 month ago

@taojing2002 @artntek Let's please discuss this version chain API request wrt upcoming Metacat releases. If we take the CTE approach as @amoeba concludes, then this could be done with the existing data tables and may not take too much time. I'd like to use it for both the MetaDIG and Metrics services.