mbjones opened 4 years ago
Interesting! Thank you for introducing this concept. I can see this getting utilized for the D1 Metrics Service, where we traverse back in the version chain to calculate the metrics.
Really cool, @mbjones, I hadn't heard of a closure table before. I think having one or more O(1) APIs around version chains would be very helpful across current and future projects.
Another approach I implemented recently and mostly like is recursive common table expressions (CTEs). My understanding is that they're O(n), but the O(n) happens inside the database engine, so queries for even large hierarchies return in <1ms, as opposed to O(n) over HTTPS, which is very expensive by comparison.
The possible benefit I see to CTEs over closure tables is that they don't require a second table and work on existing implicit hierarchies in your tables. As for downsides (from a quick web search), their performance seems to be of the same magnitude as closure tables but a bit slower.
Here's an example:
Given a table like our `systemmetadata` table in Metacat's database representing a chain from PIDs a -> b -> c:
| guid | obsoleted_by | obsoletes |
|------|--------------|-----------|
| a    | b            | NULL      |
| b    | c            | a         |
| c    | NULL         | b         |
We can query for 'a's descendants (note that the string 'a' is embedded in the query below because we're querying for 'a's descendants):
```sql
WITH RECURSIVE
  children(x) AS (
    VALUES ('a')
    UNION
    SELECT guid FROM sysmeta, children
    WHERE sysmeta.obsoletes = children.x
  )
SELECT guid FROM sysmeta
WHERE sysmeta.guid IN (SELECT * FROM children);
```
| guid |
|------|
| a    |
| b    |
| c    |
A recursive CTE effectively asks the query engine to walk from row to row, forming unions with matched rows as it goes. Or at least that's my basic understanding. Food for thought.
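To make that concrete, the CTE above runs as-is under SQLite, which supports recursive CTEs. Here's a minimal runnable sketch using Python's `sqlite3`, with the table pared down to just the three columns used in the example:

```python
import sqlite3

# In-memory stand-in for Metacat's system metadata table (names follow
# the example above; the real schema has many more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sysmeta (guid TEXT PRIMARY KEY, obsoleted_by TEXT, obsoletes TEXT)"
)
conn.executemany(
    "INSERT INTO sysmeta VALUES (?, ?, ?)",
    [("a", "b", None), ("b", "c", "a"), ("c", None, "b")],
)

# Recursive CTE: seed the set with 'a', then repeatedly union in each
# row whose `obsoletes` points at something already in the set.
rows = conn.execute("""
    WITH RECURSIVE children(x) AS (
        VALUES ('a')
        UNION
        SELECT guid FROM sysmeta, children WHERE sysmeta.obsoletes = children.x
    )
    SELECT guid FROM sysmeta WHERE guid IN (SELECT x FROM children)
""").fetchall()

print(sorted(r[0] for r in rows))  # ['a', 'b', 'c']
```

The whole chain comes back from a single round trip to the database, which is the point of the O(n)-inside-the-engine argument above.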
Thanks @amoeba for the thoughts on CTEs. In my post, when I referred to "support for hierarchical queries", I was referring to CTEs, which were introduced in SQL:99 and are described on slide 17 of Karwin's slideset that I linked. Historically, each RDBMS implemented the syntax and algorithm for CTEs in a proprietary way, even though they are all very similar. Postgres 9.5 first provided the SQL:99 version of `WITH`, whereas Oracle used to use `CONNECT BY` and now implements both. MySQL also recently implemented them, in its 2018 8.0.11 release. So it seems CTEs are now widespread enough to be used across platforms.
The tradeoff is this: CTE-based queries are much more complex to write, while closure tables are more complicated to maintain (probably involving triggers). But with closure tables, we can use really fast and simple queries to get complex results. Here's a quick example of some queries we would probably want to support via an API.
```sql
CREATE TABLE version_closure (
    ancestor text NOT NULL,
    descendant text NOT NULL,
    depth integer,
    PRIMARY KEY (ancestor, descendant)
);

INSERT INTO version_closure VALUES
    ('P1', 'P1', 0),
    ('P2', 'P2', 0),
    ('P3', 'P3', 0),
    ('P1', 'P2', 1),
    ('P2', 'P3', 1),
    ('P1', 'P3', 2);
```
All descendants (analogous to your CTE query above):

```sql
SELECT descendant FROM version_closure WHERE ancestor = 'P1' ORDER BY depth;
```

Immediate descendants but not grandchildren (i.e., `obsoletedBy`):

```sql
SELECT descendant FROM version_closure WHERE ancestor = 'P1' AND depth = 1;
```

Latest descendant (i.e., latest version HEAD):

```sql
SELECT descendant FROM version_closure WHERE ancestor = 'P1' ORDER BY depth DESC LIMIT 1;
```

All ancestors (i.e., the full version chain, ordered newest to oldest):

```sql
SELECT ancestor FROM version_closure WHERE descendant = 'P3' ORDER BY depth;
```
So it's that simplicity that I like. But it comes at the expense of having to maintain the closure table. Even a year or two ago I would have said that CTEs were not widely implemented enough to rely on them, but as of 2018 I no longer think that is the case, so we could choose to go that route as well. It certainly would be nice to not have a separate table. And it might be that, despite their complexity, we simply need to implement the CTE queries once as part of a new API that might include methods like `getDescendants(pid)`, `getFirstDescendant(pid)`, and `getLastDescendant(pid)`, for example. So it seems to me either approach could work.
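To make the "maintain the closure table" cost concrete, here's a sketch in Python (with `sqlite3` standing in for Postgres) of the insert logic a trigger or application code would need when a new version arrives. `add_version` is a hypothetical helper, not existing Metacat code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE version_closure (
    ancestor text NOT NULL,
    descendant text NOT NULL,
    depth integer,
    PRIMARY KEY (ancestor, descendant))""")

def add_version(conn, new_pid, obsoletes_pid=None):
    """Hypothetical helper: register new_pid, optionally as the successor
    of obsoletes_pid, by extending every ancestor path by one step."""
    # Every node is its own ancestor at depth 0.
    conn.execute("INSERT INTO version_closure VALUES (?, ?, 0)",
                 (new_pid, new_pid))
    if obsoletes_pid is not None:
        # Link new_pid to every ancestor of the version it obsoletes.
        conn.execute("""
            INSERT INTO version_closure (ancestor, descendant, depth)
            SELECT ancestor, ?, depth + 1
            FROM version_closure WHERE descendant = ?
        """, (new_pid, obsoletes_pid))

add_version(conn, "P1")
add_version(conn, "P2", "P1")
add_version(conn, "P3", "P2")

# Reproduces exactly the rows inserted by hand in the example above.
rows = conn.execute(
    "SELECT * FROM version_closure ORDER BY depth, ancestor").fetchall()
print(rows)
```

The maintenance is two INSERTs per new version, so it's bounded work, but it does have to happen everywhere versions get created.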
Other pros and cons? Do we need new API methods like I described, or do we just need this version chain accessible from Solr somehow and facetable? Let's discuss.
To move this discussion along a bit, since this functionality is becoming even more important, I'll suggest a modified version of Matt's proposal for a new API. I definitely think we could use an API independent of the Solr index, but we could also populate the index as well. Given the usability issues with names like `ascendant`, `descendant`, `antecedent`, `ancestor`, `predecessor`, `successor`, etc., I'm wondering about the following API (hopefully fairly intuitive):
Given an obsolescence chain with an ordered `pid` list of `A, B, C, D, E, F, G` (for these examples, `id = "D"`):
`listVersions(id) : VersionList` - List all versions (antecedents and descendants) given the `id` (`pid` or `sid`), where `VersionList` looks something like this in XML:
```xml
<versions referenceId="D">
  <identifier>A</identifier>
  <identifier>B</identifier>
  <identifier>C</identifier>
  <identifier>D</identifier>
  <identifier>E</identifier>
  <identifier>F</identifier>
  <identifier>G</identifier>
</versions>
```
which could also be represented in JSON as something like:
```json
{
  "referenceId": "D",
  "versions": ["A", "B", "C", "D", "E", "F", "G"]
}
```
`listPriorVersions(id) : VersionList` - List all prior versions (antecedents) of the given `id`:
```xml
<versions referenceId="D">
  <identifier>A</identifier>
  <identifier>B</identifier>
  <identifier>C</identifier>
</versions>
```
`listSubsequentVersions(id) : VersionList` - List all subsequent versions (descendants) of the given `id`:
```xml
<versions referenceId="D">
  <identifier>E</identifier>
  <identifier>F</identifier>
  <identifier>G</identifier>
</versions>
```
These methods could potentially be reduced to a single call like:
`listVersions(id, range="prior|subsequent") : VersionList` or `listVersions(id, range="prior|subsequent", count=5) : VersionList` (`count` could limit how many are returned prior to, subsequent to, or on both sides of `id`). (For these examples, `id = "D"`.)
`getFirstVersion(id) : Identifier` - Get the first version (antecedent) given the `id` (`pid` or `sid`):

```xml
<identifier>A</identifier> <!-- this would be a DataONE Types.Identifier -->
```
`getPriorVersion(id) : Identifier` - Get the prior version (antecedent) given the `id` (`pid` or `sid`):

```xml
<identifier>C</identifier>
```
`getSubsequentVersion(id) : Identifier` - Get the subsequent version (descendant) given the `id` (`pid` or `sid`):

```xml
<identifier>E</identifier>
```
`getLastVersion(id) : Identifier` - Get the last version (descendant) given the `id` (`pid` or `sid`):

```xml
<identifier>G</identifier>
```
These methods could potentially be reduced to:

`getVersion(id, position="first|prior|subsequent|last") : Identifier`
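As a sketch of the proposed semantics (pure illustration over a hard-coded chain; `get_version` mirrors the proposed method and is not an existing DataONE API):

```python
# Ordered obsolescence chain from the examples above, oldest first.
CHAIN = ["A", "B", "C", "D", "E", "F", "G"]

def get_version(chain, ref_id, position):
    """position is one of "first" | "prior" | "subsequent" | "last"."""
    i = chain.index(ref_id)
    if position == "first":
        return chain[0]
    if position == "last":
        return chain[-1]
    if position == "prior":
        return chain[i - 1] if i > 0 else None
    if position == "subsequent":
        return chain[i + 1] if i < len(chain) - 1 else None
    raise ValueError(f"unknown position: {position}")

print(get_version(CHAIN, "D", "first"))       # A
print(get_version(CHAIN, "D", "prior"))       # C
print(get_version(CHAIN, "D", "subsequent"))  # E
print(get_version(CHAIN, "D", "last"))        # G
```

One open question this sketch surfaces: what should `prior` on the first version (or `subsequent` on the last) return, a null identifier or an error?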
So, this warrants more thought. I'd love to hear what resonates or doesn't. I'm already wondering if `subsequent` should be `next` in these calls. 🤔
@chris the reductionist versions of these look good to me:

`listVersions(id, range="prior|subsequent", count=5) : VersionList`
`getVersion(id, position="first|prior|subsequent|last") : Identifier`
It's reasonable to follow the method-name convention of `getXXX` returning a single value and `listXXX` returning multiple values, as that is mostly the convention followed by the DataONE API, `MNCore.getLogRecords()` being an exception.
If this naming convention isn't followed, then these could be reduced to a single method, with the default being to return all values if `range` and `count` are not specified:

`listVersions(id)` : returns the entire chain

Thanks @csjx, this looks like a great start. I think what would be useful in evaluating it is to consider what use cases we want these APIs to solve. Can we develop those out? Here are a few to get started:
`listVersions(id)` on each landing page load:

| PID | versionGroup | version | formatId | dateUpdated |
|-----|--------------|---------|----------|-------------|
| P1  | group1       | 1       | text/csv | 2020-01-01  |
| P2  | group1       | 2       | text/csv | 2020-02-01  |
| P3  | group1       | 3       | text/csv | 2020-03-01  |
| P4  | group2       | 1       | text/csv | 2019-06-01  |
| P5  | group2       | 2       | text/csv | 2019-07-01  |
Another use case:
Using clients in Java, R, and Python, we have needed to clone one or more objects from one member node to another, along with all prior versions. In some cases, we need to migrate all content from one member node to another.
This usually involves repeated calls to `MN.listObjects()`. For each object, we need to get the first version, and have traditionally traversed down the version chain with repeated calls to `MN.getSystemMetadata(pid)` to read the `obsoletes` property.

In this use case, we would make one call to `MN.getVersion(pid, "first")` (or `MN.listVersions(pid, "first")` if we decide on a single API call). We are then able to start migrating the objects with calls to `MN.get(pid)` and `MN.getSystemMetadata(pid)`.
@rushirajnenuji - Will you add in the Metrics use cases that @mbjones mentioned above?
Since a version chain can be infinitely long, it seems to me like we'd want full pagination support. This'd mean having a `start` or `offset` argument to pair with `count` on `listVersions`, essentially the same thing `listObjects` already has.
And here's a use case we currently have that could be improved by this API change:
Use case: Forwarding client requests to the latest version

Clients include MetacatUI, R, etc. Current implementations walk the version chain forward with repeated `getSystemMetadata` calls, and this API would reduce the number of requests the client needs to send from n to one.
Edit: I see this is basically https://github.com/NCEAS/metacatui/issues/1400.
Version info for generating metrics.
Use case 1: For the Metrics Service, we'll want a list of identifiers for the current version, as well as all the identifiers from the previous versions of that dataset. We then aggregate the total counts of `read` events across this list to generate metrics.
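That aggregation step amounts to summing per-identifier event counts over the list a `listVersions`-style call would return. A few-line sketch (the counts and identifiers here are made up for illustration):

```python
# Hypothetical read-event counts per PID (made-up numbers).
read_counts = {"P1": 120, "P2": 45, "P3": 10}

# With a listVersions-style API, the Metrics Service would receive the
# full identifier list in one call and aggregate over it directly.
versions = ["P1", "P2", "P3"]  # current version plus all prior versions
total_reads = sum(read_counts.get(pid, 0) for pid in versions)
print(total_reads)  # 175
```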
Use case 2: DataONE aggregated metrics: portal metrics and user profile metrics. For this, we follow the same procedure as use case 1, but for the entire collection of datasets.
We discussed this on our Sep 10, 2020 dev call and continued in a break out afterwards for a bit. I'm commenting here to bring some of that discussion back here for visibility.
From looking at our use cases, we found we really have three categories:
(1) is different from (2) and (3), while (2) and (3) are alike in that they're about the idea of virtual identifiers. (3) is complicated enough that we might stick to tackling just (1) and possibly (2) for now.
Where is this information available to clients? None of the use cases are real-time enough that we strictly need to serve the information directly from the database, which means Solr could be the place where this information lives, and we don't necessarily need to expose any new APIs here.
Is version information always public, or restricted to those who can `read`? If we exposed an API like `listVersions` above, would it include versions the client can't read in its response? If not, handling the return value gets tricky because of the gaps. If we store a virtual identifier as another Solr field, object-visibility rules would apply by default.
What can we solve?
Another way to describe (1) is that clients are currently walking sysmeta via the `obsoletes` and `obsoletedBy` properties. This is slow, foremost, but also limiting (as seen in https://github.com/NCEAS/metacatui/issues/1400) because that walking process stops if the client doesn't have access to a version in the middle of the chain. `listVersions` solves this because clients can make one additional request to `listVersions` to get the info they need.

Another way to describe (2) and (3) is that external services such as Metadig & Metrics are doing their own version chain crawl (use case group 1) and creating virtual identifiers for object series (use case group 2). `listVersions` helps them with crawling version chains, but they still need to generate virtual identifiers.

Steps forward
Ignoring how for now, we have two things we can do:
A. Implement a versions API
B. Implement a virtual (or even a real) identifier for every object version series
(A) solves some of our problems and, IMO, is valuable in its own right whether it solves our problems or not. And it's the simpler/easier thing to do. (B) solves more/most of our problems but is a bit more involved though still tractable.
My vote at this point is for starting with a versions API (A) implemented with recursive CTEs. This avoids major changes to the database (which increases our time to implement and impacts all users when we/they upgrade), appears to be plenty performant, and can have its implementation swapped out at a later date if we implement some form of virtual identifier.
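To gauge the feasibility of that route, here's a sketch of what a CTE-backed `listVersions` might look like internally, with Python's `sqlite3` standing in for Metacat's Postgres. `list_versions` is a hypothetical helper, and the two-directional walk (backward via `obsoletes`, forward via `obsoleted_by`) is my assumption about the implementation, not existing Metacat code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sysmeta (guid TEXT PRIMARY KEY, obsoleted_by TEXT, obsoletes TEXT)"
)
# Build the A -> ... -> G chain from the API examples above.
chain = ["A", "B", "C", "D", "E", "F", "G"]
for i, pid in enumerate(chain):
    conn.execute("INSERT INTO sysmeta VALUES (?, ?, ?)",
                 (pid,
                  chain[i + 1] if i < len(chain) - 1 else None,
                  chain[i - 1] if i > 0 else None))

def list_versions(conn, ref_id):
    """Walk backward via `obsoletes`, then forward via `obsoleted_by`,
    returning the full chain oldest-first (sketch, not Metacat code)."""
    backward = conn.execute("""
        WITH RECURSIVE prior(guid, n) AS (
            SELECT ?, 0
            UNION
            SELECT s.obsoletes, p.n + 1 FROM sysmeta s, prior p
            WHERE s.guid = p.guid AND s.obsoletes IS NOT NULL
        )
        SELECT guid FROM prior ORDER BY n DESC
    """, (ref_id,)).fetchall()
    forward = conn.execute("""
        WITH RECURSIVE later(guid, n) AS (
            SELECT ?, 0
            UNION
            SELECT s.obsoleted_by, l.n + 1 FROM sysmeta s, later l
            WHERE s.guid = l.guid AND s.obsoleted_by IS NOT NULL
        )
        SELECT guid FROM later WHERE n > 0 ORDER BY n
    """, (ref_id,)).fetchall()
    return [r[0] for r in backward] + [r[0] for r in forward]

print(list_versions(conn, "D"))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```

Two CTE executions per call, no new tables, and the function boundary means we could later swap the internals for a closure-table or virtual-identifier lookup without changing callers.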
Please have a read and let me know if I've represented your case accurately, missed any glaring upsides/downsides, etc.
@taojing2002 @artntek Let's please discuss this version chain API request wrt upcoming Metacat releases. If we take the CTE approach as @amoeba concludes, then this could be done with the existing data tables and may not take too much time. I'd like to use it for both the MetaDIG and Metrics services.
Our current approach to linking versions of objects in DataONE and Metacat is to provide a pointer in the SystemMetadata for each object that points at the objects that it `obsoletes` and those that are `obsoletedBy` it. This represents a doubly-linked list that can be traversed to reconstruct version information. The `obsoletes` and `obsoletedBy` fields are in the Metacat postgres database as columns in the system_metadata table, with the abbreviated structure: `guid, obsoletes, obsoleted_by`
From a hierarchical data modeling perspective, this structure is called an adjacency list, and it allows one to trace the version chain through a series of queries, each of which walks the version chain either forwards or backwards. Without support for hierarchical queries (which are proprietary and perform poorly), this generally means issuing `n-1` queries, where `n` is the length of the version chain.

Much has been written about querying hierarchical data. Some good background reading is:
From the first set of slides, and other reading, I have concluded that closure tables are a much faster and more efficient way to store and query this hierarchical data. It involves creating a new table for just the hierarchical information, with one row for each link in the version chain, including links at each level of the hierarchy. This table structure allows a single query to retrieve all parent and child information in the tree, so it is extremely fast and only marginally more expensive in terms of storage. The articles linked above explain this much more thoroughly than I will here, but the essence is that the closure table contains the following columns: `parent, child, depth`
A simple query to get all of the versions for a given object would then just select every `child` row for a given `parent`, ordered by `depth`.
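A runnable sketch of that single query, using Python's `sqlite3` and assuming the `parent, child, depth` columns just described (the `closure` table name here is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Closure table with the parent, child, depth columns described above,
# populated for a three-version chain P1 -> P2 -> P3.
conn.execute(
    "CREATE TABLE closure (parent TEXT, child TEXT, depth INTEGER, "
    "PRIMARY KEY (parent, child))"
)
conn.executemany("INSERT INTO closure VALUES (?, ?, ?)", [
    ("P1", "P1", 0), ("P2", "P2", 0), ("P3", "P3", 0),
    ("P1", "P2", 1), ("P2", "P3", 1), ("P1", "P3", 2),
])

# One query retrieves every version descending from P1, oldest first.
versions = [r[0] for r in conn.execute(
    "SELECT child FROM closure WHERE parent = 'P1' ORDER BY depth")]
print(versions)  # ['P1', 'P2', 'P3']
```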
We'll need to develop out a more mature design, and an API to go along with this, but I think closure tables will allow us to provide efficient version information in our services with a new API. Below I attach my notes on the design and use of closure tables for reference.
closure-tables-notes.pdf
Comments and expansion of ideas appreciated. @csjx, @amoeba, @gothub, @taojing2002 @laurenwalker