CDLUC3 / dmphub

Simple metadata repository for networked DMPs
MIT License

Define DynamoDB schema #72

Closed (briri closed 2 years ago)

briri commented 2 years ago

Define the new schema for our DynamoDB

briri commented 2 years ago

Table name: dmphub

The most common access pattern will be fetching the latest version of a DMP:

Partition key: dmp_id     ------
                                 | --- Primary key (also the current version)
Sort key: latest_version  ------

local sort keys: modified RANGE (versions)

mariapraetzellis commented 2 years ago

Here's a list of some searches we'd want. I'm sure there will be additional needs, but this is a start anyhow.

briri commented 2 years ago

Some others that will assist our API functions:

briri commented 2 years ago

Based on the above needs and some reading on best practices, I have decided to go with a single-table polymorphic design.

Partition and Sort keys

The following outlines our Partition Key (PK) and Sort Key (SK) for the table. These will provide fast, reliable access for our primary use cases:

Some examples of these keys:

PK - PROVENANCE#dmptool,      SK - PROFILE                       <--- Provenance info
PK - PROVENANCE#dmptool,      SK - DMPS                          <--- Array of DMP PKs for the Provenance
PK - DMP#doi:10.48321/D1M30K, SK - VERSION#latest                <--- Latest version
PK - DMP#doi:10.48321/D1M30K, SK - VERSION#2022-02-18T12:30:25Z  <--- Historical version
PK - PERSON#[orcid],          SK - DMPS                          <--- Array of DMP PKs for the Person
PK - AFFILIATION#[id],        SK - DMPS                          <--- Array of DMP PKs for the Affiliation / Funder
PK - RELATED#[:id],           SK - DMPS                          <--- Array of DMP PKs for the Related Identifier / Grant
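
The key patterns above can be sketched as a few small helpers. This is only an illustration in Python, not code from the DMPHub: the function names are hypothetical, and the only things taken from the examples are the `PREFIX#value` layout and the timestamp format of the historical version key.

```python
from datetime import datetime, timezone
from typing import Optional


def provenance_pk(name: str) -> str:
    """Partition key for a provenance, e.g. PROVENANCE#dmptool."""
    return f"PROVENANCE#{name}"


def dmp_pk(dmp_id: str) -> str:
    """Partition key for a DMP, e.g. DMP#doi:10.48321/D1M30K."""
    return f"DMP#{dmp_id}"


def version_sk(modified: Optional[datetime] = None) -> str:
    """Sort key for a DMP version.

    With no timestamp this is the latest version; otherwise it is a
    historical version stamped in UTC. Pass a timezone-aware datetime.
    """
    if modified is None:
        return "VERSION#latest"
    stamp = modified.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"VERSION#{stamp}"


if __name__ == "__main__":
    print(dmp_pk("doi:10.48321/D1M30K"), version_sk())
    print(
        dmp_pk("doi:10.48321/D1M30K"),
        version_sk(datetime(2022, 2, 18, 12, 30, 25, tzinfo=timezone.utc)),
    )
```

Because ISO 8601 timestamps sort lexicographically, a range query on the `VERSION#` prefix would return historical versions in chronological order.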

When a DMP is deleted, it is retained in the DB but marked as tombstoned, so that we can still respond to DMP ID resolution requests. These look like:

PK - DMP#doi:10.12345/A1B2C3, SK - VERSION#tombstone
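
A rough sketch of how DMP ID resolution might handle tombstones, using an in-memory dict in place of the table. The `resolve` function, the status strings, and the item bodies are all hypothetical, and it assumes that a tombstoned DMP no longer has a `VERSION#latest` item.

```python
from typing import Optional, Tuple

# In-memory stand-in for the table, keyed by (PK, SK).
# The item bodies are invented for illustration.
TABLE = {
    ("DMP#doi:10.48321/D1M30K", "VERSION#latest"): {"title": "Example DMP"},
    ("DMP#doi:10.12345/A1B2C3", "VERSION#tombstone"): {"title": "Removed DMP"},
}


def resolve(dmp_id: str, table: dict = TABLE) -> Tuple[str, Optional[dict]]:
    """Return (status, item) for a DMP ID.

    Looks for the latest version first, then falls back to the tombstone
    so that a deleted DMP still produces a response.
    """
    pk = f"DMP#{dmp_id}"
    latest = table.get((pk, "VERSION#latest"))
    if latest is not None:
        return "active", latest
    tombstone = table.get((pk, "VERSION#tombstone"))
    if tombstone is not None:
        return "tombstoned", tombstone
    return "not_found", None


if __name__ == "__main__":
    for dmp_id in ("doi:10.48321/D1M30K", "doi:10.12345/A1B2C3", "doi:10.0000/NONE"):
        print(dmp_id, resolve(dmp_id)[0])
```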

Item structure

The base table will house the following two types of items:

Provenance

For an external system, we will store some basic profile info: the Cognito user pool id, name, description, redirect_uri, etc. For a user, we will store the Cognito user pool id, name, email and ORCID.

This info will be used to communicate funding and related identifier assertions made about their DMPs. It will also assist us with any future user or administrative dashboard development.

DMP

In the case of a DMP, we will store the RDA Common Standard JSON along with a few additional attributes specific to the DMPHub. These additional fields will not be included in any API calls; they are simply meant to facilitate the functionality of the system. These fields are:

dmphub_provenance_id                <-- the PK of the provenance (e.g. `PROVENANCE#12AB34C`)
dmphub_modification_day             <-- a shortened version of the modification timestamp provided by provenance (e.g. `2022-04-28`)
dmphub_created_at                   <-- timestamp of when the item was added
dmphub_updated_at                   <-- timestamp of when the item was updated
dmphub_deleted_at                   <-- timestamp of when the item was 'removed'
dmphub_narrative_location           <-- URI to the S3 location for the DMP document
dmphub_debug                        <-- will output more detailed info to the CloudWatch logs
dmphub_provenance_identifier        <-- the provenance system's original :dmp_id sent when creating the DMP
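
As a sketch of how these attributes could be handled, the helpers below derive `dmphub_modification_day` from a timestamp and strip the internal fields from an item before it is returned. The function names are hypothetical, and two things are assumed rather than taken from this issue: that timestamps use the `2022-04-28T09:15:00Z` format, and that the `dmphub_` attributes sit at the top level of the item.

```python
from datetime import datetime

INTERNAL_PREFIX = "dmphub_"


def modification_day(modified: str) -> str:
    """Shorten a timestamp like 2022-04-28T09:15:00Z to 2022-04-28."""
    return datetime.strptime(modified, "%Y-%m-%dT%H:%M:%SZ").date().isoformat()


def public_view(item: dict) -> dict:
    """Drop top-level DMPHub-only attributes, leaving the rest of the item."""
    return {k: v for k, v in item.items() if not k.startswith(INTERNAL_PREFIX)}


if __name__ == "__main__":
    # Invented item for illustration only.
    item = {
        "title": "Example DMP",
        "modified": "2022-04-28T09:15:00Z",
        "dmphub_provenance_id": "PROVENANCE#12AB34C",
    }
    item["dmphub_modification_day"] = modification_day(item["modified"])
    print(item["dmphub_modification_day"])
    print(public_view(item))
```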

Local secondary indexes

Local secondary indexes (LSI) use the same partition key as the base table but allow you to specify a different sort key. I don't see a need for any at this point in time. The version SK is sufficient.

Global secondary indexes

Global secondary indexes (GSI) allow us to search the table from different angles. GSIs are updated in near real time as changes are made to the base table. They allow you to store a subset of the base table item's content to keep things small and to reduce the amount of write activity as the base item changes.

We will use the following initial GSIs:

modification_day_gsi               <-- fetch all DMPs for a time period (returns: DMP PKs, title, contact, dmphub_affiliation_ids) 
dmphub_provenance_identifier_gsi   <-- fetch the DMP by the provenance system's local id for the DMP

These GSIs return a subset of information that should be enough for the Lambda or user to narrow down the DMP PKs, which can then be used to fetch the latest version of each DMP.
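
For illustration, the table and its two GSIs might be described with parameters shaped like the ones boto3's DynamoDB `create_table` call accepts. Nothing here is the actual CloudFormation config: the attribute names `PK` and `SK`, the choice of index keys, the projection types, and the billing mode are all assumptions. Only the table name, the index names, and the attributes listed for `modification_day_gsi` come from this issue.

```python
# Parameters in the shape accepted by boto3's DynamoDB create_table call.
# Key names, index key choices, projections and billing mode are assumptions.
TABLE_PARAMS = {
    "TableName": "dmphub",
    "BillingMode": "PAY_PER_REQUEST",
    "AttributeDefinitions": [
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "dmphub_modification_day", "AttributeType": "S"},
        {"AttributeName": "dmphub_provenance_identifier", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "PK", "KeyType": "HASH"},
        {"AttributeName": "SK", "KeyType": "RANGE"},
    ],
    "GlobalSecondaryIndexes": [
        {
            # One partition per day; a time period means one query per day.
            "IndexName": "modification_day_gsi",
            "KeySchema": [
                {"AttributeName": "dmphub_modification_day", "KeyType": "HASH"},
                {"AttributeName": "PK", "KeyType": "RANGE"},
            ],
            # The base table keys are always projected into a GSI.
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["title", "contact", "dmphub_affiliation_ids"],
            },
        },
        {
            "IndexName": "dmphub_provenance_identifier_gsi",
            "KeySchema": [
                {"AttributeName": "dmphub_provenance_identifier", "KeyType": "HASH"},
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
        },
    ],
}

if __name__ == "__main__":
    for gsi in TABLE_PARAMS["GlobalSecondaryIndexes"]:
        keys = [k["AttributeName"] for k in gsi["KeySchema"]]
        print(gsi["IndexName"], keys, gsi["Projection"]["ProjectionType"])
```

Keeping the projections small means fewer index writes when the full DMP JSON changes, at the cost of a second read against the base table to fetch the latest version.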

briri commented 2 years ago

Done with the initial setup. Will create a separate issue to review the CF config more thoroughly.