Define DynamoDB schema

briri commented 2 years ago

Define the new schema for our DynamoDB

[x] Review existing JSON stored in the DMPHub (e.g. https://dmphub.cdlib.org/dmps/doi:10.48321/D1CW23.json) and determine which attributes we should retain (e.g. we might not care about the :dmproadmap_template)
[x] Review query needs based on existing and anticipated use cases (e.g. all DMPs for a specific research field station)
[x] Define the partition key and local secondary indices
[x] Define any global secondary indices

briri commented 2 years ago

Table name: dmphub

The most common use pattern will be: -- fetch the latest version --

Partition key: dmp_id     ------
                                 | --- Primary key (also the current version)
Sort key: latest_version  ------

local sort keys: modified RANGE (versions)

mariapraetzellis commented 2 years ago

Here's a list of some searches we'd want. I'm sure there will be additional needs, but this is a start anyhow.

[x] A list of data deposits and publications
[x] Individuals and their contact info
[x] List of all funded projects; list of funded projects by the funder
[x] All searches should include grant start/end dates & grant IDs & title & PI name & funder

briri commented 2 years ago

Some others that will assist our API functions:

[x] Give me all the DMPs that were modified between X and Y dates
[x] Give me the most recent version for a DMP ID
[x] Give me all the versions of a DMP ID
[x] Give me all the DMPs created by X (where X can be a user (via the new manual upload form) or a system like DMPTool)
[x] Find a DMP by an external system's identifier, for example https://dmptool.org/plans/12345.pdf (when trying to prevent duplicates)

briri commented 2 years ago

Based on the above needs and reading on best practices, I have decided to go with a single-table polymorphic design.

Partition and Sort keys

The following outlines our Partition Key (PK) and Sort Key (SK) for the table. This will provide fast, reliable access for our primary use cases:

Resolving DMP ID requests to JSON or an HTML landing page (fetch the latest version of the DMP)
Versioning support: the latest version will always be tagged 'latest' and prior versions will be tagged with their timestamp. The Lambdas will make use of a Ruby gem that helps manage the 'latest' version indicator.
Fetching information about provenance (e.g. DMPTool, or Jane Doe, who created her DMP via the upload form)
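
To make the 'latest' convention concrete, here is a minimal Python sketch of how a new version could be stored. It is not the Ruby gem mentioned above. The table is modelled as an in-memory dict keyed by (PK, SK), and it assumes that the timestamp used for a historical SK comes from the item's `modified` attribute; both are illustrative assumptions.

```python
LATEST_SK = "VERSION#latest"


def save_version(table, dmp_id, item):
    """Store `item` as the latest version, archiving any previous latest.

    `table` is a plain dict keyed by (PK, SK) that stands in for DynamoDB.
    Assumption: the archived copy is keyed by the old item's `modified` value.
    """
    pk = f"DMP#{dmp_id}"
    current = table.get((pk, LATEST_SK))
    if current is not None:
        # Re-key the outgoing version under its own timestamp.
        archived_sk = f"VERSION#{current['modified']}"
        table[(pk, archived_sk)] = dict(current, SK=archived_sk)
    latest = dict(item, PK=pk, SK=LATEST_SK)
    table[(pk, LATEST_SK)] = latest
    return latest


table = {}
save_version(table, "doi:10.48321/D1M30K",
             {"title": "First draft", "modified": "2022-02-18T12:30:25Z"})
save_version(table, "doi:10.48321/D1M30K",
             {"title": "Second draft", "modified": "2022-03-01T08:00:00Z"})
```

After the second call the table holds two items for the DMP: the new one under `VERSION#latest` and the first one under `VERSION#2022-02-18T12:30:25Z`. Against a real table the two writes would probably need to happen in one transaction so a reader never sees a missing 'latest'.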

Some examples of these keys:

PK - PROVENANCE#dmptool,      SK - PROFILE                       <--- Provenance info
PK - PROVENANCE#dmptool,      SK - DMPS                          <--- Array of DMP PKs for the Provenance
PK - DMP#doi:10.48321/D1M30K, SK - VERSION#latest                <--- Latest version
PK - DMP#doi:10.48321/D1M30K, SK - VERSION#2022-02-18T12:30:25Z  <--- Historical version
PK - PERSON#[orcid],          SK - DMPS                          <--- Array of DMP PKs for the Person
PK - AFFILIATION#[id],        SK - DMPS                          <--- Array of DMP PKs for the Affiliation / Funder
PK - RELATED#[:id],           SK - DMPS                          <--- Array of DMP PKs for the Related Identifier / Grant
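
As a rough illustration of how these keys could be used, the Python sketch below builds the request parameters for two of the access patterns: fetching the latest version and listing every version of a DMP. It assumes the key attributes are literally named `PK` and `SK` and that the table is called `dmphub`; the helper names are hypothetical. The dicts follow the low-level DynamoDB request shape (`Key`, `KeyConditionExpression`, `ExpressionAttributeValues`) and are only built here, not sent.

```python
def dmp_pk(dmp_id):
    """Partition key for a DMP, e.g. DMP#doi:10.48321/D1M30K."""
    return f"DMP#{dmp_id}"


def version_sk(timestamp=None):
    """Sort key for a version; no timestamp means the latest version."""
    return f"VERSION#{timestamp or 'latest'}"


def latest_version_request(dmp_id, table_name="dmphub"):
    """Parameters for a GetItem call that fetches the latest version."""
    return {
        "TableName": table_name,
        "Key": {"PK": {"S": dmp_pk(dmp_id)}, "SK": {"S": version_sk()}},
    }


def all_versions_request(dmp_id, table_name="dmphub"):
    """Parameters for a Query call that returns every VERSION# item."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": dmp_pk(dmp_id)},
            ":prefix": {"S": "VERSION#"},
        },
    }


get_params = latest_version_request("doi:10.48321/D1M30K")
query_params = all_versions_request("doi:10.48321/D1M30K")
```

Note that the prefix query also matches `VERSION#latest`, so the caller would need to decide whether to treat that item as a duplicate of the newest historical entry.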

When a DMP is deleted, it is retained in the DB but it is marked as tombstoned, so that we can still service the DMP ID resolution with a response. These look like: PK - DMP#doi:10.12345/A1B2C3, SK - VERSION#tombstone
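
A sketch of how tombstoning and resolution might fit together, again in Python against an in-memory dict keyed by (PK, SK). Several details are assumptions rather than decisions made in this issue: that deleting re-keys the 'latest' item to `VERSION#tombstone`, that `dmphub_deleted_at` is set at that moment, and that the resolver answers with HTTP-style status codes 200, 410 and 404.

```python
LATEST = "VERSION#latest"
TOMBSTONE = "VERSION#tombstone"


def tombstone_dmp(table, dmp_id, deleted_at):
    """Re-key the latest version as a tombstone instead of deleting it."""
    pk = f"DMP#{dmp_id}"
    item = table.pop((pk, LATEST), None)
    if item is None:
        return None  # nothing to tombstone
    item = dict(item, SK=TOMBSTONE, dmphub_deleted_at=deleted_at)
    table[(pk, TOMBSTONE)] = item
    return item


def resolve_dmp(table, dmp_id):
    """Return (status, item): 200 if live, 410 if tombstoned, 404 if unknown."""
    pk = f"DMP#{dmp_id}"
    latest = table.get((pk, LATEST))
    if latest is not None:
        return 200, latest
    tombstone = table.get((pk, TOMBSTONE))
    if tombstone is not None:
        return 410, tombstone
    return 404, None


table = {
    ("DMP#doi:10.12345/A1B2C3", LATEST): {
        "PK": "DMP#doi:10.12345/A1B2C3", "SK": LATEST, "title": "Example DMP",
    }
}
before = resolve_dmp(table, "doi:10.12345/A1B2C3")
tombstone_dmp(table, "doi:10.12345/A1B2C3", "2022-05-01T00:00:00Z")
after = resolve_dmp(table, "doi:10.12345/A1B2C3")
```

Keeping the tombstone under the same PK means the resolver needs at most two key lookups and never a scan.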

Item structure

The base table will house the following for our 2 types of items

Provenance

In the case of an external system, we will store some basic profile info: the Cognito user pool ID, name, description, redirect_uri, etc. In the case of a user, we will store some basic profile info: the Cognito user pool ID, name, email and ORCID.

This info will be used to communicate funding and related identifier assertions made about their DMPs. It will also assist us with any future user or administrative dashboard development.

DMP

In the case of a DMP, we will store the RDA Common Standard JSON along with a few additional attributes specific to the DMPHub. These additional fields will not be included in any API calls; they are simply meant to facilitate the functionality of the system. These fields are:

dmphub_provenance_id                <-- the PK of the provenance (e.g. `PROVENANCE#12AB34C`)
dmphub_modification_day             <-- a shortened version of the modification timestamp provided by provenance (e.g. `2022-04-28`)
dmphub_created_at                   <-- timestamp of when the item was added
dmphub_updated_at                   <-- timestamp of when the item was updated
dmphub_deleted_at                   <-- timestamp of when the item was 'removed'
dmphub_narrative_location           <-- URI to the S3 location for the DMP document
dmphub_debug                        <-- will output more detailed info to the CloudWatch logs
dmphub_provenance_identifier        <-- the provenance system's original :dmp_id sent when creating the DMP
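
A small Python sketch of two things implied by the list above: deriving `dmphub_modification_day` from a timestamp, and removing the DMPHub-specific attributes before an item leaves the API. It assumes timestamps are ISO 8601 strings and that every internal attribute carries the `dmphub_` prefix; dropping the `PK` and `SK` attributes as well is a further assumption, and both function names are hypothetical.

```python
def modification_day(timestamp):
    """Shorten an ISO 8601 timestamp (2022-04-28T09:15:00Z) to 2022-04-28."""
    return timestamp[:10]


def public_view(item):
    """Copy of the item without table keys or dmphub_ bookkeeping fields."""
    return {
        name: value
        for name, value in item.items()
        if name not in ("PK", "SK") and not name.startswith("dmphub_")
    }


stored = {
    "PK": "DMP#doi:10.48321/D1M30K",
    "SK": "VERSION#latest",
    "title": "Example DMP",
    "modified": "2022-04-28T09:15:00Z",
    "dmphub_provenance_id": "PROVENANCE#12AB34C",
    "dmphub_modification_day": modification_day("2022-04-28T09:15:00Z"),
}
response_body = public_view(stored)
```

Filtering by prefix rather than by an explicit list means a newly added `dmphub_` attribute is hidden by default.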

Local secondary indexes

Local secondary indexes (LSI) use the table's partition key but allow you to specify a different sort key. I don't see a need for any at this point in time. The version SK is sufficient.

Global secondary indexes

Global secondary indexes (GSI) allow us to search the table from different angles. GSIs are updated in near real time as changes are made to the base table. They allow you to store a subset of the base table item's content to keep things small and to reduce the amount of write activity as the base item changes.

We will use the following initial GSIs:

modification_day_gsi               <-- fetch all DMPs for a time period (returns: DMP PKs, title, contact, dmphub_affiliation_ids) 
dmphub_provenance_identifier_gsi   <-- fetch the DMP by the provenance system's local id for the DMP

These GSIs return a subset of information that should be enough for the Lambda or user to select DMP PKs, which can then be used to fetch the latest version of each DMP.
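
For reference, here is one way the table and its two GSIs could be written down as the parameters of a DynamoDB CreateTable request, expressed as a Python dict. This is a sketch, not the deployed configuration. The key attribute names `PK` and `SK`, the GSI key choices, the projection types and the on-demand billing mode are all assumptions; only the table name, the index names and the projected attributes come from the notes above. The second function shows how a "modified between X and Y" search could fan out into one query per day against `modification_day_gsi`.

```python
from datetime import date, timedelta

TABLE_DEFINITION = {
    "TableName": "dmphub",
    "BillingMode": "PAY_PER_REQUEST",  # assumption
    "AttributeDefinitions": [
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "dmphub_modification_day", "AttributeType": "S"},
        {"AttributeName": "dmphub_provenance_identifier", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "PK", "KeyType": "HASH"},
        {"AttributeName": "SK", "KeyType": "RANGE"},
    ],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "modification_day_gsi",
            "KeySchema": [
                {"AttributeName": "dmphub_modification_day", "KeyType": "HASH"},
                {"AttributeName": "PK", "KeyType": "RANGE"},  # assumption
            ],
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["title", "contact", "dmphub_affiliation_ids"],
            },
        },
        {
            "IndexName": "dmphub_provenance_identifier_gsi",
            "KeySchema": [
                {"AttributeName": "dmphub_provenance_identifier", "KeyType": "HASH"},
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},  # assumption
        },
    ],
}


def days_between(start, end):
    """Yield each day from start to end (inclusive) as YYYY-MM-DD."""
    day = start
    while day <= end:
        yield day.isoformat()
        day += timedelta(days=1)


def modified_between_requests(start, end):
    """One Query parameter set per day, targeting modification_day_gsi."""
    return [
        {
            "TableName": "dmphub",
            "IndexName": "modification_day_gsi",
            "KeyConditionExpression": "dmphub_modification_day = :day",
            "ExpressionAttributeValues": {":day": {"S": day}},
        }
        for day in days_between(start, end)
    ]


requests = modified_between_requests(date(2022, 4, 27), date(2022, 4, 29))
```

Because a GSI partition key only supports equality conditions, a date range has to be expressed as one query per day (or the index needs a different key layout), which is the trade-off of keying on `dmphub_modification_day`.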

briri commented 2 years ago

Done with initial setup. Will create a separate issue to review the CF config more thoroughly

CDLUC3 / dmphub

Define DynamoDB schema #72
