Closed silverdaz closed 6 years ago
About harmonizing the data-in and data-out database schema, there is a better approach: Database Views.
It allows us to have whatever database backend we want, with whatever schema we fancy, and then we use 2 views, one for data-in, one for data-out.
That way, there is no need to harmonize the code to have given table names, column names, etc... In other word: Those views are an interface to the database.
Here is an example:
CREATE TABLE local_ega (
id SERIAL, PRIMARY KEY(id), UNIQUE (id),
status status,
-- Original file
elixir_id TEXT NOT NULL,
inbox_path TEXT NOT NULL,
-- EGA file ids
stable_id TEXT,
-- Vault information
vault_path TEXT,
vault_filesize INTEGER,
vault_display_name TEXT, -- no idea what that is
-- Crypt4GH header
header TEXT,
-- Useless checksums when using Crypt4GH
checksum VARCHAR(32),
checksum_type checksum_algorithm,
unencrypted_checksum VARCHAR(32),
unencrypted_checksum_type checksum_algorithm,
-- Status
status status NOT NULL,
-- Audit / Logs
created_by TEXT,
last_modified_by TEXT,
created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp(),
last_modified TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
);
-- WITH (OIDS=FALSE);
CREATE UNIQUE INDEX file_id_idx ON local_ega (id);
-- ################## VIEWS #########################
-- Used by data-in
CREATE VIEW files AS
SELECT id, elixir_id, inbox_path, status, vault_path, vault_filesize, stable_id, header, created_at, last_modified
FROM local_ega;
-- Used by data-out from EBI
CREATE VIEW dev_ega_file AS
SELECT id AS file_id,
stable_id AS file_name,
vault_path AS file_path,
display_file_name,
vault_filesize AS file_size,
checksum, checksum_type, unencrypted_checksum, unencrypted_checksum_type,
status AS file_status,
created_by, last_updated_by, header,
created_at AS created
last_modified AS last_updated
FROM local_ega;
Note that we can add a WHERE
-clause if we want to disallow accessing some files.
Or here is another idea:
Create a view for (local) EGA files and another view for BioBank files (using yet another where
-clause)
Related issue ? https://github.com/EGA-archive/ega-data-api/issues/10
ok, this seems to do it:
\connect lega
SET TIME ZONE 'UTC';
CREATE TYPE checksum_algorithm AS ENUM ('MD5', 'SHA256', 'SHA384', 'SHA512'); -- md5 is bad. Use sha*!
CREATE TYPE storage AS ENUM ('S3', 'POSIX');
-- Note: This is an enum, because that's what the "provided" database supports
-- If a site has its own database already, let them define their keyword in the ENUM
-- and use it (Notice that their code must be update to push this value into the table)
-- There is no need to agree on how each site should operate their own database
-- What we need is to document where they need to update and what.
-- ##################################################
-- FILE STATUS
-- ##################################################
CREATE TABLE status (
id INTEGER,
code VARCHAR(16) NOT NULL,
description TEXT,
-- contraints
PRIMARY KEY(id), UNIQUE (id), UNIQUE (code)
);
INSERT INTO status(id,code,description)
VALUES (10, 'INIT' , 'Initializing a file ingestion'),
(20, 'IN_INGESTION', 'Currently under ingestion'),
(30, 'ARCHIVED' , 'File moved to Vault'),
(40, 'COMPLETED' , 'File verified in Vault'),
(50, 'READY' , 'File ingested, ready for download'),
-- (60, 'IN_INDEXING', 'Currently under index creation'),
(0, 'ERROR' , 'An Error occured, check the error table')
;
-- ##################################################
-- ENCRYPTION FORMAT
-- ##################################################
CREATE TABLE vault_encryption (
mode VARCHAR(16) NOT NULL, PRIMARY KEY(mode), UNIQUE (mode),
description TEXT
);
INSERT INTO vault_encryption(mode,description)
VALUES ('CRYPT4GH' , 'Crypt4GH encryption (using version)'),
('PGP' , 'OpenPGP encryption (RFC 4880)'),
('AES' , 'AES encryption with passphrase'),
('CUSTOM1' , 'Custom method 1 for local site'),
('CUSTOM2' , 'Custom method 2 for local site')
-- ...
;
-- ##################################################
-- FILES
-- ##################################################
-- Main table with looooots of information
CREATE TABLE main (
id SERIAL, PRIMARY KEY(id), UNIQUE (id),
-- EGA file ids
stable_id TEXT,
-- Status
status VARCHAR NOT NULL REFERENCES status (code), -- No "ON DELETE CASCADE": update to the new status
-- in case the old one is deleted
-- Original/Submission file
submission_file_path TEXT NOT NULL,
submission_file_extension VARCHAR(260) NOT NULL,
submission_file_calculated_checksum VARCHAR(128),
submission_file_calculated_checksum_type checksum_algorithm,
submission_file_size INTEGER NULL,
submission_user TEXT NOT NULL, -- Elixir ID, or internal user
-- Vault information
vault_file_reference TEXT, -- file path if POSIX, object id if S3
vault_file_type storage, -- S3 or POSIX file system
vault_file_size INTEGER,
vault_file_checksum VARCHAR(128) NULL, -- NOT NULL,
vault_file_checksum_type checksum_algorithm,
-- Encryption/Decryption
encryption_method VARCHAR REFERENCES vault_encryption (mode), -- ON DELETE CASCADE,
version INTEGER , -- DEFAULT 1, -- Crypt4GH version
header TEXT, -- Crypt4GH header
session_key_checksum VARCHAR(128) NULL, -- NOT NULL, -- To check if session key already used
session_key_checksum_type checksum_algorithm,
-- Note: We can support multiple encryption. See at the end of that file.
-- Audit / Logs
created_by NAME DEFAULT CURRENT_USER, -- Postgres users
last_modified_by NAME DEFAULT CURRENT_USER, --
created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp(),
last_modified TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
)
WITH (
OIDS=FALSE
);
CREATE UNIQUE INDEX file_id_idx ON main(id);
-- ##################################################
-- ERRORS
-- ##################################################
CREATE TABLE main_errors (
id SERIAL, PRIMARY KEY(id), UNIQUE (id),
active BOOLEAN NOT NULL DEFAULT TRUE,
file_id INTEGER NOT NULL REFERENCES main(id) ON DELETE CASCADE,
hostname TEXT,
error_type TEXT NOT NULL,
msg TEXT NOT NULL,
from_user BOOLEAN DEFAULT FALSE,
occured_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
);
-- ##################################################
-- Quality Control
-- ##################################################
-- Coming soon to a cinema near you
-- ##################################################
-- Views
-- ##################################################
-- Used by data-in
CREATE VIEW files AS
SELECT id,
submission_user AS elixir_id,
submission_file_path AS inbox_path,
submission_file_size AS inbox_filesize,
submission_file_calculated_checksum AS inbox_file_checksum,
submission_file_calculated_checksum_type AS inbox_file_checksum_type,
status,
vault_file_reference AS vault_path,
vault_file_type AS vault_type,
vault_file_size AS vault_filesize,
stable_id,
header, -- Crypt4gh specific
version,
session_key_checksum,
session_key_checksum_type,
created_at,
last_modified
FROM main;
-- Insert into main
CREATE FUNCTION insert_file(inpath main.submission_file_path%TYPE,
eid main.submission_user%TYPE)
RETURNS main.id%TYPE AS $insert_file$
#variable_conflict use_column
DECLARE
file_id main.id%TYPE;
file_ext main.submission_file_extension%TYPE;
BEGIN
file_ext := substring(inpath from '\.([^\.]*)$'); -- extract extension from filename
INSERT INTO main (submission_file_path,
submission_user,
submission_file_extension,
status,
encryption_method) -- hard-code the vault_encryption
VALUES(inpath,eid,file_ext,'INIT','CRYPT4GH') RETURNING main.id
INTO file_id;
RETURN file_id;
END;
$insert_file$ LANGUAGE plpgsql;
-- Just showing the current/active errors
CREATE VIEW errors AS
SELECT id, file_id, hostname, error_type, msg, from_user, occured_at
FROM main_errors
WHERE active = TRUE;
CREATE FUNCTION insert_error(fid errors.file_id%TYPE,
h errors.hostname%TYPE,
etype errors.error_type%TYPE,
msg errors.msg%TYPE,
from_user errors.from_user%TYPE)
RETURNS void AS $insert_error$
BEGIN
INSERT INTO errors (file_id,hostname,error_type,msg,from_user) VALUES(fid,h,etype,msg,from_user);
UPDATE files SET status = 'ERROR' WHERE id = fid;
END;
$insert_error$ LANGUAGE plpgsql;
-- When File becomes 'READY', remove all its errors from current errors.
CREATE FUNCTION mark_ready()
RETURNS TRIGGER AS $mark_ready$
BEGIN
UPDATE main_errors SET active = FALSE WHERE file_id = NEW.id; -- or OLD.id
RETURN NEW;
END;
$mark_ready$ LANGUAGE plpgsql;
CREATE TRIGGER mark_ready
AFTER UPDATE OF status ON main -- using the main and not files
-- because "Views cannot have row-level BEFORE or AFTER triggers."
FOR EACH ROW WHEN (NEW.status = 'READY')
EXECUTE PROCEDURE mark_ready();
-- #######################################################################
-- Used by data-out from EBI
CREATE VIEW dev_ega_file AS
SELECT id AS file_id,
stable_id AS file_name,
vault_file_reference AS file_path,
vault_file_type AS vault_type,
submission_file_path AS display_file_name,
vault_file_size AS file_size,
submission_file_calculated_checksum AS checksum,
submission_file_calculated_checksum_type AS checksum_type,
vault_file_checksum AS unencrypted_checksum,
vault_file_checksum_type AS unencrypted_checksum_type,
status AS file_status,
header,
created_by,
last_modified_by AS last_updated_by,
created_at AS created,
last_modified AS last_updated
FROM main;
-- ##########################################################################
-- About the encryption
-- ##########################################################################
--
--
-- We can support multiple encryption types in the vault
-- (Say, for example, Crypt4GH, PGP and plain AES),
-- in the following manner:
--
-- We create a table for each method of encryption.
-- Each table will have its own set of fields, refering to data it needs for decryption
--
-- Then we update the main file table with a vault_encryption "keyword".
-- That will tell the main file table to look at another table for that
-- particular file. (Note that this file reference is found in only one
-- encryption table).
--
-- At the moment, we added the whole Crypt4gh-related table inside the main table.
-- That's easily changeable, using views.
--
-- The site that wants to support multiple encryption methods must update the data-in and
-- data-out code in order to push/pull the data accordingly.
Compared with the fields descried in https://github.com/EGA-archive/ega-data-api/tree/master/ega-data-api-filedatabase/src/main/resources and https://github.com/EGA-archive/ega-data-api/issues/10#issuecomment-407743725
At first glance the names seem to be ok ( the focus was on File.sql
) with an additional one above vault_type
.
The data out seems to operate within a certain schema: dev_ega_file
and it uses other tables such as downdload_log
and event
that are not present in the schema above.
Maybe a CREATE TABLE IF NOT EXISTS
would be nice either on data out or on the schema above.
@silverdaz you mentioned yesterday some changes related to schema, would be nice to take a look at them before proceeding with the next step.
Next step would be to see if data out can utilize this schema, and if any changes required on that part.
@silverdaz you mentioned yesterday some changes related to schema, would be nice to take a look at them before proceeding with the next step.
I updated the comment with the recent changes.
The data out seems to operate within a certain schema:
dev_ega_file
and it uses other tables such asdowndload_log
andevent
that are not present in the schema above.
I'm not aware of the download_log
tables, can you point me to the file loading that schema?
@silverdaz
I'm not aware of the
download_log
tables, can you point me to the file loading that schema?
There are multiple files used: https://github.com/EGA-archive/ega-data-api/tree/master/ega-data-api-filedatabase/src/main/resources . For donwload_log
see: https://github.com/EGA-archive/ega-data-api/blob/master/ega-data-api-filedatabase/src/main/resources/DownloadLog.sql
Also refer to this comment for more information: https://github.com/EGA-archive/ega-data-api/issues/10#issuecomment-407743725
....so that ingestion and data-out play nicely with each other.
One database is enough for both.