NBISweden / LocalEGA

Please go to to https://github.com/EGA-archive/LocalEGA instead
Apache License 2.0
4 stars 1 forks source link

Update the LocalEGA database schema #342

Closed silverdaz closed 6 years ago

silverdaz commented 6 years ago

....so that ingestion and data-out play nicely with each other.

One database is enough for both.

silverdaz commented 6 years ago

About harmonizing the data-in and data-out database schema, there is a better approach: Database Views.

It allows us to have whatever database backend we want, with whatever schema we fancy, and then we use 2 views, one for data-in, one for data-out.

That way, there is no need to harmonize the code to have given table names, column names, etc... In other word: Those views are an interface to the database.

Here is an example:

CREATE TABLE local_ega (
       id                        SERIAL, PRIMARY KEY(id), UNIQUE (id),
       status                    status,
       -- Original file
       elixir_id                 TEXT NOT NULL,
       inbox_path                TEXT NOT NULL,
       -- EGA file ids
       stable_id                 TEXT,
       -- Vault information
       vault_path                TEXT,
       vault_filesize            INTEGER,
       vault_display_name        TEXT,    -- no idea what that is
       -- Crypt4GH header
       header                    TEXT,
       -- Useless checksums when using Crypt4GH
       checksum                  VARCHAR(32),
       checksum_type             checksum_algorithm,
       unencrypted_checksum      VARCHAR(32),
       unencrypted_checksum_type checksum_algorithm,
       -- Status
       status                    status NOT NULL,
       -- Audit / Logs
       created_by                TEXT,
       last_modified_by          TEXT,
       created_at                TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp(),
       last_modified             TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
);
-- WITH (OIDS=FALSE);
CREATE UNIQUE INDEX file_id_idx ON local_ega (id);

-- ################## VIEWS #########################
-- Used by data-in
CREATE VIEW files AS
SELECT id, elixir_id, inbox_path, status, vault_path, vault_filesize, stable_id, header, created_at, last_modified
FROM local_ega;

-- Used by data-out from EBI
CREATE VIEW dev_ega_file AS
SELECT id AS file_id,
       stable_id AS file_name,
       vault_path AS file_path,
       display_file_name,
       vault_filesize AS file_size,
       checksum, checksum_type, unencrypted_checksum, unencrypted_checksum_type,
       status AS file_status,
       created_by, last_updated_by, header,
       created_at AS created
       last_modified AS last_updated
FROM local_ega;
silverdaz commented 6 years ago

Note that we can add a WHERE-clause if we want to disallow accessing some files.

Or here is another idea: Create a view for (local) EGA files and another view for BioBank files (using yet another where-clause)

blankdots commented 6 years ago

Related issue ? https://github.com/EGA-archive/ega-data-api/issues/10

silverdaz commented 6 years ago

ok, this seems to do it:

\connect lega

SET TIME ZONE 'UTC';

CREATE TYPE checksum_algorithm AS ENUM ('MD5', 'SHA256', 'SHA384', 'SHA512'); -- md5 is bad. Use sha*!
CREATE TYPE storage AS ENUM ('S3', 'POSIX');
-- Note: This is an enum, because that's what the "provided" database supports
--       If a site has its own database already, let them define their keyword in the ENUM 
--       and use it (Notice that their code must be update to push this value into the table)
--       There is no need to agree on how each site should operate their own database
--       What we need is to document where they need to update and what.

-- ##################################################
--                  FILE STATUS
-- ##################################################
CREATE TABLE status (
        id            INTEGER,
    code          VARCHAR(16) NOT NULL,
    description   TEXT,
    -- contraints
    PRIMARY KEY(id), UNIQUE (id), UNIQUE (code)
);

INSERT INTO status(id,code,description)
VALUES (10, 'INIT'        , 'Initializing a file ingestion'),
       (20, 'IN_INGESTION', 'Currently under ingestion'),
       (30, 'ARCHIVED'    , 'File moved to Vault'),
       (40, 'COMPLETED'   , 'File verified in Vault'),
       (50, 'READY'       , 'File ingested, ready for download'),
       -- (60, 'IN_INDEXING', 'Currently under index creation'),
       (0, 'ERROR'       , 'An Error occured, check the error table')
;

-- ##################################################
--                ENCRYPTION FORMAT
-- ##################################################
CREATE TABLE vault_encryption (
       mode          VARCHAR(16) NOT NULL, PRIMARY KEY(mode), UNIQUE (mode),
       description   TEXT
);

INSERT INTO vault_encryption(mode,description)
VALUES ('CRYPT4GH'  , 'Crypt4GH encryption (using version)'),
       ('PGP'       , 'OpenPGP encryption (RFC 4880)'),
       ('AES'       , 'AES encryption with passphrase'),
       ('CUSTOM1'   , 'Custom method 1 for local site'),
       ('CUSTOM2'   , 'Custom method 2 for local site')
    -- ...
;

-- ##################################################
--                        FILES
-- ##################################################
-- Main table with looooots of information
CREATE TABLE main (
       id                        SERIAL, PRIMARY KEY(id), UNIQUE (id),

       -- EGA file ids
       stable_id                 TEXT,

       -- Status
       status                    VARCHAR NOT NULL REFERENCES status (code), -- No "ON DELETE CASCADE": update to the new status
                                                                            --                         in case the old one is deleted 

       -- Original/Submission file
       submission_file_path                     TEXT NOT NULL,
       submission_file_extension                VARCHAR(260) NOT NULL,
       submission_file_calculated_checksum      VARCHAR(128),
       submission_file_calculated_checksum_type checksum_algorithm,

       submission_file_size                     INTEGER NULL,
       submission_user                          TEXT NOT NULL, -- Elixir ID, or internal user

       -- Vault information
       vault_file_reference      TEXT,    -- file path if POSIX, object id if S3
       vault_file_type           storage, -- S3 or POSIX file system
       vault_file_size           INTEGER,
       vault_file_checksum       VARCHAR(128) NULL, -- NOT NULL,
       vault_file_checksum_type  checksum_algorithm,

       -- Encryption/Decryption
       encryption_method         VARCHAR REFERENCES vault_encryption (mode), -- ON DELETE CASCADE,
       version                   INTEGER , -- DEFAULT 1, -- Crypt4GH version
       header                    TEXT,              -- Crypt4GH header
       session_key_checksum      VARCHAR(128) NULL, -- NOT NULL, -- To check if session key already used
       session_key_checksum_type checksum_algorithm,
       -- Note: We can support multiple encryption. See at the end of that file.

       -- Audit / Logs
       created_by                NAME DEFAULT CURRENT_USER, -- Postgres users
       last_modified_by          NAME DEFAULT CURRENT_USER, --
       created_at                TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp(),
       last_modified             TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
)
WITH (
     OIDS=FALSE
);
CREATE UNIQUE INDEX file_id_idx ON main(id);

-- ##################################################
--                      ERRORS
-- ##################################################
CREATE TABLE main_errors (
        id            SERIAL, PRIMARY KEY(id), UNIQUE (id),
    active        BOOLEAN NOT NULL DEFAULT TRUE,
    file_id       INTEGER NOT NULL REFERENCES main(id) ON DELETE CASCADE,
    hostname      TEXT,
    error_type    TEXT NOT NULL,
    msg           TEXT NOT NULL,
    from_user     BOOLEAN DEFAULT FALSE,
    occured_at    TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
);

-- ##################################################
--                 Quality Control
-- ##################################################
-- Coming soon to a cinema near you

-- ##################################################
--                      Views
-- ##################################################
-- Used by data-in
CREATE VIEW files AS
SELECT id,
       submission_user                          AS elixir_id,
       submission_file_path                     AS inbox_path,
       submission_file_size                     AS inbox_filesize,
       submission_file_calculated_checksum      AS inbox_file_checksum,
       submission_file_calculated_checksum_type AS inbox_file_checksum_type,
       status,
       vault_file_reference                     AS vault_path,
       vault_file_type                          AS vault_type,
       vault_file_size                          AS vault_filesize,
       stable_id,
       header,  -- Crypt4gh specific
       version,
       session_key_checksum,
       session_key_checksum_type,
       created_at,
       last_modified
FROM main;

-- Insert into main
CREATE FUNCTION insert_file(inpath main.submission_file_path%TYPE,
                eid    main.submission_user%TYPE)
    RETURNS main.id%TYPE AS $insert_file$
    #variable_conflict use_column
    DECLARE
        file_id  main.id%TYPE;
        file_ext main.submission_file_extension%TYPE;
    BEGIN
        file_ext := substring(inpath from '\.([^\.]*)$'); -- extract extension from filename
    INSERT INTO main (submission_file_path,
                      submission_user,
              submission_file_extension,
              status,
              encryption_method) -- hard-code the vault_encryption
    VALUES(inpath,eid,file_ext,'INIT','CRYPT4GH') RETURNING main.id
    INTO file_id;
    RETURN file_id;
    END;
$insert_file$ LANGUAGE plpgsql;

-- Just showing the current/active errors
CREATE VIEW errors AS
SELECT id, file_id, hostname, error_type, msg, from_user, occured_at
FROM main_errors
WHERE active = TRUE;

CREATE FUNCTION insert_error(fid   errors.file_id%TYPE,
                             h     errors.hostname%TYPE,
                             etype errors.error_type%TYPE,
                             msg   errors.msg%TYPE,
                             from_user  errors.from_user%TYPE)
    RETURNS void AS $insert_error$
    BEGIN
       INSERT INTO errors (file_id,hostname,error_type,msg,from_user) VALUES(fid,h,etype,msg,from_user);
       UPDATE files SET status = 'ERROR' WHERE id = fid;
    END;
$insert_error$ LANGUAGE plpgsql;

-- When File becomes 'READY', remove all its errors from current errors.
CREATE FUNCTION mark_ready()
RETURNS TRIGGER AS $mark_ready$
BEGIN
     UPDATE main_errors SET active = FALSE WHERE file_id = NEW.id;  -- or OLD.id
     RETURN NEW;
END;
$mark_ready$ LANGUAGE plpgsql;

CREATE TRIGGER mark_ready 
    AFTER UPDATE OF status ON main -- using the main and not files
                                   -- because "Views cannot have row-level BEFORE or AFTER triggers."
    FOR EACH ROW WHEN (NEW.status = 'READY')
    EXECUTE PROCEDURE mark_ready();

-- #######################################################################
-- Used by data-out from EBI
CREATE VIEW dev_ega_file AS
SELECT id AS file_id,
       stable_id AS file_name,
       vault_file_reference AS file_path,
       vault_file_type AS vault_type,
       submission_file_path AS display_file_name,
       vault_file_size AS file_size,
       submission_file_calculated_checksum AS checksum,
       submission_file_calculated_checksum_type AS checksum_type,
       vault_file_checksum AS unencrypted_checksum,
       vault_file_checksum_type AS unencrypted_checksum_type,
       status AS file_status,
       header,
       created_by,
       last_modified_by AS last_updated_by,
       created_at AS created,
       last_modified AS last_updated
FROM main;

-- ##########################################################################
--                   About the encryption
-- ##########################################################################
-- 
-- 
-- We can support multiple encryption types in the vault
-- (Say, for example, Crypt4GH, PGP and plain AES),
-- in the following manner:
-- 
-- We create a table for each method of encryption.
-- Each table will have its own set of fields, refering to data it needs for decryption
-- 
-- Then we update the main file table with a vault_encryption "keyword".
-- That will tell the main file table to look at another table for that
-- particular file. (Note that this file reference is found in only one
-- encryption table).
--
-- At the moment, we added the whole Crypt4gh-related table inside the main table.
-- That's easily changeable, using views.
-- 
-- The site that wants to support multiple encryption methods must update the data-in and
-- data-out code in order to push/pull the data accordingly.
blankdots commented 6 years ago

Compared with the fields descried in https://github.com/EGA-archive/ega-data-api/tree/master/ega-data-api-filedatabase/src/main/resources and https://github.com/EGA-archive/ega-data-api/issues/10#issuecomment-407743725 At first glance the names seem to be ok ( the focus was on File.sql) with an additional one above vault_type. The data out seems to operate within a certain schema: dev_ega_file and it uses other tables such as downdload_log and event that are not present in the schema above. Maybe a CREATE TABLE IF NOT EXISTS would be nice either on data out or on the schema above. @silverdaz you mentioned yesterday some changes related to schema, would be nice to take a look at them before proceeding with the next step.

Next step would be to see if data out can utilize this schema, and if any changes required on that part.

silverdaz commented 6 years ago

@silverdaz you mentioned yesterday some changes related to schema, would be nice to take a look at them before proceeding with the next step.

I updated the comment with the recent changes.

The data out seems to operate within a certain schema: dev_ega_file and it uses other tables such as downdload_log and event that are not present in the schema above.

I'm not aware of the download_log tables, can you point me to the file loading that schema?

blankdots commented 6 years ago

@silverdaz

I'm not aware of the download_log tables, can you point me to the file loading that schema?

There are multiple files used: https://github.com/EGA-archive/ega-data-api/tree/master/ega-data-api-filedatabase/src/main/resources . For donwload_log see: https://github.com/EGA-archive/ega-data-api/blob/master/ega-data-api-filedatabase/src/main/resources/DownloadLog.sql

Also refer to this comment for more information: https://github.com/EGA-archive/ega-data-api/issues/10#issuecomment-407743725