debrief / pepys-import

Support library for Pepys maritime data analysis environment
https://pepys-import.readthedocs.io/
Apache License 2.0
5 stars 5 forks source link

Database Review #745

Open IanMayo opened 3 years ago

IanMayo commented 3 years ago

This task is for Arunlal to review the PostGres database instance, and highlight any opportunities for improvement, based on standard best practices.

Suggestions should be created as comments on this issue, ideally with a description of the improvement offered, and a URL t guidance supporting recommendation.

rnllv commented 3 years ago

1) Don't use timestamp (without time zone)

Don't use timestamp without time zone. timestamp works well for applications which are concerned only with one time zone. It relies on the client timezone/SQL library settings to interpret the value stored in the database. When this value is copied from one system to another (data merging, data consolidation, etc.) the value changes based on the default timezone settings in various nodes between the source and destination.

It's always recommended to use *timestamp with timezone* and provide timezone value from UI whenever capturing data. This will ensure all timestamp* calculations are proper (like instant before or after given time, etc. etc.)

Since pepys and debrief applications handle data from multiple time zones, it's highly recommended to switch to timestamp with timezone data type for such fields. The following table lists all the columns which use timestamp without timezone datatype. These columns have to be redefined as timestamp with timezone. We should also consider changing columns defined as date data type for this activity.

Table Name Column Name Data Type
Activations created_date timestamp without time zone
Activations end timestamp without time zone
Activations start timestamp without time zone
Changes created_date timestamp without time zone
ClassificationTypes created_date timestamp without time zone
Comments created_date timestamp without time zone
Comments time timestamp without time zone
CommentTypes created_date timestamp without time zone
CommodityTypes created_date timestamp without time zone
ConfidenceLevels created_date timestamp without time zone
Contacts created_date timestamp without time zone
Contacts time timestamp without time zone
ContactTypes created_date timestamp without time zone
Datafiles created_date timestamp without time zone
DatafileTypes created_date timestamp without time zone
Extractions created_date timestamp without time zone
Geometries created_date timestamp without time zone
Geometries end timestamp without time zone
Geometries start timestamp without time zone
GeometrySubTypes created_date timestamp without time zone
GeometryTypes created_date timestamp without time zone
HostedBy created_date timestamp without time zone
Logs created_date timestamp without time zone
LogsHoldings created_date timestamp without time zone
LogsHoldings time timestamp without time zone
Media created_date timestamp without time zone
Media time timestamp without time zone
MediaTypes created_date timestamp without time zone
Nationalities created_date timestamp without time zone
Participants created_date timestamp without time zone
Participants end timestamp without time zone
Participants start timestamp without time zone
Platforms created_date timestamp without time zone
PlatformTypes created_date timestamp without time zone
Privacies created_date timestamp without time zone
Sensors created_date timestamp without time zone
SensorTypes created_date timestamp without time zone
States created_date timestamp without time zone
States time timestamp without time zone
Synonyms created_date timestamp without time zone
TaggedItems created_date timestamp without time zone
Tags created_date timestamp without time zone
Tasks created_date timestamp without time zone
Tasks end timestamp without time zone
Tasks start timestamp without time zone
UnitTypes created_date timestamp without time zone
Users created_date timestamp without time zone
Changes modified date
HostedBy hosted_from date
HostedBy host_to date
TaggedItems tagged_on date

https://dba.stackexchange.com/questions/59006/what-is-a-valid-use-case-for-using-timestamp-without-time-zone https://stackoverflow.com/questions/9571392/ignoring-time-zones-altogether-in-rails-and-postgresql/9576170#9576170

rnllv commented 3 years ago

2) Using partitioned table for pepys."States".

pepys."States" is one table which grows fast with respect to table size.

It's recommended to use range partitioned tables in such cases as older records are queried less often. Partitioning provides query performance benefits.

The flip side of this approach is that new partitions have to be added periodically based on the defined range. (For example, yearly if range is defined per year.) But this cost is negligible and it can be handled in SQLAlchemy.

Considering Postgres 11 and higher have an evolved level of partitioning features, it's recommended to switch to partitioned tables after upgrading the database platform to Postgres 11 or higher.

rnllv commented 3 years ago

3) Refrain from using PostgreSQL and SQL Standard keywords as Table and Column names

It's not considered a good practise to use reserved/non-reserved keywords as identifiers (across many Databases). As per ANSI standard, these terms are required to be quoted. The usage of such identifiers should be eliminated or reduced.

https://www.drupal.org/project/drupal/issues/2986452 https://www.sqlservercentral.com/forums/topic/why-shouldnt-you-use-keywords-as-column-names https://www.postgresql.org/docs/10/sql-keywords-appendix.html

The following table shows the column names in pepys schema that are keywords in either PostgreSQL or ANSI SQL.

Table Name Column Name PostgreSQL 10 Status SQL 2011 Standard Status SQL 2008 Standard Status SQL 92 Standard Status
Activations end reserved reserved reserved reserved
Activations name non-reserved non-reserved non-reserved non-reserved
Activations start non-reserved reserved reserved
Changes user reserved reserved reserved reserved
ClassificationTypes name non-reserved non-reserved non-reserved non-reserved
Comments content non-reserved non-reserved non-reserved
Comments time non-reserved (cannot be function or type) reserved reserved reserved
CommentTypes name non-reserved non-reserved non-reserved non-reserved
CommodityTypes name non-reserved non-reserved non-reserved non-reserved
ConfidenceLevels name non-reserved non-reserved non-reserved non-reserved
Contacts location non-reserved non-reserved non-reserved
Contacts name non-reserved non-reserved non-reserved non-reserved
Contacts range non-reserved reserved reserved
Contacts time non-reserved (cannot be function or type) reserved reserved reserved
ContactTypes name non-reserved non-reserved non-reserved non-reserved
Datafiles size non-reserved non-reserved reserved
DatafileTypes name non-reserved non-reserved non-reserved non-reserved
Extractions table reserved reserved reserved reserved
Geometries end reserved reserved reserved reserved
Geometries name non-reserved non-reserved non-reserved non-reserved
Geometries start non-reserved reserved reserved
GeometrySubTypes name non-reserved non-reserved non-reserved non-reserved
GeometryTypes name non-reserved non-reserved non-reserved non-reserved
Logs id non-reserved non-reserved
Logs table reserved reserved reserved reserved
LogsHoldings comment non-reserved
LogsHoldings time non-reserved (cannot be function or type) reserved reserved reserved
Media location non-reserved non-reserved non-reserved
Media time non-reserved (cannot be function or type) reserved reserved reserved
MediaTypes name non-reserved non-reserved non-reserved non-reserved
Nationalities name non-reserved non-reserved non-reserved non-reserved
Participants end reserved reserved reserved reserved
Participants force non-reserved
Participants start non-reserved reserved reserved
Platforms name non-reserved non-reserved non-reserved non-reserved
PlatformTypes name non-reserved non-reserved non-reserved non-reserved
Privacies level non-reserved non-reserved non-reserved reserved
Privacies name non-reserved non-reserved non-reserved non-reserved
Sensors name non-reserved non-reserved non-reserved non-reserved
SensorTypes name non-reserved non-reserved non-reserved non-reserved
States location non-reserved non-reserved non-reserved
States time non-reserved (cannot be function or type) reserved reserved reserved
Synonyms table reserved reserved reserved reserved
Tags name non-reserved non-reserved non-reserved non-reserved
Tasks end reserved reserved reserved reserved
Tasks location non-reserved non-reserved non-reserved
Tasks name non-reserved non-reserved non-reserved non-reserved
Tasks start non-reserved reserved reserved
UnitTypes name non-reserved non-reserved non-reserved non-reserved
Users name non-reserved non-reserved non-reserved non-reserved

List of tables which are named after keywords:

pepys.Comments

rnllv commented 3 years ago

4) Marking primary key fields as "NOT NULL" to be avoided.

All tables in pepys schema have their primary key columns marked as "NOT NULL". This is redundant and should be avoided.

For e.x., consider the DDL scripts for pepys."Activations"

CREATE TABLE pepys."Activations" (
    activation_id uuid NOT NULL,
    ....
    remarks text
);

ALTER TABLE ONLY pepys."Activations"
    ADD CONSTRAINT "pk_Activations" PRIMARY KEY (activation_id);

The activation_id column in marked as NOT NULL and PRIMARY KEY. By definition, primary key fields are not null.

The following table lists all such primary key columns which have been explicitly marked as "NOT NULL".

Table Name Column Name Primary Key Constraint Name
Activations activation_id pk_Activations
alembic_version version_num alembic_version_pkc
Changes change_id pk_Changes
ClassificationTypes class_type_id pk_ClassificationTypes
Comments comment_id pk_Comments
CommentTypes comment_type_id pk_CommentTypes
CommodityTypes commodity_type_id pk_CommodityTypes
ConfidenceLevels confidence_level_id pk_ConfidenceLevels
Contacts contact_id pk_Contacts
ContactTypes contact_type_id pk_ContactTypes
Datafiles datafile_id pk_Datafiles
DatafileTypes datafile_type_id pk_DatafileTypes
Extractions extraction_id pk_Extractions
Geometries geometry_id pk_Geometries
GeometrySubTypes geo_sub_type_id pk_GeometrySubTypes
GeometryTypes geo_type_id pk_GeometryTypes
HostedBy hosted_by_id pk_HostedBy
Logs log_id pk_Logs
LogsHoldings logs_holding_id pk_LogsHoldings
Media media_id pk_Media
MediaTypes media_type_id pk_MediaTypes
Nationalities nationality_id pk_Nationalities
Participants participant_id pk_Participants
Platforms platform_id pk_Platforms
PlatformTypes platform_type_id pk_PlatformTypes
Privacies privacy_id pk_Privacies
Sensors sensor_id pk_Sensors
SensorTypes sensor_type_id pk_SensorTypes
States state_id pk_States
Synonyms synonym_id pk_Synonyms
TaggedItems tagged_item_id pk_TaggedItems
Tags tag_id pk_Tags
Tasks task_id pk_Tasks
UnitTypes unit_type_id pk_UnitTypes
Users user_id pk_Users
rnllv commented 3 years ago

5) Case sensitive table and constraint names should be avoided

PostgreSQL follows ISO SQL standard where all unquoted identifiers are converted to lowercase and case is retained when identifiers are quoted. It's recommended to stick with lower case definitions

The only known advantage of using case-sensitive identifiers in PostgreSQL is to use same identifier in more than one place with different cases. But this practise is also frowned upon. Case-sensitive identifiers also require users/admin to quote the identifiers whenever querying these objects from their SQL client.

https://dev.to/lefebvre/dont-get-bit-by-postgresql-case-sensitivity--457i https://stackoverflow.com/questions/12076995/databases-why-case-insensitive http://sqlfiddle.com/#!17/8cf7f/4

All tables in the pepys database (except for alembic_version table) are explicitly case-sensitive.

Table Name Change to
Activations activations
Changes changes
ClassificationTypes classification_types
Comments comments
CommentTypes comment_types
CommodityTypes commodity_types
ConfidenceLevels confidence_levels
Contacts contacts
ContactTypes contact_types
Datafiles datafiles
DatafileTypes datafile_types
Extractions extractions
Geometries geometries
GeometrySubTypes geometry_sub_types
GeometryTypes geometry_types
HostedBy hosted_by
Logs logs
LogsHoldings logs_holdings
Media media
MediaTypes media_types
Nationalities nationalities
Participants participants
Platforms platforms
PlatformTypes platform_types
Privacies privacies
Sensors sensors
SensorTypes sensor_types
States states
Synonyms synonyms
TaggedItems tagged_items
Tags tags
Tasks tasks
UnitTypes unit_types
Users users
rnllv commented 3 years ago

6) Using text or varchar instead of varchar(n)

The official recommendation from PostgreSQL is to use text or varchar (without size restrictions) in place of varchar(n) data types.

Restricting length with check constraints is considered far more beneficial than defining a varchar column with length restriction.

The following table lists all columns in the pepys database which uses varchar datatype with length restriction.

Table Name Varchar Column Length
Activations name 150
alembic_version version_num 32
Changes reason 500
Changes user 150
ClassificationTypes name 150
CommentTypes name 150
CommodityTypes name 150
ConfidenceLevels name 150
Contacts name 150
Contacts track_number 20
ContactTypes name 150
Datafiles hash 32
Datafiles reference 150
Datafiles url 150
DatafileTypes name 150
Extractions chars 150
Extractions field 150
Extractions table 150
Geometries name 150
GeometrySubTypes name 150
GeometryTypes name 150
Logs field 150
Logs new_value 150
Logs table 150
Media url 150
MediaTypes name 150
Nationalities name 150
Participants force 150
Platforms identifier 10
Platforms name 150
Platforms quadgraph 4
Platforms trigraph 3
PlatformTypes name 150
Privacies name 150
Sensors name 150
SensorTypes name 150
Synonyms synonym 150
Synonyms table 150
Tags name 150
Tasks environment 150
Tasks location 150
Tasks name 150
UnitTypes name 150
Users name 150
rnllv commented 3 years ago

7) Type casting in constraint

Though I don't have a concrete source to back this up, trials indicate that type casting in constraint have some impact to bulk insert performance. This is still an open point. Further details can be found in this StackExchange page. Thanks @IanMayo for helping with the trials performed in this regard.

The trials were performed with ~97 million records. And the performance impact seen is very minimal (matter of few seconds).

This recommendation would be invalidated by the previous one, if implemented, i.e. Using text or varchar instead of varchar(n)

The following is the list of constraints where explicit type casting to another datatype is performed.

ck_Changes_ck_Changes_reason ck_ClassificationTypes_ck_ClassificationTypes_name ck_CommentTypes_ck_CommentTypes_name ck_CommodityTypes_ck_CommodityTypes_name ck_ConfidenceLevels_ck_ConfidenceLevels_name ck_ContactTypes_ck_ContactTypes_name ck_DatafileTypes_ck_DatafileTypes_name ck_Datafiles_ck_Datafiles_hash ck_GeometrySubTypes_ck_GeometrySubTypes_name ck_GeometryTypes_ck_GeometryTypes_name ck_Logs_ck_Logs_table ck_Media_ck_Media_url ck_MediaTypes_ck_MediaTypes_name ck_Nationalities_ck_Nationalities_name ck_PlatformTypes_ck_PlatformTypes_name ck_Platforms_ck_Platforms_identifier ck_Platforms_ck_Platforms_name ck_Privacies_ck_Privacies_name ck_SensorTypes_ck_SensorTypes_name ck_Sensors_ck_Sensors_name ck_Synonyms_ck_Synonyms_synonym ck_Synonyms_ck_Synonyms_table ck_Tasks_ck_Tasks_name ck_UnitTypes_ck_UnitTypes_name ck_Users_ck_Users_name

rnllv commented 3 years ago

8) Using smallint datatype for low-cardinality columns

We can consider using smallint datatype instead of integer for the below low-cardinality columns. Read/Write performance will be enhanced due to its smaller size.

Table Name Column Name
Datafiles size
Nationalities priority
Privacies level

PostgreSQL datatype description for int types is as follows.

Name Storage Size Description Range
smallint 2 bytes small-range integer -32768 to +32767
integer 4 bytes typical choice for integer -2147483648 to +2147483647
rnllv commented 3 years ago

9) Database hygiene suggestions

The following suggestions are just for database hygiene and has nothing to do with application performance.

a) Expanding columns names

We can consider expanding the following columns with proper names.

Table Name Column Name New Name
Contacts ambig_bearing bearing_ambiguity
Contacts freq frequency
Contacts mla mean_line_of_advance
Contacts rel_bearing relative_bearing
Contacts soa speed_of_advance
Extractions chars characters
Logs id log_id

@IanMayo, can you suggest the names where it's marked as not available <<N.A.>> ?

b) Changing the names of the following constraints from mixed case to lower case

ck_Changes_ck_Changes_reason ck_Changes_ck_Changes_user ck_ClassificationTypes_ck_ClassificationTypes_name ck_CommentTypes_ck_CommentTypes_name ck_Comments_ck_Comments_content ck_CommodityTypes_ck_CommodityTypes_name ck_ConfidenceLevels_ck_ConfidenceLevels_name ck_ContactTypes_ck_ContactTypes_name ck_DatafileTypes_ck_DatafileTypes_name ck_Datafiles_ck_Datafiles_hash ck_GeometrySubTypes_ck_GeometrySubTypes_name ck_GeometryTypes_ck_GeometryTypes_name ck_Logs_ck_Logs_table ck_Media_ck_Media_url ck_MediaTypes_ck_MediaTypes_name ck_Nationalities_ck_Nationalities_name ck_PlatformTypes_ck_PlatformTypes_name ck_Platforms_ck_Platforms_identifier ck_Platforms_ck_Platforms_name ck_Privacies_ck_Privacies_name ck_SensorTypes_ck_SensorTypes_name ck_Sensors_ck_Sensors_name ck_Synonyms_ck_Synonyms_synonym ck_Synonyms_ck_Synonyms_table ck_Tasks_ck_Tasks_name ck_UnitTypes_ck_UnitTypes_name ck_Users_ck_Users_name fk_Activations_privacy_id_Privacies fk_Activations_sensor_id_Sensors fk_Activations_source_id_Datafiles fk_Changes_datafile_id_Datafiles fk_Comments_comment_type_id_CommentTypes fk_Comments_platform_id_Platforms fk_Comments_privacy_id_Privacies fk_Comments_source_id_Datafiles fk_Contacts_classification_ClassificationTypes fk_Contacts_confidence_ConfidenceLevels fk_Contacts_contact_type_ContactTypes fk_Contacts_privacy_id_Privacies fk_Contacts_sensor_id_Sensors fk_Contacts_source_id_Datafiles fk_Contacts_subject_id_Platforms fk_Datafiles_datafile_type_id_DatafileTypes fk_Datafiles_privacy_id_Privacies fk_Geometries_geo_sub_type_id_GeometrySubTypes fk_Geometries_geo_type_id_GeometryTypes fk_Geometries_privacy_id_Privacies fk_Geometries_sensor_platform_id_Platforms fk_Geometries_source_id_Datafiles fk_Geometries_subject_platform_id_Platforms fk_Geometries_task_id_Tasks fk_GeometrySubTypes_parent_GeometryTypes fk_HostedBy_host_id_Platforms fk_HostedBy_privacy_id_Privacies fk_HostedBy_subject_id_Platforms fk_Logs_change_id_Changes fk_LogsHoldings_commodity_id_CommodityTypes fk_LogsHoldings_platform_id_Platforms fk_LogsHoldings_privacy_id_Privacies fk_LogsHoldings_source_id_Datafiles fk_LogsHoldings_unit_type_id_UnitTypes fk_Media_media_type_id_MediaTypes fk_Media_platform_id_Platforms fk_Media_privacy_id_Privacies fk_Media_sensor_id_Sensors fk_Media_source_id_Datafiles fk_Media_subject_id_Platforms fk_Participants_platform_id_Platforms fk_Participants_privacy_id_Privacies fk_Participants_task_id_Tasks fk_Platforms_nationality_id_Nationalities fk_Platforms_platform_type_id_PlatformTypes fk_Platforms_privacy_id_Privacies fk_Sensors_host_Platforms fk_Sensors_privacy_id_Privacies fk_Sensors_sensor_type_id_SensorTypes fk_States_privacy_id_Privacies fk_States_sensor_id_Sensors fk_States_source_id_Datafiles fk_TaggedItems_tagged_by_id_Users fk_TaggedItems_tag_id_Tags fk_Tasks_parent_id_Tasks fk_Tasks_privacy_id_Privacies pk_Activations pk_Changes pk_ClassificationTypes pk_Comments pk_CommentTypes pk_CommodityTypes pk_ConfidenceLevels pk_Contacts pk_ContactTypes pk_Datafiles pk_DatafileTypes pk_Extractions pk_Geometries pk_GeometrySubTypes pk_GeometryTypes pk_HostedBy pk_Logs pk_LogsHoldings pk_Media pk_MediaTypes pk_Nationalities pk_Participants pk_Platforms pk_PlatformTypes pk_Privacies pk_Sensors pk_SensorTypes pk_States pk_Synonyms pk_TaggedItems pk_Tags pk_Tasks pk_UnitTypes pk_Users uq_ClassificationTypes_name uq_CommentTypes_name uq_CommodityTypes_name uq_ConfidenceLevels_name uq_ContactTypes_name uq_Datafile_size_hash uq_DatafileTypes_name uq_GeometrySubType_name_parent uq_GeometryTypes_name uq_MediaTypes_name uq_Nationalities_name uq_Platform_name_nat_identifier uq_PlatformTypes_name uq_Privacies_name uq_SensorTypes_name uq_UnitTypes_name uq_Users_name

c) Defining constraints externally

It's seen that all CHECK and NOT NULL constraints are defined inline (along with the table definition) where as all the other constraints are defined externally (using ALTER TABLE ... ADD CONSTRAINT).

For the sake of consistency, it's recommended that all CHECK and NOT NULL constraint definitions are also moved externally along with the other constraint definitions.

rnllv commented 3 years ago

10) Few suggestions on PostGis indexes

The following blog has some suggestions on selecting the right indexes for PostGis datatypes and operations. As I'm not fully aware of Pepys/Debrief functionality, kindly check them to see if anything is applicable.

https://www.alibabacloud.com/blog/postgresql-best-practices-selection-and-optimization-of-postgis-spatial-indexes-gist-brin-and-r-tree_597034