Open IanMayo opened 3 years ago
Don't use timestamp without time zone. timestamp works well for applications which are concerned only with one time zone. It relies on the client timezone/SQL library settings to interpret the value stored in the database. When this value is copied from one system to another (data merging, data consolidation, etc.) the value changes based on the default timezone settings in various nodes between the source and destination.
It's always recommended to use *timestamp with timezone* and provide timezone value from UI whenever capturing data. This will ensure all timestamp* calculations are proper (like instant before or after given time, etc. etc.)
Since pepys and debrief applications handle data from multiple time zones, it's highly recommended to switch to timestamp with timezone data type for such fields. The following table lists all the columns which use timestamp without timezone datatype. These columns have to be redefined as timestamp with timezone. We should also consider changing columns defined as date data type for this activity.
Table Name | Column Name | Data Type |
---|---|---|
Activations | created_date | timestamp without time zone |
Activations | end | timestamp without time zone |
Activations | start | timestamp without time zone |
Changes | created_date | timestamp without time zone |
ClassificationTypes | created_date | timestamp without time zone |
Comments | created_date | timestamp without time zone |
Comments | time | timestamp without time zone |
CommentTypes | created_date | timestamp without time zone |
CommodityTypes | created_date | timestamp without time zone |
ConfidenceLevels | created_date | timestamp without time zone |
Contacts | created_date | timestamp without time zone |
Contacts | time | timestamp without time zone |
ContactTypes | created_date | timestamp without time zone |
Datafiles | created_date | timestamp without time zone |
DatafileTypes | created_date | timestamp without time zone |
Extractions | created_date | timestamp without time zone |
Geometries | created_date | timestamp without time zone |
Geometries | end | timestamp without time zone |
Geometries | start | timestamp without time zone |
GeometrySubTypes | created_date | timestamp without time zone |
GeometryTypes | created_date | timestamp without time zone |
HostedBy | created_date | timestamp without time zone |
Logs | created_date | timestamp without time zone |
LogsHoldings | created_date | timestamp without time zone |
LogsHoldings | time | timestamp without time zone |
Media | created_date | timestamp without time zone |
Media | time | timestamp without time zone |
MediaTypes | created_date | timestamp without time zone |
Nationalities | created_date | timestamp without time zone |
Participants | created_date | timestamp without time zone |
Participants | end | timestamp without time zone |
Participants | start | timestamp without time zone |
Platforms | created_date | timestamp without time zone |
PlatformTypes | created_date | timestamp without time zone |
Privacies | created_date | timestamp without time zone |
Sensors | created_date | timestamp without time zone |
SensorTypes | created_date | timestamp without time zone |
States | created_date | timestamp without time zone |
States | time | timestamp without time zone |
Synonyms | created_date | timestamp without time zone |
TaggedItems | created_date | timestamp without time zone |
Tags | created_date | timestamp without time zone |
Tasks | created_date | timestamp without time zone |
Tasks | end | timestamp without time zone |
Tasks | start | timestamp without time zone |
UnitTypes | created_date | timestamp without time zone |
Users | created_date | timestamp without time zone |
Changes | modified | date |
HostedBy | hosted_from | date |
HostedBy | host_to | date |
TaggedItems | tagged_on | date |
https://dba.stackexchange.com/questions/59006/what-is-a-valid-use-case-for-using-timestamp-without-time-zone https://stackoverflow.com/questions/9571392/ignoring-time-zones-altogether-in-rails-and-postgresql/9576170#9576170
pepys."States" is one table which grows fast with respect to table size.
It's recommended to use range partitioned tables in such cases as older records are queried less often. Partitioning provides query performance benefits.
The flip side of this approach is that new partitions have to be added periodically based on the defined range. (For example, yearly if range is defined per year.) But this cost is negligible and it can be handled in SQLAlchemy.
Considering Postgres 11 and higher have an evolved level of partitioning features, it's recommended to switch to partitioned tables after upgrading the database platform to Postgres 11 or higher.
It's not considered a good practise to use reserved/non-reserved keywords as identifiers (across many Databases). As per ANSI standard, these terms are required to be quoted. The usage of such identifiers should be eliminated or reduced.
https://www.drupal.org/project/drupal/issues/2986452 https://www.sqlservercentral.com/forums/topic/why-shouldnt-you-use-keywords-as-column-names https://www.postgresql.org/docs/10/sql-keywords-appendix.html
The following table shows the column names in pepys schema that are keywords in either PostgreSQL or ANSI SQL.
Table Name | Column Name | PostgreSQL 10 Status | SQL 2011 Standard Status | SQL 2008 Standard Status | SQL 92 Standard Status |
---|---|---|---|---|---|
Activations | end | reserved | reserved | reserved | reserved |
Activations | name | non-reserved | non-reserved | non-reserved | non-reserved |
Activations | start | non-reserved | reserved | reserved | |
Changes | user | reserved | reserved | reserved | reserved |
ClassificationTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
Comments | content | non-reserved | non-reserved | non-reserved | |
Comments | time | non-reserved (cannot be function or type) | reserved | reserved | reserved |
CommentTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
CommodityTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
ConfidenceLevels | name | non-reserved | non-reserved | non-reserved | non-reserved |
Contacts | location | non-reserved | non-reserved | non-reserved | |
Contacts | name | non-reserved | non-reserved | non-reserved | non-reserved |
Contacts | range | non-reserved | reserved | reserved | |
Contacts | time | non-reserved (cannot be function or type) | reserved | reserved | reserved |
ContactTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
Datafiles | size | non-reserved | non-reserved | reserved | |
DatafileTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
Extractions | table | reserved | reserved | reserved | reserved |
Geometries | end | reserved | reserved | reserved | reserved |
Geometries | name | non-reserved | non-reserved | non-reserved | non-reserved |
Geometries | start | non-reserved | reserved | reserved | |
GeometrySubTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
GeometryTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
Logs | id | non-reserved | non-reserved | ||
Logs | table | reserved | reserved | reserved | reserved |
LogsHoldings | comment | non-reserved | |||
LogsHoldings | time | non-reserved (cannot be function or type) | reserved | reserved | reserved |
Media | location | non-reserved | non-reserved | non-reserved | |
Media | time | non-reserved (cannot be function or type) | reserved | reserved | reserved |
MediaTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
Nationalities | name | non-reserved | non-reserved | non-reserved | non-reserved |
Participants | end | reserved | reserved | reserved | reserved |
Participants | force | non-reserved | |||
Participants | start | non-reserved | reserved | reserved | |
Platforms | name | non-reserved | non-reserved | non-reserved | non-reserved |
PlatformTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
Privacies | level | non-reserved | non-reserved | non-reserved | reserved |
Privacies | name | non-reserved | non-reserved | non-reserved | non-reserved |
Sensors | name | non-reserved | non-reserved | non-reserved | non-reserved |
SensorTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
States | location | non-reserved | non-reserved | non-reserved | |
States | time | non-reserved (cannot be function or type) | reserved | reserved | reserved |
Synonyms | table | reserved | reserved | reserved | reserved |
Tags | name | non-reserved | non-reserved | non-reserved | non-reserved |
Tasks | end | reserved | reserved | reserved | reserved |
Tasks | location | non-reserved | non-reserved | non-reserved | |
Tasks | name | non-reserved | non-reserved | non-reserved | non-reserved |
Tasks | start | non-reserved | reserved | reserved | |
UnitTypes | name | non-reserved | non-reserved | non-reserved | non-reserved |
Users | name | non-reserved | non-reserved | non-reserved | non-reserved |
List of tables which are named after keywords:
pepys.Comments
All tables in pepys schema have their primary key columns marked as "NOT NULL". This is redundant and should be avoided.
For e.x., consider the DDL scripts for pepys."Activations"
CREATE TABLE pepys."Activations" (
activation_id uuid NOT NULL,
....
remarks text
);
ALTER TABLE ONLY pepys."Activations"
ADD CONSTRAINT "pk_Activations" PRIMARY KEY (activation_id);
The activation_id column in marked as NOT NULL and PRIMARY KEY. By definition, primary key fields are not null.
The following table lists all such primary key columns which have been explicitly marked as "NOT NULL".
Table Name | Column Name | Primary Key Constraint Name |
---|---|---|
Activations | activation_id | pk_Activations |
alembic_version | version_num | alembic_version_pkc |
Changes | change_id | pk_Changes |
ClassificationTypes | class_type_id | pk_ClassificationTypes |
Comments | comment_id | pk_Comments |
CommentTypes | comment_type_id | pk_CommentTypes |
CommodityTypes | commodity_type_id | pk_CommodityTypes |
ConfidenceLevels | confidence_level_id | pk_ConfidenceLevels |
Contacts | contact_id | pk_Contacts |
ContactTypes | contact_type_id | pk_ContactTypes |
Datafiles | datafile_id | pk_Datafiles |
DatafileTypes | datafile_type_id | pk_DatafileTypes |
Extractions | extraction_id | pk_Extractions |
Geometries | geometry_id | pk_Geometries |
GeometrySubTypes | geo_sub_type_id | pk_GeometrySubTypes |
GeometryTypes | geo_type_id | pk_GeometryTypes |
HostedBy | hosted_by_id | pk_HostedBy |
Logs | log_id | pk_Logs |
LogsHoldings | logs_holding_id | pk_LogsHoldings |
Media | media_id | pk_Media |
MediaTypes | media_type_id | pk_MediaTypes |
Nationalities | nationality_id | pk_Nationalities |
Participants | participant_id | pk_Participants |
Platforms | platform_id | pk_Platforms |
PlatformTypes | platform_type_id | pk_PlatformTypes |
Privacies | privacy_id | pk_Privacies |
Sensors | sensor_id | pk_Sensors |
SensorTypes | sensor_type_id | pk_SensorTypes |
States | state_id | pk_States |
Synonyms | synonym_id | pk_Synonyms |
TaggedItems | tagged_item_id | pk_TaggedItems |
Tags | tag_id | pk_Tags |
Tasks | task_id | pk_Tasks |
UnitTypes | unit_type_id | pk_UnitTypes |
Users | user_id | pk_Users |
PostgreSQL follows ISO SQL standard where all unquoted identifiers are converted to lowercase and case is retained when identifiers are quoted. It's recommended to stick with lower case definitions
The only known advantage of using case-sensitive identifiers in PostgreSQL is to use same identifier in more than one place with different cases. But this practise is also frowned upon. Case-sensitive identifiers also require users/admin to quote the identifiers whenever querying these objects from their SQL client.
https://dev.to/lefebvre/dont-get-bit-by-postgresql-case-sensitivity--457i https://stackoverflow.com/questions/12076995/databases-why-case-insensitive http://sqlfiddle.com/#!17/8cf7f/4
All tables in the pepys database (except for alembic_version table) are explicitly case-sensitive.
Table Name | Change to |
---|---|
Activations | activations |
Changes | changes |
ClassificationTypes | classification_types |
Comments | comments |
CommentTypes | comment_types |
CommodityTypes | commodity_types |
ConfidenceLevels | confidence_levels |
Contacts | contacts |
ContactTypes | contact_types |
Datafiles | datafiles |
DatafileTypes | datafile_types |
Extractions | extractions |
Geometries | geometries |
GeometrySubTypes | geometry_sub_types |
GeometryTypes | geometry_types |
HostedBy | hosted_by |
Logs | logs |
LogsHoldings | logs_holdings |
Media | media |
MediaTypes | media_types |
Nationalities | nationalities |
Participants | participants |
Platforms | platforms |
PlatformTypes | platform_types |
Privacies | privacies |
Sensors | sensors |
SensorTypes | sensor_types |
States | states |
Synonyms | synonyms |
TaggedItems | tagged_items |
Tags | tags |
Tasks | tasks |
UnitTypes | unit_types |
Users | users |
The official recommendation from PostgreSQL is to use text or varchar (without size restrictions) in place of varchar(n) data types.
Restricting length with check constraints is considered far more beneficial than defining a varchar column with length restriction.
The following table lists all columns in the pepys database which uses varchar datatype with length restriction.
Table Name | Varchar Column | Length |
---|---|---|
Activations | name | 150 |
alembic_version | version_num | 32 |
Changes | reason | 500 |
Changes | user | 150 |
ClassificationTypes | name | 150 |
CommentTypes | name | 150 |
CommodityTypes | name | 150 |
ConfidenceLevels | name | 150 |
Contacts | name | 150 |
Contacts | track_number | 20 |
ContactTypes | name | 150 |
Datafiles | hash | 32 |
Datafiles | reference | 150 |
Datafiles | url | 150 |
DatafileTypes | name | 150 |
Extractions | chars | 150 |
Extractions | field | 150 |
Extractions | table | 150 |
Geometries | name | 150 |
GeometrySubTypes | name | 150 |
GeometryTypes | name | 150 |
Logs | field | 150 |
Logs | new_value | 150 |
Logs | table | 150 |
Media | url | 150 |
MediaTypes | name | 150 |
Nationalities | name | 150 |
Participants | force | 150 |
Platforms | identifier | 10 |
Platforms | name | 150 |
Platforms | quadgraph | 4 |
Platforms | trigraph | 3 |
PlatformTypes | name | 150 |
Privacies | name | 150 |
Sensors | name | 150 |
SensorTypes | name | 150 |
Synonyms | synonym | 150 |
Synonyms | table | 150 |
Tags | name | 150 |
Tasks | environment | 150 |
Tasks | location | 150 |
Tasks | name | 150 |
UnitTypes | name | 150 |
Users | name | 150 |
Though I don't have a concrete source to back this up, trials indicate that type casting in constraint have some impact to bulk insert performance. This is still an open point. Further details can be found in this StackExchange page. Thanks @IanMayo for helping with the trials performed in this regard.
The trials were performed with ~97 million records. And the performance impact seen is very minimal (matter of few seconds).
This recommendation would be invalidated by the previous one, if implemented, i.e. Using text or varchar instead of varchar(n)
The following is the list of constraints where explicit type casting to another datatype is performed.
ck_Changes_ck_Changes_reason ck_ClassificationTypes_ck_ClassificationTypes_name ck_CommentTypes_ck_CommentTypes_name ck_CommodityTypes_ck_CommodityTypes_name ck_ConfidenceLevels_ck_ConfidenceLevels_name ck_ContactTypes_ck_ContactTypes_name ck_DatafileTypes_ck_DatafileTypes_name ck_Datafiles_ck_Datafiles_hash ck_GeometrySubTypes_ck_GeometrySubTypes_name ck_GeometryTypes_ck_GeometryTypes_name ck_Logs_ck_Logs_table ck_Media_ck_Media_url ck_MediaTypes_ck_MediaTypes_name ck_Nationalities_ck_Nationalities_name ck_PlatformTypes_ck_PlatformTypes_name ck_Platforms_ck_Platforms_identifier ck_Platforms_ck_Platforms_name ck_Privacies_ck_Privacies_name ck_SensorTypes_ck_SensorTypes_name ck_Sensors_ck_Sensors_name ck_Synonyms_ck_Synonyms_synonym ck_Synonyms_ck_Synonyms_table ck_Tasks_ck_Tasks_name ck_UnitTypes_ck_UnitTypes_name ck_Users_ck_Users_name
We can consider using smallint datatype instead of integer for the below low-cardinality columns. Read/Write performance will be enhanced due to its smaller size.
Table Name | Column Name |
---|---|
Datafiles | size |
Nationalities | priority |
Privacies | level |
PostgreSQL datatype description for int types is as follows.
Name | Storage Size | Description | Range |
---|---|---|---|
smallint | 2 bytes | small-range integer | -32768 to +32767 |
integer | 4 bytes | typical choice for integer | -2147483648 to +2147483647 |
The following suggestions are just for database hygiene and has nothing to do with application performance.
We can consider expanding the following columns with proper names.
Table Name | Column Name | New Name |
---|---|---|
Contacts | ambig_bearing | bearing_ambiguity |
Contacts | freq | frequency |
Contacts | mla | mean_line_of_advance |
Contacts | rel_bearing | relative_bearing |
Contacts | soa | speed_of_advance |
Extractions | chars | characters |
Logs | id | log_id |
@IanMayo, can you suggest the names where it's marked as not available <<N.A.>> ?
ck_Changes_ck_Changes_reason ck_Changes_ck_Changes_user ck_ClassificationTypes_ck_ClassificationTypes_name ck_CommentTypes_ck_CommentTypes_name ck_Comments_ck_Comments_content ck_CommodityTypes_ck_CommodityTypes_name ck_ConfidenceLevels_ck_ConfidenceLevels_name ck_ContactTypes_ck_ContactTypes_name ck_DatafileTypes_ck_DatafileTypes_name ck_Datafiles_ck_Datafiles_hash ck_GeometrySubTypes_ck_GeometrySubTypes_name ck_GeometryTypes_ck_GeometryTypes_name ck_Logs_ck_Logs_table ck_Media_ck_Media_url ck_MediaTypes_ck_MediaTypes_name ck_Nationalities_ck_Nationalities_name ck_PlatformTypes_ck_PlatformTypes_name ck_Platforms_ck_Platforms_identifier ck_Platforms_ck_Platforms_name ck_Privacies_ck_Privacies_name ck_SensorTypes_ck_SensorTypes_name ck_Sensors_ck_Sensors_name ck_Synonyms_ck_Synonyms_synonym ck_Synonyms_ck_Synonyms_table ck_Tasks_ck_Tasks_name ck_UnitTypes_ck_UnitTypes_name ck_Users_ck_Users_name fk_Activations_privacy_id_Privacies fk_Activations_sensor_id_Sensors fk_Activations_source_id_Datafiles fk_Changes_datafile_id_Datafiles fk_Comments_comment_type_id_CommentTypes fk_Comments_platform_id_Platforms fk_Comments_privacy_id_Privacies fk_Comments_source_id_Datafiles fk_Contacts_classification_ClassificationTypes fk_Contacts_confidence_ConfidenceLevels fk_Contacts_contact_type_ContactTypes fk_Contacts_privacy_id_Privacies fk_Contacts_sensor_id_Sensors fk_Contacts_source_id_Datafiles fk_Contacts_subject_id_Platforms fk_Datafiles_datafile_type_id_DatafileTypes fk_Datafiles_privacy_id_Privacies fk_Geometries_geo_sub_type_id_GeometrySubTypes fk_Geometries_geo_type_id_GeometryTypes fk_Geometries_privacy_id_Privacies fk_Geometries_sensor_platform_id_Platforms fk_Geometries_source_id_Datafiles fk_Geometries_subject_platform_id_Platforms fk_Geometries_task_id_Tasks fk_GeometrySubTypes_parent_GeometryTypes fk_HostedBy_host_id_Platforms fk_HostedBy_privacy_id_Privacies fk_HostedBy_subject_id_Platforms fk_Logs_change_id_Changes fk_LogsHoldings_commodity_id_CommodityTypes fk_LogsHoldings_platform_id_Platforms fk_LogsHoldings_privacy_id_Privacies fk_LogsHoldings_source_id_Datafiles fk_LogsHoldings_unit_type_id_UnitTypes fk_Media_media_type_id_MediaTypes fk_Media_platform_id_Platforms fk_Media_privacy_id_Privacies fk_Media_sensor_id_Sensors fk_Media_source_id_Datafiles fk_Media_subject_id_Platforms fk_Participants_platform_id_Platforms fk_Participants_privacy_id_Privacies fk_Participants_task_id_Tasks fk_Platforms_nationality_id_Nationalities fk_Platforms_platform_type_id_PlatformTypes fk_Platforms_privacy_id_Privacies fk_Sensors_host_Platforms fk_Sensors_privacy_id_Privacies fk_Sensors_sensor_type_id_SensorTypes fk_States_privacy_id_Privacies fk_States_sensor_id_Sensors fk_States_source_id_Datafiles fk_TaggedItems_tagged_by_id_Users fk_TaggedItems_tag_id_Tags fk_Tasks_parent_id_Tasks fk_Tasks_privacy_id_Privacies pk_Activations pk_Changes pk_ClassificationTypes pk_Comments pk_CommentTypes pk_CommodityTypes pk_ConfidenceLevels pk_Contacts pk_ContactTypes pk_Datafiles pk_DatafileTypes pk_Extractions pk_Geometries pk_GeometrySubTypes pk_GeometryTypes pk_HostedBy pk_Logs pk_LogsHoldings pk_Media pk_MediaTypes pk_Nationalities pk_Participants pk_Platforms pk_PlatformTypes pk_Privacies pk_Sensors pk_SensorTypes pk_States pk_Synonyms pk_TaggedItems pk_Tags pk_Tasks pk_UnitTypes pk_Users uq_ClassificationTypes_name uq_CommentTypes_name uq_CommodityTypes_name uq_ConfidenceLevels_name uq_ContactTypes_name uq_Datafile_size_hash uq_DatafileTypes_name uq_GeometrySubType_name_parent uq_GeometryTypes_name uq_MediaTypes_name uq_Nationalities_name uq_Platform_name_nat_identifier uq_PlatformTypes_name uq_Privacies_name uq_SensorTypes_name uq_UnitTypes_name uq_Users_name
It's seen that all CHECK and NOT NULL constraints are defined inline (along with the table definition) where as all the other constraints are defined externally (using ALTER TABLE ... ADD CONSTRAINT).
For the sake of consistency, it's recommended that all CHECK and NOT NULL constraint definitions are also moved externally along with the other constraint definitions.
The following blog has some suggestions on selecting the right indexes for PostGis datatypes and operations. As I'm not fully aware of Pepys/Debrief functionality, kindly check them to see if anything is applicable.
This task is for Arunlal to review the PostGres database instance, and highlight any opportunities for improvement, based on standard best practices.
Suggestions should be created as comments on this issue, ideally with a description of the improvement offered, and a URL t guidance supporting recommendation.