Database Review - Githubissues

IanMayo commented 3 years ago

This task is for Arunlal to review the PostGres database instance, and highlight any opportunities for improvement, based on standard best practices.

Suggestions should be created as comments on this issue, ideally with a description of the improvement offered, and a URL t guidance supporting recommendation.

rnllv commented 3 years ago

1) Don't use timestamp (without time zone)

Don't use timestamp without time zone. timestamp works well for applications which are concerned only with one time zone. It relies on the client timezone/SQL library settings to interpret the value stored in the database. When this value is copied from one system to another (data merging, data consolidation, etc.) the value changes based on the default timezone settings in various nodes between the source and destination.

It's always recommended to use *timestamp with timezone* and provide timezone value from UI whenever capturing data. This will ensure all timestamp* calculations are proper (like instant before or after given time, etc. etc.)

Since pepys and debrief applications handle data from multiple time zones, it's highly recommended to switch to timestamp with timezone data type for such fields. The following table lists all the columns which use timestamp without timezone datatype. These columns have to be redefined as timestamp with timezone. We should also consider changing columns defined as date data type for this activity.

Table Name	Column Name	Data Type
Activations	created_date	timestamp without time zone
Activations	end	timestamp without time zone
Activations	start	timestamp without time zone
Changes	created_date	timestamp without time zone
ClassificationTypes	created_date	timestamp without time zone
Comments	created_date	timestamp without time zone
Comments	time	timestamp without time zone
CommentTypes	created_date	timestamp without time zone
CommodityTypes	created_date	timestamp without time zone
ConfidenceLevels	created_date	timestamp without time zone
Contacts	created_date	timestamp without time zone
Contacts	time	timestamp without time zone
ContactTypes	created_date	timestamp without time zone
Datafiles	created_date	timestamp without time zone
DatafileTypes	created_date	timestamp without time zone
Extractions	created_date	timestamp without time zone
Geometries	created_date	timestamp without time zone
Geometries	end	timestamp without time zone
Geometries	start	timestamp without time zone
GeometrySubTypes	created_date	timestamp without time zone
GeometryTypes	created_date	timestamp without time zone
HostedBy	created_date	timestamp without time zone
Logs	created_date	timestamp without time zone
LogsHoldings	created_date	timestamp without time zone
LogsHoldings	time	timestamp without time zone
Media	created_date	timestamp without time zone
Media	time	timestamp without time zone
MediaTypes	created_date	timestamp without time zone
Nationalities	created_date	timestamp without time zone
Participants	created_date	timestamp without time zone
Participants	end	timestamp without time zone
Participants	start	timestamp without time zone
Platforms	created_date	timestamp without time zone
PlatformTypes	created_date	timestamp without time zone
Privacies	created_date	timestamp without time zone
Sensors	created_date	timestamp without time zone
SensorTypes	created_date	timestamp without time zone
States	created_date	timestamp without time zone
States	time	timestamp without time zone
Synonyms	created_date	timestamp without time zone
TaggedItems	created_date	timestamp without time zone
Tags	created_date	timestamp without time zone
Tasks	created_date	timestamp without time zone
Tasks	end	timestamp without time zone
Tasks	start	timestamp without time zone
UnitTypes	created_date	timestamp without time zone
Users	created_date	timestamp without time zone
Changes	modified	date
HostedBy	hosted_from	date
HostedBy	host_to	date
TaggedItems	tagged_on	date

https://dba.stackexchange.com/questions/59006/what-is-a-valid-use-case-for-using-timestamp-without-time-zone https://stackoverflow.com/questions/9571392/ignoring-time-zones-altogether-in-rails-and-postgresql/9576170#9576170

rnllv commented 3 years ago

2) Using partitioned table for pepys."States".

pepys."States" is one table which grows fast with respect to table size.

It's recommended to use range partitioned tables in such cases as older records are queried less often. Partitioning provides query performance benefits.

The flip side of this approach is that new partitions have to be added periodically based on the defined range. (For example, yearly if range is defined per year.) But this cost is negligible and it can be handled in SQLAlchemy.

Considering Postgres 11 and higher have an evolved level of partitioning features, it's recommended to switch to partitioned tables after upgrading the database platform to Postgres 11 or higher.

rnllv commented 3 years ago

3) Refrain from using PostgreSQL and SQL Standard keywords as Table and Column names

It's not considered a good practise to use reserved/non-reserved keywords as identifiers (across many Databases). As per ANSI standard, these terms are required to be quoted. The usage of such identifiers should be eliminated or reduced.

https://www.drupal.org/project/drupal/issues/2986452 https://www.sqlservercentral.com/forums/topic/why-shouldnt-you-use-keywords-as-column-names https://www.postgresql.org/docs/10/sql-keywords-appendix.html

The following table shows the column names in pepys schema that are keywords in either PostgreSQL or ANSI SQL.

Table Name	Column Name	PostgreSQL 10 Status	SQL 2011 Standard Status	SQL 2008 Standard Status	SQL 92 Standard Status
Activations	end	reserved	reserved	reserved	reserved
Activations	name	non-reserved	non-reserved	non-reserved	non-reserved
Activations	start	non-reserved	reserved	reserved
Changes	user	reserved	reserved	reserved	reserved
ClassificationTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
Comments	content	non-reserved	non-reserved	non-reserved
Comments	time	non-reserved (cannot be function or type)	reserved	reserved	reserved
CommentTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
CommodityTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
ConfidenceLevels	name	non-reserved	non-reserved	non-reserved	non-reserved
Contacts	location	non-reserved	non-reserved	non-reserved
Contacts	name	non-reserved	non-reserved	non-reserved	non-reserved
Contacts	range	non-reserved	reserved	reserved
Contacts	time	non-reserved (cannot be function or type)	reserved	reserved	reserved
ContactTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
Datafiles	size		non-reserved	non-reserved	reserved
DatafileTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
Extractions	table	reserved	reserved	reserved	reserved
Geometries	end	reserved	reserved	reserved	reserved
Geometries	name	non-reserved	non-reserved	non-reserved	non-reserved
Geometries	start	non-reserved	reserved	reserved
GeometrySubTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
GeometryTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
Logs	id		non-reserved	non-reserved
Logs	table	reserved	reserved	reserved	reserved
LogsHoldings	comment	non-reserved
LogsHoldings	time	non-reserved (cannot be function or type)	reserved	reserved	reserved
Media	location	non-reserved	non-reserved	non-reserved
Media	time	non-reserved (cannot be function or type)	reserved	reserved	reserved
MediaTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
Nationalities	name	non-reserved	non-reserved	non-reserved	non-reserved
Participants	end	reserved	reserved	reserved	reserved
Participants	force	non-reserved
Participants	start	non-reserved	reserved	reserved
Platforms	name	non-reserved	non-reserved	non-reserved	non-reserved
PlatformTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
Privacies	level	non-reserved	non-reserved	non-reserved	reserved
Privacies	name	non-reserved	non-reserved	non-reserved	non-reserved
Sensors	name	non-reserved	non-reserved	non-reserved	non-reserved
SensorTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
States	location	non-reserved	non-reserved	non-reserved
States	time	non-reserved (cannot be function or type)	reserved	reserved	reserved
Synonyms	table	reserved	reserved	reserved	reserved
Tags	name	non-reserved	non-reserved	non-reserved	non-reserved
Tasks	end	reserved	reserved	reserved	reserved
Tasks	location	non-reserved	non-reserved	non-reserved
Tasks	name	non-reserved	non-reserved	non-reserved	non-reserved
Tasks	start	non-reserved	reserved	reserved
UnitTypes	name	non-reserved	non-reserved	non-reserved	non-reserved
Users	name	non-reserved	non-reserved	non-reserved	non-reserved

List of tables which are named after keywords:

pepys.Comments

rnllv commented 3 years ago

4) Marking primary key fields as "NOT NULL" to be avoided.

All tables in pepys schema have their primary key columns marked as "NOT NULL". This is redundant and should be avoided.

For e.x., consider the DDL scripts for pepys."Activations"

CREATE TABLE pepys."Activations" (
    activation_id uuid NOT NULL,
    ....
    remarks text
);

ALTER TABLE ONLY pepys."Activations"
    ADD CONSTRAINT "pk_Activations" PRIMARY KEY (activation_id);

The activation_id column in marked as NOT NULL and PRIMARY KEY. By definition, primary key fields are not null.

The following table lists all such primary key columns which have been explicitly marked as "NOT NULL".

Table Name	Column Name	Primary Key Constraint Name
Activations	activation_id	pk_Activations
alembic_version	version_num	alembic_version_pkc
Changes	change_id	pk_Changes
ClassificationTypes	class_type_id	pk_ClassificationTypes
Comments	comment_id	pk_Comments
CommentTypes	comment_type_id	pk_CommentTypes
CommodityTypes	commodity_type_id	pk_CommodityTypes
ConfidenceLevels	confidence_level_id	pk_ConfidenceLevels
Contacts	contact_id	pk_Contacts
ContactTypes	contact_type_id	pk_ContactTypes
Datafiles	datafile_id	pk_Datafiles
DatafileTypes	datafile_type_id	pk_DatafileTypes
Extractions	extraction_id	pk_Extractions
Geometries	geometry_id	pk_Geometries
GeometrySubTypes	geo_sub_type_id	pk_GeometrySubTypes
GeometryTypes	geo_type_id	pk_GeometryTypes
HostedBy	hosted_by_id	pk_HostedBy
Logs	log_id	pk_Logs
LogsHoldings	logs_holding_id	pk_LogsHoldings
Media	media_id	pk_Media
MediaTypes	media_type_id	pk_MediaTypes
Nationalities	nationality_id	pk_Nationalities
Participants	participant_id	pk_Participants
Platforms	platform_id	pk_Platforms
PlatformTypes	platform_type_id	pk_PlatformTypes
Privacies	privacy_id	pk_Privacies
Sensors	sensor_id	pk_Sensors
SensorTypes	sensor_type_id	pk_SensorTypes
States	state_id	pk_States
Synonyms	synonym_id	pk_Synonyms
TaggedItems	tagged_item_id	pk_TaggedItems
Tags	tag_id	pk_Tags
Tasks	task_id	pk_Tasks
UnitTypes	unit_type_id	pk_UnitTypes
Users	user_id	pk_Users

rnllv commented 3 years ago

5) Case sensitive table and constraint names should be avoided

PostgreSQL follows ISO SQL standard where all unquoted identifiers are converted to lowercase and case is retained when identifiers are quoted. It's recommended to stick with lower case definitions

The only known advantage of using case-sensitive identifiers in PostgreSQL is to use same identifier in more than one place with different cases. But this practise is also frowned upon. Case-sensitive identifiers also require users/admin to quote the identifiers whenever querying these objects from their SQL client.

https://dev.to/lefebvre/dont-get-bit-by-postgresql-case-sensitivity--457i https://stackoverflow.com/questions/12076995/databases-why-case-insensitive http://sqlfiddle.com/#!17/8cf7f/4

All tables in the pepys database (except for alembic_version table) are explicitly case-sensitive.

Table Name	Change to
Activations	activations
Changes	changes
ClassificationTypes	classification_types
Comments	comments
CommentTypes	comment_types
CommodityTypes	commodity_types
ConfidenceLevels	confidence_levels
Contacts	contacts
ContactTypes	contact_types
Datafiles	datafiles
DatafileTypes	datafile_types
Extractions	extractions
Geometries	geometries
GeometrySubTypes	geometry_sub_types
GeometryTypes	geometry_types
HostedBy	hosted_by
Logs	logs
LogsHoldings	logs_holdings
Media	media
MediaTypes	media_types
Nationalities	nationalities
Participants	participants
Platforms	platforms
PlatformTypes	platform_types
Privacies	privacies
Sensors	sensors
SensorTypes	sensor_types
States	states
Synonyms	synonyms
TaggedItems	tagged_items
Tags	tags
Tasks	tasks
UnitTypes	unit_types
Users	users

rnllv commented 3 years ago

6) Using text or varchar instead of varchar(n)

The official recommendation from PostgreSQL is to use text or varchar (without size restrictions) in place of varchar(n) data types.

Restricting length with check constraints is considered far more beneficial than defining a varchar column with length restriction.

The following table lists all columns in the pepys database which uses varchar datatype with length restriction.

Table Name	Varchar Column	Length
Activations	name	150
alembic_version	version_num	32
Changes	reason	500
Changes	user	150
ClassificationTypes	name	150
CommentTypes	name	150
CommodityTypes	name	150
ConfidenceLevels	name	150
Contacts	name	150
Contacts	track_number	20
ContactTypes	name	150
Datafiles	hash	32
Datafiles	reference	150
Datafiles	url	150
DatafileTypes	name	150
Extractions	chars	150
Extractions	field	150
Extractions	table	150
Geometries	name	150
GeometrySubTypes	name	150
GeometryTypes	name	150
Logs	field	150
Logs	new_value	150
Logs	table	150
Media	url	150
MediaTypes	name	150
Nationalities	name	150
Participants	force	150
Platforms	identifier	10
Platforms	name	150
Platforms	quadgraph	4
Platforms	trigraph	3
PlatformTypes	name	150
Privacies	name	150
Sensors	name	150
SensorTypes	name	150
Synonyms	synonym	150
Synonyms	table	150
Tags	name	150
Tasks	environment	150
Tasks	location	150
Tasks	name	150
UnitTypes	name	150
Users	name	150

rnllv commented 3 years ago

7) Type casting in constraint

Though I don't have a concrete source to back this up, trials indicate that type casting in constraint have some impact to bulk insert performance. This is still an open point. Further details can be found in this StackExchange page. Thanks @IanMayo for helping with the trials performed in this regard.

The trials were performed with ~97 million records. And the performance impact seen is very minimal (matter of few seconds).

This recommendation would be invalidated by the previous one, if implemented, i.e. Using text or varchar instead of varchar(n)

The following is the list of constraints where explicit type casting to another datatype is performed.

ck_Changes_ck_Changes_reason ck_ClassificationTypes_ck_ClassificationTypes_name ck_CommentTypes_ck_CommentTypes_name ck_CommodityTypes_ck_CommodityTypes_name ck_ConfidenceLevels_ck_ConfidenceLevels_name ck_ContactTypes_ck_ContactTypes_name ck_DatafileTypes_ck_DatafileTypes_name ck_Datafiles_ck_Datafiles_hash ck_GeometrySubTypes_ck_GeometrySubTypes_name ck_GeometryTypes_ck_GeometryTypes_name ck_Logs_ck_Logs_table ck_Media_ck_Media_url ck_MediaTypes_ck_MediaTypes_name ck_Nationalities_ck_Nationalities_name ck_PlatformTypes_ck_PlatformTypes_name ck_Platforms_ck_Platforms_identifier ck_Platforms_ck_Platforms_name ck_Privacies_ck_Privacies_name ck_SensorTypes_ck_SensorTypes_name ck_Sensors_ck_Sensors_name ck_Synonyms_ck_Synonyms_synonym ck_Synonyms_ck_Synonyms_table ck_Tasks_ck_Tasks_name ck_UnitTypes_ck_UnitTypes_name ck_Users_ck_Users_name

rnllv commented 3 years ago

8) Using smallint datatype for low-cardinality columns

We can consider using smallint datatype instead of integer for the below low-cardinality columns. Read/Write performance will be enhanced due to its smaller size.

Table Name	Column Name
Datafiles	size
Nationalities	priority
Privacies	level

PostgreSQL datatype description for int types is as follows.

Name	Storage Size	Description	Range
smallint	2 bytes	small-range integer	-32768 to +32767
integer	4 bytes	typical choice for integer	-2147483648 to +2147483647

rnllv commented 3 years ago

9) Database hygiene suggestions

The following suggestions are just for database hygiene and has nothing to do with application performance.

a) Expanding columns names

We can consider expanding the following columns with proper names.

Table Name	Column Name	New Name
Contacts	ambig_bearing	bearing_ambiguity
Contacts	freq	frequency
Contacts	mla	mean_line_of_advance
Contacts	rel_bearing	relative_bearing
Contacts	soa	speed_of_advance
Extractions	chars	characters
Logs	id	log_id

~~@IanMayo, can you suggest the names where it's marked as not available <<N.A.>> ?~~

b) Changing the names of the following constraints from mixed case to lower case

ck_Changes_ck_Changes_reason ck_Changes_ck_Changes_user ck_ClassificationTypes_ck_ClassificationTypes_name ck_CommentTypes_ck_CommentTypes_name ck_Comments_ck_Comments_content ck_CommodityTypes_ck_CommodityTypes_name ck_ConfidenceLevels_ck_ConfidenceLevels_name ck_ContactTypes_ck_ContactTypes_name ck_DatafileTypes_ck_DatafileTypes_name ck_Datafiles_ck_Datafiles_hash ck_GeometrySubTypes_ck_GeometrySubTypes_name ck_GeometryTypes_ck_GeometryTypes_name ck_Logs_ck_Logs_table ck_Media_ck_Media_url ck_MediaTypes_ck_MediaTypes_name ck_Nationalities_ck_Nationalities_name ck_PlatformTypes_ck_PlatformTypes_name ck_Platforms_ck_Platforms_identifier ck_Platforms_ck_Platforms_name ck_Privacies_ck_Privacies_name ck_SensorTypes_ck_SensorTypes_name ck_Sensors_ck_Sensors_name ck_Synonyms_ck_Synonyms_synonym ck_Synonyms_ck_Synonyms_table ck_Tasks_ck_Tasks_name ck_UnitTypes_ck_UnitTypes_name ck_Users_ck_Users_name fk_Activations_privacy_id_Privacies fk_Activations_sensor_id_Sensors fk_Activations_source_id_Datafiles fk_Changes_datafile_id_Datafiles fk_Comments_comment_type_id_CommentTypes fk_Comments_platform_id_Platforms fk_Comments_privacy_id_Privacies fk_Comments_source_id_Datafiles fk_Contacts_classification_ClassificationTypes fk_Contacts_confidence_ConfidenceLevels fk_Contacts_contact_type_ContactTypes fk_Contacts_privacy_id_Privacies fk_Contacts_sensor_id_Sensors fk_Contacts_source_id_Datafiles fk_Contacts_subject_id_Platforms fk_Datafiles_datafile_type_id_DatafileTypes fk_Datafiles_privacy_id_Privacies fk_Geometries_geo_sub_type_id_GeometrySubTypes fk_Geometries_geo_type_id_GeometryTypes fk_Geometries_privacy_id_Privacies fk_Geometries_sensor_platform_id_Platforms fk_Geometries_source_id_Datafiles fk_Geometries_subject_platform_id_Platforms fk_Geometries_task_id_Tasks fk_GeometrySubTypes_parent_GeometryTypes fk_HostedBy_host_id_Platforms fk_HostedBy_privacy_id_Privacies fk_HostedBy_subject_id_Platforms fk_Logs_change_id_Changes fk_LogsHoldings_commodity_id_CommodityTypes fk_LogsHoldings_platform_id_Platforms fk_LogsHoldings_privacy_id_Privacies fk_LogsHoldings_source_id_Datafiles fk_LogsHoldings_unit_type_id_UnitTypes fk_Media_media_type_id_MediaTypes fk_Media_platform_id_Platforms fk_Media_privacy_id_Privacies fk_Media_sensor_id_Sensors fk_Media_source_id_Datafiles fk_Media_subject_id_Platforms fk_Participants_platform_id_Platforms fk_Participants_privacy_id_Privacies fk_Participants_task_id_Tasks fk_Platforms_nationality_id_Nationalities fk_Platforms_platform_type_id_PlatformTypes fk_Platforms_privacy_id_Privacies fk_Sensors_host_Platforms fk_Sensors_privacy_id_Privacies fk_Sensors_sensor_type_id_SensorTypes fk_States_privacy_id_Privacies fk_States_sensor_id_Sensors fk_States_source_id_Datafiles fk_TaggedItems_tagged_by_id_Users fk_TaggedItems_tag_id_Tags fk_Tasks_parent_id_Tasks fk_Tasks_privacy_id_Privacies pk_Activations pk_Changes pk_ClassificationTypes pk_Comments pk_CommentTypes pk_CommodityTypes pk_ConfidenceLevels pk_Contacts pk_ContactTypes pk_Datafiles pk_DatafileTypes pk_Extractions pk_Geometries pk_GeometrySubTypes pk_GeometryTypes pk_HostedBy pk_Logs pk_LogsHoldings pk_Media pk_MediaTypes pk_Nationalities pk_Participants pk_Platforms pk_PlatformTypes pk_Privacies pk_Sensors pk_SensorTypes pk_States pk_Synonyms pk_TaggedItems pk_Tags pk_Tasks pk_UnitTypes pk_Users uq_ClassificationTypes_name uq_CommentTypes_name uq_CommodityTypes_name uq_ConfidenceLevels_name uq_ContactTypes_name uq_Datafile_size_hash uq_DatafileTypes_name uq_GeometrySubType_name_parent uq_GeometryTypes_name uq_MediaTypes_name uq_Nationalities_name uq_Platform_name_nat_identifier uq_PlatformTypes_name uq_Privacies_name uq_SensorTypes_name uq_UnitTypes_name uq_Users_name

c) Defining constraints externally

It's seen that all CHECK and NOT NULL constraints are defined inline (along with the table definition) where as all the other constraints are defined externally (using ALTER TABLE ... ADD CONSTRAINT).

For the sake of consistency, it's recommended that all CHECK and NOT NULL constraint definitions are also moved externally along with the other constraint definitions.

rnllv commented 3 years ago

10) Few suggestions on PostGis indexes

The following blog has some suggestions on selecting the right indexes for PostGis datatypes and operations. As I'm not fully aware of Pepys/Debrief functionality, kindly check them to see if anything is applicable.

https://www.alibabacloud.com/blog/postgresql-best-practices-selection-and-optimization-of-postgis-spatial-indexes-gist-brin-and-r-tree_597034

debrief / pepys-import

Database Review #745

1) Don't use timestamp (without time zone)

2) Using partitioned table for pepys."States".

3) Refrain from using PostgreSQL and SQL Standard keywords as Table and Column names

4) Marking primary key fields as "NOT NULL" to be avoided.

5) Case sensitive table and constraint names should be avoided

6) Using text or varchar instead of varchar(n)

7) Type casting in constraint

8) Using smallint datatype for low-cardinality columns

9) Database hygiene suggestions

a) Expanding columns names

b) Changing the names of the following constraints from mixed case to lower case

c) Defining constraints externally

10) Few suggestions on PostGis indexes