Open-Earth-Foundation / OpenClimate-Schema

Schema for OpenClimate database
Apache License 2.0
7 stars 2 forks source link

OpenClimate Schema 1.0

This is the Schema for the OpenClimate 1.0 database. It's used by OpenClimate to store information about climate-related actors in both the public and private sector, their actions and targets, and their relationships to one another.

The schema supports importing data sets from existing sources. It also supports self-reported data in the OpenClimate interface.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Source code

The data-definition language SQL source code is available in the SQL directory. There is one .sql file per table.

Open Earth Foundation (OEF) uses PostgreSQL for hosting the database directly. Consequently, these SQL files use a PostgreSQL-inflected dialect of SQL. They should probably work with other relational database management systems, but might require some tuning.

The .sql files are functional, but their primary use is for documenting the database structure.

The scripts are not idempotent. If you run Actor.sql a second time without dropping the table or database, you will probably get an error.

Changes to the tables are made directly to each file; this makes them useful for creating a new database, but bad for updating or altering an existing database. OEF uses db-migrate to manage changes to the OpenClimate database directly.

Note that some tables have dependencies on each other because of foreign key definitions. The tables.txt file lists the tables in dependency order, so this command should work for setting up a new database:

for t in `<tables.txt`; do psql -f SQL/$t.sql; done

You may need to provide hostname, username, password, and database parameters to psql to get this work correctly!

Entity-relationship diagram

The following entity-relationship diagram (ERD) shows how the tables are structured and how they interact from a high level. See Table details for more information about each table.

Entity-relationship diagram for OpenClimate

Design principles

The schema follows a few design principles.

Table details

These categories of table are marked in the ERD with different colours. They're not functionally important; all the tables are part of the same schema and database.

Undocumented tables in the SQL subdirectory that aren't listed here are actively used and haven't been fully tested. Don't use them.

Undocumented columns in tables named below that aren't called out aren't actively used and shouldn't be used by you.

The column descriptions below are descriptive; refer to the ERD for data types and keys.

Actors

One of the big advantages of the OpenClimate schema is that actors in the climate change world are treated uniformly, regardless of their size or whether they are public or private.

Actor

An Actor is an entity that is responsible for CO2 emissions in some way. Actors include countries; sub-national regions like states and provinces; cities; private entities like corporations; and sites owned by public or private actors, like mines, farms, factories, and office buildings.

OpenClimate does not track individual human beings as Actors.

ActorIdentifier

This is the table we use to track structured identifiers for Actors in different namespaces from our default ones. It helps with harmonizing datasets, especially those that use a structured identity format. For human-readable names, see ActorName.

The rows are unique on (identifier, and namespace). For example, there is only one entry for identifier '300' in namespace 'ISO-3166-1 numeric', with actor_id = 'GR' (Greece).

To avoid complicated queries, we include the default actor_id namespace in the ActorIdentifier table, too. So, there is an ActorIdentifier row in namespace 'ISO-3166-1 alpha-2' with identifier 'GR' and actor_id 'GR'. This makes getting all the identifiers for an Actor, or looking up an actor by identifier and namespace, a simpler process, at the expense of slightly more storage and slightly ridiculous rows of data.

ActorName

A human-readable name for an Actor. Actors can have multiple names in the same language, and multiple names in different languages. Two different actors can have the same name in the same language, or in different languages.

For example, there are 21 actors with an ActorName row with name "Springfield", in the USA, Canada and South Africa. Actor names are sometimes but not always unique within a parent geographical (is_part_of) boundary.

See ActorIdentifier for structured identifiers in a namespace.

Public-sector actor

For governments, we keep information about the territory governed as well as simple population and GDP metrics.

Territory

This represents an area governed by the actor.

Population

Population history or projections of the territory of the Actor. Each row represents a year of population. The rows are unique by actor_id and year. We don't keep multiple estimates of population for the same Actor and year.

GDP

This table represents the gross domestic product history of the Actor. There is one row per actor per year, maximum. We don't keep different estimates of GDP for the same actor for one year.

Emissions

To track carbon dioxide (CO2) emissions and equivalent levels of emissions of other greenhouse gases, this part of the schema uses a few tables. The aggregate table EmissionsAgg holds annual emissions for an Actor for a year. The aggregates can be broken down by scope 1, 2, and 3 in EmissionsByScope or by sector in EmissionsBySector, or both.

EmissionsAgg

Each row in this table represents aggregate emissions of CO2 and other greenhouse gases by an actor during a single year. The rows are unique by actor_id, year, and datasource_id; there may be multiple rows for a single actor for a single year. This lets us compare different models of truth from different data sources.

EmissionsAggTag

A Tag on an EmissionsAgg row is used if there are notable features on the emissions row are different from those on the data source. For example, the methodology for an emissions data source is self-reported, but for this row, it's also validated by an auditor.

Note that this table doesn't have a datasource_id. The tag is assumed to derive from the same DataSource as the EmissionsAgg row it references.

Don't repeat tags for every row from a data source; just tag the data source.

EmissionsByScope

This table represents a breakdown of a single emissions row by scope where those breakdowns are provided. The sum of the emissions values for the different scopes should be less than or equal to the EmissionsAgg row's total, but might not due to reporting differences that aren't represented in this schema.

Note that there is no datasource_id; the DataSource is assumed to be the same as the parent EmissionsAgg row.

EmissionsBySector

A group of emissions by sector rows represents a breakdown of the total emissions by the actor according to the activity that caused the emissions, such as transportation, electricity generation, or agriculture. See Sector for a description of how sectors are represented.

The rows are unique by emissions ID and sector ID.

The sum of the emissions values for the different sectors in a single namespace should be less than or equal to the EmissionsAgg row's total, but discrepancies might occur due to reporting differences that aren't represented in this schema.

Note that there is no datasource_id; the DataSource is assumed to be the same as the parent EmissionsAgg row.

Sector

Each row represents a sector of activity that produces CO2 or equivalent greenhouse gases. Different reporting tools and regulatory agencies track sectors of activity slightly differently, so this table allows a namespace to control uniqueness of the rows.

EmissionsBreakdown

This table represents a breakdown of the total emissions by the actor according to the emissions scope, activity that caused the emissions, and the greenhouse gases included in the emissions reported, whenever the details are made available. The rows are unique by emissions ID, sector ID, scope, and greenhouse gases included.

The sum of the emissions values for the different breakdowns should be more or less equal to the EmissionsAgg row's total, but discrepancies might occur due to reporting differences that aren't represented in this schema.

Targets

This cluster of tables represents targets for emissions reductions or similar mitigations for climate change.

Target

A Target is a goal set by the Actor, often as part of a treaty or other agreement with similar Actors, but sometimes a voluntary reduction.

Targets are usually unique for a target type, target year, target unit, and actor, but there may be exceptions.

TargetTag

Targets that have properties that aren't well-represented by the standard table columns can add Tags to the rows for the target to represent those properties. For example, the nationally-defined commitments (NDCs) of the Paris Agreement are often marked as "conditional" (won't happen without financial aid) or "unconditional" (will happen with or without aid). Since that aspect is specific to the Paris Agreement, we represent the property with a tag.

Note that this table doesn't have a datasource_id. The tag is assumed to derive from the same DataSource as the Target row it references.

TargetTag rows provide a many-to-many relationship between Target rows and Tag rows.

Don't repeat tags for every row from a data source; just tag the data source.

Metadata

To track data provenance, we use a number of tables related to data sets and publishers.

DataSource

A DataSource is a single version of a dataset or document used to derive data for this database. Multiple versions of the same publication have different DataSource rows.

To the extent that datasources may come from aggregated or edited secondary sources, we tend to name the secondary source instead of the primary source. This makes tracking down errors and updates easier, to the detriment of giving links and credit to the primary source.

DataSourceTag

A Tag is a property of a data set, such as the methodology used for collecting or calculating the data, or the data license, or other ways of identifying data sets.

This table provides a many-to-many relationship between DataSource and Tag.

Publisher

A data or document publisher. This table lets us collect DataSource rows produced by the same organization into a related set.

Tag

A tag for a row, so we can have some extra data about it that isn't captured in the columns.

Tag rows are joined to rows in other tables by a many-to-many junction table. See DataSourceTag or EmissionsAggTag for examples.

See https://en.wikipedia.org/wiki/Tag_(metadata) for how tags can be used.

Comments and questions

Comments, questions and suggestions for this schema are tracked in the Open-Earth-Foundation/OpenClimate-Schema repo issues.