Changes to contribution table

matamadio commented 3 years ago

The contribution table is present in all schemas and holds some of the key information about a dataset. The previous version had 4 identical but schema-specific tables, which also had a few overlaps with specific schema attributes. I tried my best to make something simpler and more efficient that stores all general info about a dataset, with the following changes:

moved "contribution" table to common schema, removed from subschemas. The second field becomes a general "set_id" instead of e.g. "event_set_id".
added "component" field to common.contribution table to specify which schema the dataset refers to; it references common.component_enum (Hazard, Exposure, Vulnerability, Loss). (handy for specifying which single schema to use on a dataset import; also for the JKAN implementation)
added simple bibliography field (author, title) to contribution table; if there is a paper/report related to one or more datasets, it is worth indicating it at a higher level of the hierarchy. The hazard.event_set table had this, and I thought it was a good idea to extend it to all components. The V schema has two similar but much more specific fields, impe_reference and dm_scale_reference, referring to a third reference table (author_year; title; issn; doi). I did not change those. I suggest simplifying and harmonising the V schema (one or more relevant studies to be added in contribution.bibliography).
added geo_coverage field (comma-separated ISO3 codes) to contribution table; this information is relevant to the whole dataset and to all schemas. Removed "country_iso" from "mover.f_core"; "geographic_area_name" made optional in hazard.event_set (for local-scale datasets)
removed "created at" field from V schema f_core table (as already in contribution)
added "publish" boolean in cf_common.contribution to show/hide datasets
added cf_common.iso to common types; it includes "code" and "name" for 232 countries
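
To make the geo_coverage change concrete, here is a minimal sketch of how a comma-separated value could be checked on import. It is only an illustration: `parse_geo_coverage` and `ISO_NAMES` are hypothetical names, and the three-entry dictionary is a stand-in for the full iso lookup in the common schema.

```python
# Hypothetical stand-in for the iso lookup (the real one holds 232 entries).
ISO_NAMES = {"AFG": "Afghanistan", "NPL": "Nepal", "PAK": "Pakistan"}


def parse_geo_coverage(value: str) -> list:
    """Split a comma-separated geo_coverage string and check every code."""
    # Tolerate stray spaces and lower-case input; drop empty fragments.
    codes = [c.strip().upper() for c in value.split(",") if c.strip()]
    if not codes:
        raise ValueError("geo_coverage is required: give at least one code")
    unknown = [c for c in codes if c not in ISO_NAMES]
    if unknown:
        raise ValueError("unknown ISO code(s): " + ", ".join(unknown))
    return codes


print(parse_geo_coverage("AFG, npl"))
```

Normalising case and whitespace before the lookup is a design choice of this sketch, not something the schema prescribes.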

Old contribution table

Required	Field name	Type	Reference table	Description
*	event_set_id	INT		Unique number ID of event_set
*	model_source	VARCHAR		Name of source model
*	model_date	DATE		Model release date
	notes	TEXT		Details about the dataset
	version	VARCHAR		Version of the dataset
	purpose	TEXT		Purpose for what the data has been produced
	project	VARCHAR		Project under which data has been produced
*	contributed_at	timestamp		Date of contribution
*	license_code	ENUM	cf_common.license	Type of license

New contribution table

Required	Field name	Type	Reference table	Description
*	component	ENUM	common.component.enum	Schema to be used (H, E, V, L)
*	set_id	INT		Unique number ID
*	model_source	VARCHAR		Name of source model
*	model_date	DATE		Model release date
	notes	TEXT		Details about the dataset
	version	VARCHAR		Version of the dataset
	purpose	TEXT		Purpose for what the data has been produced
	project	VARCHAR		Project under which data has been produced
	bibliography	TEXT		Title and authors of studies containing relevant information
*	geo_coverage	ENUM	cf_common.iso	ISO code(s) of countries covered by the dataset, comma-separated
*	contributed_at	timestamp		Date of contribution
*	publish	BOOLEAN		Flag to show/hide dataset from website
*	license_code	ENUM	cf_common.license	Type of license
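
For reference, the new table could be sketched roughly as below. This uses Python's built-in sqlite3 as a stand-in, so the ENUM columns are emulated with TEXT and CHECK constraints. The composite key on (component, set_id) is an assumption of this sketch, and the date and license values in the example row are placeholders, not real metadata.

```python
import sqlite3

# SQLite stand-in for the proposed table. Required fields (*) become NOT NULL.
DDL = """
CREATE TABLE contribution (
    component      TEXT NOT NULL
                   CHECK (component IN ('Hazard', 'Exposure', 'Vulnerability', 'Loss')),
    set_id         INTEGER NOT NULL,
    model_source   TEXT NOT NULL,
    model_date     TEXT NOT NULL,   -- date stored as an ISO string
    notes          TEXT,
    version        TEXT,
    purpose        TEXT,
    project        TEXT,
    bibliography   TEXT,
    geo_coverage   TEXT NOT NULL,   -- comma-separated ISO codes
    contributed_at TEXT NOT NULL,
    publish        INTEGER NOT NULL CHECK (publish IN (0, 1)),
    license_code   TEXT NOT NULL,
    PRIMARY KEY (component, set_id) -- assumption: set_id is unique per component
);
"""

INSERT = """
INSERT INTO contribution
    (component, set_id, model_source, model_date, geo_coverage,
     contributed_at, publish, license_code)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
# Placeholder row: only the model_source and country come from this thread.
conn.execute(INSERT, ("Hazard", 1, "disasterrisk.af", "2000-01-01", "AFG",
                      "2000-01-01T00:00:00", 1, "PLACEHOLDER-LICENSE"))
print(conn.execute("SELECT component, set_id, geo_coverage FROM contribution").fetchall())
```

An insert with a component outside the four allowed values, or with a missing required field, fails with `sqlite3.IntegrityError`.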

matamadio commented 3 years ago

Review schema and optimisation

stufraser1 commented 3 years ago

These changes make sense: they simplify and consolidate contribution, and they resolve the issue of inconsistent ISO codes and the overlap of contribution information in MOVER.

We need to make sure that if a user wants to implement just 1-2 of the schema, not all, then this structure will work -- but I think they would have to replicate cf_common in that case, so there should be no issue there. Agree?

We should clearly define the content of model_source - we tend to refer to datasets, rather than model. Please review original technical documents to ensure we describe it as intended.

Outstanding in this is the question of notes+purpose, vs using abstract (which is more common and aligns better with existing metadata standards.) The intention of notes+purpose was to constrain the information being accepted, but practically this lack of alignment with abstract field in metadata standards may be an issue that overrides this intention.

stufraser1 commented 3 years ago

Comment on bibliography field: Should we consider a new table to give more details on the publication (as already given in MOVER) for all other data types? Should we consider the potential for >1 report/paper to be associated with a dataset? In that case bibliography must contain >1 author-year references.

matamadio commented 3 years ago

> We need to make sure that if a user wants to implement just 1-2 of the schema, not all, then this structure will work -- but I think they would have to replicate cf_common in that case, so there should be no issue there. Agree?

Yep, the schema would go: contribution (common) + schema attributes (specific). So the table is shared instead of duplicated, but in practice the result is the same.
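
As a rough illustration of a hazard-only deployment, the sketch below keeps one shared contribution table next to a single component table. It uses sqlite3 with flattened stand-in names (`common_contribution`, `hazard_event_set`), and it assumes that set_id points at the event set's `id` column, which is a guess about the linkage rather than the confirmed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared table, replicated in every deployment (stand-in for the common schema).
CREATE TABLE common_contribution (
    component    TEXT NOT NULL,
    set_id       INTEGER NOT NULL,
    model_source TEXT NOT NULL,
    PRIMARY KEY (component, set_id)
);
-- The only component-specific table installed here (stand-in for hazard.event_set).
CREATE TABLE hazard_event_set (
    id                   INTEGER PRIMARY KEY,   -- assumed key column
    geographic_area_name TEXT                   -- optional for local-scale datasets
);
INSERT INTO common_contribution VALUES ('Hazard', 1, 'disasterrisk.af');
INSERT INTO hazard_event_set VALUES (1, NULL);
""")

# General info comes from the shared table, specifics from the component table.
rows = conn.execute("""
    SELECT c.model_source, e.id
    FROM common_contribution AS c
    JOIN hazard_event_set AS e ON e.id = c.set_id
    WHERE c.component = 'Hazard'
""").fetchall()
print(rows)
```

Adding another component would mean adding its own tables and using a different `component` value in the same shared table.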

> We should clearly define the content of model_source - we tend to refer to datasets, rather than model. Please review original technical documents to ensure we describe it as intended.

I have to read through the original docs for many fields that I can't place correctly in practice, and fill in the examples. For example, for the AFG dataset, what would be the model source? I just put disasterrisk.af, for lack of better info.

> Outstanding in this is the question of notes+purpose, vs using abstract (which is more common and aligns better with existing metadata standards.) The intention of notes+purpose was to constrain the information being accepted, but practically this lack of alignment with abstract field in metadata standards may be an issue that overrides this intention.

Agreed. A more general "abstract" or "description" fits better with existing schemas and data. Notes+Purpose is more detailed but also easier to misunderstand. I am now using the notes field for the abstract in JKAN.

> Comment on bibliography field: Should we consider a new table to give more details on the publication (as already given in MOVER) for all other data types? Should we consider the potential for >1 report/paper to be associated with a dataset? In that case bibliography must contain >1 author-year references.

I'd compromise with two short biblio fields in the contribution table that allow multiple entries.

biblio_auth_title: [author(s)A, titleA, yearA]; [author(s)B, titleB, yearB];
biblio_url: [link to publicationA]; [link to publicationB]
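
A quick sketch of how the two fields could be read back together, assuming each entry is wrapped in square brackets and the Nth URL belongs to the Nth reference. `pair_bibliography` is a hypothetical helper, and the references in the example are made-up placeholders.

```python
import re

# One bracketed entry: everything between "[" and the next "]".
ENTRY = re.compile(r"\[([^\]]*)\]")


def pair_bibliography(biblio_auth_title: str, biblio_url: str) -> list:
    """Return (reference, url) pairs matched by position."""
    refs = [r.strip() for r in ENTRY.findall(biblio_auth_title)]
    urls = [u.strip() for u in ENTRY.findall(biblio_url)]
    if len(refs) != len(urls):
        raise ValueError(
            "expected one URL per reference, got %d references and %d URLs"
            % (len(refs), len(urls))
        )
    return list(zip(refs, urls))


# Placeholder entries, not real publications.
pairs = pair_bibliography(
    "[Author A, Title A, 2001]; [Author B, Title B, 2002];",
    "[https://example.org/a]; [https://example.org/b]",
)
for ref, url in pairs:
    print(ref, "->", url)
```

The reference is kept as one string because author lists and titles can contain commas, so splitting on "," would not be reliable without a stricter format.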

matamadio commented 3 years ago

Superseded by #40

GFDRR / rdl-data

Changes to contribution table #35