GFDRR / rdl-data

Challenge Fund Database combining Hazard, Exposure, Loss and Vulnerability schema into a single database
GNU Affero General Public License v3.0

Changes to contribution table #35

Closed matamadio closed 3 years ago

matamadio commented 3 years ago

The contribution table is present in all schemas and holds some of the key information about a dataset. The previous version had 4 identical but schema-specific tables, which also had a few overlaps with schema-specific attributes. I tried to come up with something simpler and more efficient that stores all general info about a dataset, with the following changes:

Old contribution table

| Required | Field name | Type | Reference table | Description |
|---|---|---|---|---|
| * | event_set_id | INT | | Unique number ID of event_set |
| * | model_source | VARCHAR | | Name of source model |
| * | model_date | DATE | | Model release date |
| | notes | TEXT | | Details about the dataset |
| | version | VARCHAR | | Version of the dataset |
| | purpose | TEXT | | Purpose for what the data has been produced |
| | project | VARCHAR | | Project under which data has been produced |
| * | contributed_at | timestamp | | Date of contribution |
| * | license_code | ENUM | cf_common.license | Type of license |

New contribution table

| Required | Field name | Type | Reference table | Description |
|---|---|---|---|---|
| * | component | ENUM | common.component.enum | Schema to be used (H, E, V, L) |
| * | set_id | INT | | Unique number ID |
| * | model_source | VARCHAR | | Name of source model |
| * | model_date | DATE | | Model release date |
| | notes | TEXT | | Details about the dataset |
| | version | VARCHAR | | Version of the dataset |
| | purpose | TEXT | | Purpose for what the data has been produced |
| | project | VARCHAR | | Project under which data has been produced |
| | bibliography | TEXT | | Title and authors of studies containing relevant information |
| * | geo_coverage | ENUM | cf_common.iso | ISO code(s) of countries covered by the dataset, comma-separated |
| * | contributed_at | timestamp | | Date of contribution |
| * | publish | BOOLEAN | | Flag to show/hide dataset from website |
| * | license_code | ENUM | cf_common.license | Type of license |

matamadio commented 3 years ago

Review schema and optimisation

stufraser1 commented 3 years ago

These changes make sense: they simplify and consolidate the contribution table, and they resolve the issue of inconsistent ISO codes and the overlap of contribution information in MOVER.

We need to make sure that if a user wants to implement just 1-2 of the schema, not all, then this structure will work -- but I think they would have to replicate cf_common in that case, so there should be no issue there. Agree?

We should clearly define the content of model_source - we tend to refer to datasets, rather than model. Please review original technical documents to ensure we describe it as intended.

Outstanding in this is the question of notes+purpose, vs using abstract (which is more common and aligns better with existing metadata standards.) The intention of notes+purpose was to constrain the information being accepted, but practically this lack of alignment with abstract field in metadata standards may be an issue that overrides this intention.

stufraser1 commented 3 years ago

Comment on bibliography field: Should we consider a new table to give more details on the publication (as already given in MOVER) for all other data types? Should we consider potential for >1 report/paper to be associated with a dataset - in which case bibliography must contain >1 author-year references

matamadio commented 3 years ago

> We need to make sure that if a user wants to implement just 1-2 of the schema, not all, then this structure will work -- but I think they would have to replicate cf_common in that case, so there should be no issue there. Agree?

Yep, the schema would go: contribution (common) + schema attributes (specific). So the table is shared instead of duplicated, but in practice the result is the same.
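
For illustration, here is a minimal sketch of that "common + specific" layout using Python's built-in sqlite3. It is not the project's actual DDL: SQLite has no ENUM type, so TEXT with CHECK stands in for it, and the composite primary key, the `hazard_event_set` table with its `hazard_type` column, and all inserted values other than `disasterrisk.af` and `AFG` are assumptions made up for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE contribution (
    component      TEXT NOT NULL CHECK (component IN ('H', 'E', 'V', 'L')),
    set_id         INTEGER NOT NULL,
    model_source   TEXT NOT NULL,
    model_date     TEXT NOT NULL,
    notes          TEXT,
    version        TEXT,
    purpose        TEXT,
    project        TEXT,
    bibliography   TEXT,
    geo_coverage   TEXT NOT NULL,
    contributed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    publish        INTEGER NOT NULL DEFAULT 0 CHECK (publish IN (0, 1)),
    license_code   TEXT NOT NULL,
    PRIMARY KEY (component, set_id)  -- assumption: key not stated in the thread
);

-- Hypothetical schema-specific table; its columns are placeholders.
CREATE TABLE hazard_event_set (
    set_id      INTEGER PRIMARY KEY,
    component   TEXT NOT NULL DEFAULT 'H' CHECK (component = 'H'),
    hazard_type TEXT,
    FOREIGN KEY (component, set_id)
        REFERENCES contribution (component, set_id)
);
"""


def open_db(path=":memory:"):
    """Open a database and create the shared and schema-specific tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def add_contribution(conn, **fields):
    """Insert one contribution row; keyword names must match column names."""
    columns = ", ".join(fields)
    placeholders = ", ".join(f":{name}" for name in fields)
    conn.execute(
        f"INSERT INTO contribution ({columns}) VALUES ({placeholders})", fields
    )


conn = open_db()
add_contribution(
    conn,
    component="H",
    set_id=1,
    model_source="disasterrisk.af",
    model_date="2017-01-01",   # placeholder date
    geo_coverage="AFG",
    publish=1,
    license_code="CC-BY-4.0",  # placeholder license code
)
# The schema-specific row points back at the shared contribution row.
conn.execute("INSERT INTO hazard_event_set (set_id, hazard_type) VALUES (1, 'flood')")
```

A user implementing only one schema would still create `contribution` (plus whatever lookup tables it references) and just leave out the other schema-specific tables.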

> We should clearly define the content of model_source - we tend to refer to datasets, rather than model. Please review original technical documents to ensure we describe it as intended.

I have to read through the original docs for the many fields that I can't yet place properly in practice, and fill in the examples. For example, for the AFG dataset, what would the model source be? I just put disasterrisk.af, for lack of better info.

> Outstanding in this is the question of notes+purpose, vs using abstract (which is more common and aligns better with existing metadata standards.) The intention of notes+purpose was to constrain the information being accepted, but practically this lack of alignment with abstract field in metadata standards may be an issue that overrides this intention.

Agreed. A more general "abstract" or "description" fits better with existing schemas and data. Notes+Purpose is more detailed but also easier to misunderstand. I am now using the notes field in JKAN for the abstract.

> Comment on bibliography field: Should we consider a new table to give more details on the publication (as already given in MOVER) for all other data types? Should we consider potential for >1 report/paper to be associated with a dataset - in which case bibliography must contain >1 author-year references

I'd compromise: two short biblio fields in the contribution table, each allowing multiple entries.

biblio_auth_title: [author(s)A, titleA, yearA]; [author(s)B, titleB, yearB];
biblio_url: [link to publicationA]; [link to publicationB]
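
As a rough sketch of how a consumer might read these two fields back, assuming entries are wrapped in square brackets, separated by semicolons, and listed in the same order in both fields (the function names here are hypothetical, not part of any schema):

```python
def split_entries(field: str) -> list[str]:
    """Split '[a]; [b];' into ['a', 'b'], dropping empty pieces."""
    entries = []
    for piece in field.split(";"):
        piece = piece.strip()
        if piece.startswith("[") and piece.endswith("]"):
            piece = piece[1:-1].strip()
        if piece:
            entries.append(piece)
    return entries


def pair_bibliography(auth_title: str, url: str) -> list[tuple[str, str]]:
    """Match each reference with its link by position."""
    refs = split_entries(auth_title)
    links = split_entries(url)
    if len(refs) != len(links):
        raise ValueError(
            f"{len(refs)} references but {len(links)} links; cannot pair them"
        )
    return list(zip(refs, links))
```

One caveat with this format: a title or URL that itself contains a semicolon would be split incorrectly, so the separator may need escaping, or a separate publication table may be the safer route.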
matamadio commented 3 years ago

Superseded by #40