USGCRP / gcis-conventions

Repository for the collection, management, and versioning of the GCIS data management conventions.
https://usgcrp.github.io/gcis-conventions/

Dataset Conventions Discussion #17

Closed lomky closed 5 years ago

lomky commented 6 years ago

A ticket to discuss the conventions surrounding Dataset.

Current Dataset Conventions (Blank).

Dataset Fields:

         Column          |  Database Description
-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 identifier              | A globally unique identifier for this dataset.  This may be a composite identifier derived from external identifier or publications associated with this dataset.
 name                    | A brief descriptive name.
 type                    | A free form type for this dataset.
 version                 | The version.
 description             | A narrative description of this dataset.  If the description is a direct quote available at a URL, put that URL into description_attribution.
 native_id               | The identifier for this dataset given by the producer or archive for the dataset.
 access_dt               | The date on which this dataset was accessed.
 url                     | A URL for a landing page.
 data_qualifier          | Assumptions or qualifying statements about this data.
 scale                   | If the data has been scaled, describe that here.
 spatial_ref_sys         | The spatial reference system.
 cite_metadata           | The preferred way to cite this dataset.
 scope                   | The scope of the data.
 spatial_extent          | Brief description of the spatial extent, which corresponds to lat_min/lat_max, lon_min/lon_max
 temporal_extent         | Brief description of the temporal extent, which corresponds to start_time/end_time
 vertical_extent         | A brief description of the vertical extent.
 processing_level        | The processing level, if applicable.
 spatial_res             | The spatial resolution.
 doi                     | A digital object identifier.
 release_dt              | The date on which this dataset was released.
 publication_year        | The year in which this dataset was initially published.
 attributes              | Free form comma separated attributes for this dataset.
 variables               | Variables represented by this dataset.
 start_time              | The beginning of the temporal extent.
 end_time                | The end of the temporal extent.
 lat_min                 | The southernmost latitude in the bounding box for this dataset.
 lat_max                 | The northernmost latitude in the bounding box for this dataset.
 lon_min                 | The westernmost longitude in the bounding box for this dataset.
 lon_max                 | The easternmost longitude in the bounding box for this dataset.
 description_attribution | A URL which contains a description of this dataset.
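
The extent fields in the table above come in min/max and start/end pairs, so a record can be checked for internal consistency. A minimal sketch, assuming a record is a plain dict keyed by the column names, latitudes and longitudes are in decimal degrees, and times are `datetime` objects (none of which this thread specifies; `check_extents` is a hypothetical helper, not part of GCIS):

```python
from datetime import datetime


def check_extents(record: dict) -> list[str]:
    """Return a list of problems found in a record's extent fields."""
    problems = []

    # Latitude: both ends must be valid, and lat_min is the southern edge.
    lat_min, lat_max = record.get("lat_min"), record.get("lat_max")
    if lat_min is not None and lat_max is not None:
        if not (-90 <= lat_min <= 90 and -90 <= lat_max <= 90):
            problems.append("latitude outside [-90, 90]")
        elif lat_min > lat_max:
            problems.append("lat_min is north of lat_max")

    # Longitude: only range-check each end. lon_min > lon_max is not flagged,
    # because a bounding box may cross the antimeridian.
    for key in ("lon_min", "lon_max"):
        lon = record.get(key)
        if lon is not None and not -180 <= lon <= 180:
            problems.append(f"{key} outside [-180, 180]")

    # Temporal extent: start_time must not come after end_time.
    start, end = record.get("start_time"), record.get("end_time")
    if start is not None and end is not None and start > end:
        problems.append("start_time is after end_time")

    return problems


# Illustrative values only, not taken from any real GCIS record.
example = {
    "lat_min": 24.5, "lat_max": 49.4,
    "lon_min": -125.0, "lon_max": -66.9,
    "start_time": datetime(1950, 1, 1), "end_time": datetime(2099, 12, 31),
}
print(check_extents(example))
```

Missing fields are skipped rather than reported, since the table does not say which columns are required.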

Provenance Connections:

figure "prov:wasDerivedFrom" dataset, through activity

Relationships:

contributors
files
gcmd_keywords
regions

lomky commented 6 years ago

It's a bit of a tangled situation.

Issues to handle:

lomky commented 6 years ago

Issue: insufficient provenance fields

Good fields

Harmful fields

Versions of Dataset

lomky commented 6 years ago

Field Breakdown by Corrected Ownership

GCIS Dataset

identifier  
name                   
version                
description            
native_id              
url                    
doi                    
release_dt             
publication_year       
description_attribution

External Dataset

These are the business of the dataset itself; GCIS has no stake in them.

type                   
data_qualifier         
spatial_ref_sys        
cite_metadata          
scope                  
spatial_extent         
temporal_extent        
vertical_extent        
processing_level       
spatial_res            
start_time             
end_time               
lat_min                
lat_max                
lon_min                
lon_max                

Activity

These fields have more to do with how a dataset was used than with the dataset itself or with the provenance metadata. Generally they should go on the Activity in GCIS.

access_dt
scale

If they used a subset of a dataset, they may have:

spatial_extent         
temporal_extent        
vertical_extent  
start_time             
end_time               
lat_min                
lat_max                
lon_min                
lon_max                

Should not exist

We either do not know what these fields are for, or they are catch-alls that are not useful.

scale                  
attributes             
variables  
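
The breakdown above can be written down as data, which makes it easy to confirm that every column is assigned somewhere and to see which fields are claimed by more than one owner. A minimal sketch; the category keys (`gcis_dataset`, `external_dataset`, `activity`, `should_not_exist`) and the helper names are labels invented here for illustration, not GCIS terms:

```python
from collections import defaultdict

# Extent fields listed under both "External Dataset" and "Activity".
EXTENT_FIELDS = {
    "spatial_extent", "temporal_extent", "vertical_extent",
    "start_time", "end_time",
    "lat_min", "lat_max", "lon_min", "lon_max",
}

OWNERSHIP = {
    "gcis_dataset": {
        "identifier", "name", "version", "description", "native_id",
        "url", "doi", "release_dt", "publication_year",
        "description_attribution",
    },
    "external_dataset": {
        "type", "data_qualifier", "spatial_ref_sys", "cite_metadata",
        "scope", "processing_level", "spatial_res",
    } | EXTENT_FIELDS,
    "activity": {"access_dt", "scale"} | EXTENT_FIELDS,
    "should_not_exist": {"scale", "attributes", "variables"},
}


def owners_by_field(ownership: dict) -> dict:
    """Invert the mapping: field name -> list of categories that list it."""
    owners = defaultdict(list)
    for category, fields in ownership.items():
        for field in sorted(fields):
            owners[field].append(category)
    return dict(owners)


def shared_fields(ownership: dict) -> dict:
    """Return only the fields that appear under more than one category."""
    return {
        field: categories
        for field, categories in owners_by_field(ownership).items()
        if len(categories) > 1
    }


for field, categories in sorted(shared_fields(OWNERSHIP).items()):
    print(f"{field}: {', '.join(categories)}")
```

Under this reading, `scale` is listed under both Activity and "Should not exist", and the nine extent fields are listed under both External Dataset and Activity.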

lomky commented 6 years ago

identifier

name

version

description

native_id

url

doi

release_dt

publication_year

description_attribution

rasherman commented 6 years ago

Made a couple of typo fixes, looks great to me.

R-Aniekwu commented 5 years ago

Emergent GCIS dataset-related questions:

• In cases where a prospective dataset has many (more than two) contributing organizations/dataset producers, no clear lead organization, and no citation documentation, how does the GCIS dataset name convention fare? For example, consider this dataset/data archive at https://gdo-dcp.ucllnl.org/downscaled_cmip_projections/dcpInterface.html#Links

• Besides being imported from “data.gov”, are there other reasons why we should/could consider using a dataset’s original identifier, especially when the pertinent dataset lacks a DOI?

• When should we use a parent organization name rather than a subsidiary name in our dataset name/identifier convention? For example, DOE as the prefix, rather than LLNL or EIA (the link in the first comment has LLNL as a data developer - the privacy and legal notice also links to LLNL's official website. Yet, the dataset website ends in ".org" and not ".gov").

R-Aniekwu commented 5 years ago

The dataset identifier that the previous comments applied to seems fine (https://data-stage.globalchange.gov/dataset/ucllnl-downscaled-cmip3-cmip5-climate-hydrology-projections). No revision is needed.

I have reviewed recently created datasets, and the identifiers also seem fine. I will continue the review process indefinitely.
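
A review like this could be partly automated with a shape check on identifiers. A minimal sketch, assuming identifiers are lowercase alphanumeric tokens joined by single hyphens; that pattern is only inferred from the one identifier in the URL above, not from a written GCIS convention, and both helper names are hypothetical:

```python
import re
from urllib.parse import urlparse

# Assumed shape: lowercase letters/digits in hyphen-separated tokens.
IDENTIFIER_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def identifier_from_url(url: str) -> str:
    """Take the last path segment of a dataset URL as its identifier."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def looks_like_identifier(candidate: str) -> bool:
    """Check the candidate against the assumed identifier shape."""
    return IDENTIFIER_PATTERN.fullmatch(candidate) is not None


url = (
    "https://data-stage.globalchange.gov/dataset/"
    "ucllnl-downscaled-cmip3-cmip5-climate-hydrology-projections"
)
identifier = identifier_from_url(url)
print(identifier, looks_like_identifier(identifier))
```

A check like this only catches formatting problems; whether the organization prefix is the right one remains a judgment call for the reviewer.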