C1. iterate on activity to dataset relationships

zednis commented 9 years ago

Check entries under http://data.globalchange.gov/activity and create terms accordingly, relating to other datasets.

Need something for

[ ] computing environment
[ ] data usage
[ ] names of input/output data files
[ ] etc. (?)

Maybe relate these to the pre-existing DBpedia terms already in use in the Turtle there.

See https://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process.ttl for an example of how we have tackled this problem to date.

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dbpedia_owl: <http://dbpedia.org/ontology/> .
@prefix gcis: <http://data.globalchange.gov/gcis.owl#> .
@prefix meth: <http://sweet.jpl.nasa.gov/2.3/reprSciMethodology.owl#> .

<http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process>
   dcterms:identifier "0158fa86-nca3-ghcn-daily-r201305-process";
   dcterms:description "Decadal average anomalies for the 99th percentile of precipitation (difference between the decade and the 1901-1960 average precipitation) for the Northwest region were plotted as a bar graph. Note: the far right bar contains data for 12 years (2001-2012)."^^xsd:string;

## The activity began and ended at the following times

## Duration of the activity
   dcterms:SizeOrDuration "6 hours"^^xsd:string;

## Output datafiles   
   dbpedia_owl:filename "getprcpextremes99perc_May03_2013.f95\r\nprcpextremes99perc_1901_2012_May03_2013.txt \r\ngridaverage_99per_regions_May03_2013.f95\r\ngridprcpextremes99perc_regions_1901_2012_May03_2013.txt\r\ngridprecipextremes99perc_regions_1901_2012_May03_2013.csv\r\nperc99_decadal_barchart.pro\r\n99th_perc_anom_decadal_values_1901-2012.txt\r\nNW_99th_perc_anom_pct_1901_2012.eps\r\n2-17_nw.png\r\nCS_Extreme Heavy precipitation_v7.png"^^xsd:string;

## Software utilized
   gcis:Software "Fortran 95; lf95 compiler (Lahey/Fujitsu Linux64 Fortran compiler release L8.10b); IDL (version 8.0)"^^xsd:string;

## Computing environment
   dcterms:InteractiveResource "Linux (CentOS release 6.4); Mac OS X (darwin x86_64 m64)"^^xsd:string;

## Methodology employed
   meth:Methodology "First, GHCN-D stations with minimal (less than 10%) missing precipitation data were identified. For each station the 99th percentile threshold of daily precipitation was determined using the entire period of record. Next, the total precipitation falling on days exceeding the 99th percentile threshold was calculated for each year. Then, grid box average values were calculated for each year, by averaging the values for each station available in that grid box. The annual values were then averaged for all grid boxes containing data in the Northwest region. Decadal averages were then calculated. Finally the 1901-1960 99th percentile average amount was subtracted from the decadal average amount, and a percentage change was calculated."^^xsd:string;

   a prov:Activity .

## The following entity was derived from a dataset using this activity
<http://data.globalchange.gov/image/0158fa86-481b-4a0b-8a79-4fd56b553cfd>
   a gcis:Image;
   prov:wasDerivedFrom <http://data.globalchange.gov/dataset/nca3-ghcn-daily-r201305>;
   prov:wasGeneratedBy <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process>.

zednis commented 9 years ago

I don't think this suggestion is at the point where we can use the change proposal template. I suggest we discuss and refine the issue first.

zednis commented 9 years ago

@justgo129 looking at the example instance RDF from this ticket, there are a number of issues that we should address.

1) dcterms:SizeOrDuration is a class and it is being used as a datatype property.
2) gcis:Software is a class and it is being used as a datatype property.
3) dcterms:InteractiveResource is a class and it is being used as a datatype property.
4) meth:Methodology is a class and it is being used as a datatype property.

In each case this is an incorrect usage of the class.

zednis commented 9 years ago

I have a feeling we can use PROV to guide our relationships here. We have already updated the instance data to use a PROV qualified association with the software agent and the methodology (as a plan). See #9.

zednis commented 9 years ago

Here is what the example instance data looks like after the recent updates to the RDF-generation templates:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dbpedia_owl: <http://dbpedia.org/ontology/> .
@prefix gcis: <http://data.globalchange.gov/gcis.owl#> .
@prefix meth: <http://sweet.jpl.nasa.gov/2.3/reprSciMethodology.owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process>
   dcterms:identifier "0158fa86-nca3-ghcn-daily-r201305-process";
   dcterms:description "Decadal average anomalies for the 99th percentile of precipitation (difference between the decade and the 1901-1960 average precipitation) for the Northwest region were plotted as a bar graph. Note: the far right bar contains data for 12 years (2001-2012)."^^xsd:string;

## The activity began and ended at the following times

## Duration of the activity
   dcterms:extent [ rdf:value "6 hours"^^xsd:string; ] ;

## Output datafiles   
   dbpedia_owl:filename "getprcpextremes99perc_May03_2013.f95\r\nprcpextremes99perc_1901_2012_May03_2013.txt \r\ngridaverage_99per_regions_May03_2013.f95\r\ngridprcpextremes99perc_regions_1901_2012_May03_2013.txt\r\ngridprecipextremes99perc_regions_1901_2012_May03_2013.csv\r\nperc99_decadal_barchart.pro\r\n99th_perc_anom_decadal_values_1901-2012.txt\r\nNW_99th_perc_anom_pct_1901_2012.eps\r\n2-17_nw.png\r\nCS_Extreme Heavy precipitation_v7.png"^^xsd:string;

## Computing environment
   dcterms:InteractiveResource "Linux (CentOS release 6.4); Mac OS X (darwin x86_64 m64)"^^xsd:string;

   prov:qualifiedAssociation [
      a prov:Association ;
      prov:agent [ 
         a prov:SoftwareAgent, gcis:Software ;
         rdfs:label "Fortran 95; lf95 compiler (Lahey/Fujitsu Linux64 Fortran compiler release L8.10b); IDL (version 8.0)"^^xsd:string ;
      ] ;
      prov:hadPlan [
         a prov:Plan, meth:Methodology ;
         rdf:value "First, GHCN-D stations with minimal (less than 10%) missing precipitation data were identified. For each station the 99th percentile threshold of daily precipitation was determined using the entire period of record. Next, the total precipitation falling on days exceeding the 99th percentile threshold was calculated for each year. Then, grid box average values were calculated for each year, by averaging the values for each station available in that grid box. The annual values were then averaged for all grid boxes containing data in the Northwest region. Decadal averages were then calculated. Finally the 1901-1960 99th percentile average amount was subtracted from the decadal average amount, and a percentage change was calculated."^^xsd:string;
      ] ;
   ] ;

   a prov:Activity .

## The following entity was derived from a dataset using this activity
<http://data.globalchange.gov/image/0158fa86-481b-4a0b-8a79-4fd56b553cfd>
   a gcis:Image;
   prov:wasDerivedFrom <http://data.globalchange.gov/dataset/nca3-ghcn-daily-r201305>;
   prov:wasGeneratedBy <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process>.

In this example we still need to look at the modeling of:

[ ] activity output files
[ ] computing environment
[ ] possible classes to use for the value of the dcterms:extent relationship
[ ] label on the software (is this more than 1 software instance?)

zednis commented 9 years ago

I think we should represent the output files as PROV entities and reference them using prov:generated.

Here is an example of doing that where we do not have to mint a URI for each output file.

<http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process>
  a prov:Activity ;
  prov:generated [
    a prov:Entity, gcis:File ;
    rdfs:label "getprcpextremes99perc_May03_2013.f95"^^xsd:string ;
  ],
  [
    a prov:Entity, gcis:File ;
    rdfs:label "prcpextremes99perc_1901_2012_May03_2013.txt"^^xsd:string ;
  ],
  [
    a prov:Entity, gcis:File ;
    rdfs:label "gridaverage_99per_regions_May03_2013.f95"^^xsd:string ;
  ],
  [
    a prov:Entity, gcis:File ;
    rdfs:label "gridprcpextremes99perc_regions_1901_2012_May03_2013.txt"^^xsd:string ;
  ],
  [
    a prov:Entity, gcis:File ;
    rdfs:label "gridprecipextremes99perc_regions_1901_2012_May03_2013.csv"^^xsd:string ;
  ],
  [
    a prov:Entity, gcis:File ;
    rdfs:label "perc99_decadal_barchart.pro"^^xsd:string ;
  ],
  [
    a prov:Entity, gcis:File ;
    rdfs:label "99th_perc_anom_decadal_values_1901-2012.txt"^^xsd:string ;
  ],
  [
    a prov:Entity, gcis:File ;
    rdfs:label "NW_99th_perc_anom_pct_1901_2012.eps"^^xsd:string ;
  ],
  [
    a prov:Entity, gcis:File ;
    rdfs:label "2-17_nw.png"^^xsd:string ;
  ],
  [
    a prov:Entity, gcis:File ;
    rdfs:label "CS_Extreme Heavy precipitation_v7.png"^^xsd:string ;
  ]  ;
.

and here is an example where we mint a URI for each output file:

<http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process>
  a prov:Activity ;
  prov:generated 
    <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/getprcpextremes99perc_May03_2013.f95> ,
    <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/prcpextremes99perc_1901_2012_May03_2013.txt> , 
    <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/gridaverage_99per_regions_May03_2013.f95> ,
    <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/gridprcpextremes99perc_regions_1901_2012_May03_2013.txt> ,
    <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/gridprecipextremes99perc_regions_1901_2012_May03_2013.csv> ,
    <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/perc99_decadal_barchart.pro> ,
    <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/99th_perc_anom_decadal_values_1901-2012.txt> ,
    <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/NW_99th_perc_anom_pct_1901_2012.eps> ,
    <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/2-17_nw.png> ,
    <http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/CS_Extreme_Heavy_precipitation_v7.png> .

<http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process/output/getprcpextremes99perc_May03_2013.f95>
  a prov:Entity, gcis:File ;
  rdfs:label "getprcpextremes99perc_May03_2013.f95"^^xsd:string .

# ...

justgo129 commented 9 years ago

I like this approach but it will also require creating a term for gcis:File since the ontology currently lacks it. Actually, doing so may inform the turtle for "files," e.g.: https://data.globalchange.gov/file/9bdd68ab-4b90-4ad9-b043-fedb7e564184.thtml

Note that we use fabio:ComputerFile here. If we go with the suggestion of @zednis could we relate the new gcis:File with fabio:ComputerFile to increase robustness and foster queries?

For the specific topic at hand, yes, I am fine with the first solution, since the filenames in the URI minting are not first-class objects.
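
The relation between gcis:File and fabio:ComputerFile raised in this comment could be sketched as a subclass axiom. This is only an illustration: gcis:File does not yet exist in the ontology, and choosing rdfs:subClassOf (rather than, say, owl:equivalentClass) is an assumption, not something the thread settles.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix fabio: <http://purl.org/spar/fabio/> .
@prefix gcis: <http://data.globalchange.gov/gcis.owl#> .

# Hypothetical term: gcis:File is not yet defined in the GCIS ontology.
gcis:File
   a owl:Class ;
   rdfs:label "File"@en ;
   # Declares every gcis:File to also be a fabio:ComputerFile.
   rdfs:subClassOf fabio:ComputerFile .
```

With an axiom like this, a query for fabio:ComputerFile would also return gcis:File instances, but only in a store that applies RDFS subclass inference.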

zednis commented 9 years ago

@justgo129 I think it would be fine to use fabio:ComputerFile directly in the instance data.

edit: so in my example replace gcis:File with fabio:ComputerFile.

bduggan commented 9 years ago

I think we do not want URIs in GCIS for these individual files. These files are actually often not available to us and we don't want to take on the role of long term curation of them. They are 'owned' by the organization/people that produced them (hi, @abuddenb), and I think storing coarse information that describes them (as provided by whoever generated them) is the best we can do.

I'm even a little hesitant to split the field up into multiple filenames due to the way we are collecting this information: this is a free form field, output_artifacts, that is not restricted to a carriage return-separated list of output files.

Converting this narrative free form list into structured information is a bigger problem; it's nice that there are some examples like this with lists of filenames but the issue of representing artifacts generated during data production is going to be more involved.

zednis commented 9 years ago

Then perhaps we should remove that information from the RDF.

bduggan commented 9 years ago

I suppose, unless there is a nice way to describe a collection of labels for output artifacts generated by an activity?

zednis commented 9 years ago

The preferred way to describe a collection of labels for output artifacts in PROV was the posted example.

edit: there are no datatype properties in PROV to reference entities as strings and there are no datatype properties to reference delimited collections of labels. If you want to record an input or output to an activity you make it an entity and in RDF that means it has to be a resource.

bduggan commented 9 years ago

That representation (the first one) looks good, but splitting the field up into filenames consistently (and identifying that the field contains filenames) may be a challenge (e.g. see this example, which has labels like "Data Files:" in the field, so we can't just split on \r\n).

zednis commented 9 years ago

Yeah, that was a problem with the example I used as well: one line had two files delimited by a space, and another line had a single file with a space in its filename.

If we want to proceed with these output files in the RDF going forward there may have to be updates to the form used to gather the data.

abuddenb commented 9 years ago

> I think we do not want URIs in GCIS for these individual files.

You do not. Those files are ephemeral; what's the point of assigning an identifier to an object that is likely inaccessible, outdated, or missing? On the other hand, if we could point to a commit or release in a version control system, that would be more durable, and it comes with its own identifier.

> Yeah, that was a problem with the example I used as well as there was one line with two files delimited by a space and a different file on another line that included a space in the filename.

Perhaps this data should be collected in a more structured manner?
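
The version-control idea in this comment might look roughly like the sketch below. The commit URL is a placeholder (no repository is named in this thread), and the choice of prov:used to link the activity to the code is an assumption.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .

<http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process>
   a prov:Activity ;
   # The activity used the processing code as it existed at one specific commit.
   prov:used <https://example.org/hypothetical-repo/commit/COMMIT-ID> .

# Placeholder URL: stands in for the identifier the version control system provides.
<https://example.org/hypothetical-repo/commit/COMMIT-ID>
   a prov:Entity ;
   rdfs:label "Processing scripts at a specific commit (placeholder)" .
```

Because the identifier comes from the version control system, GCIS would not have to mint or curate a URI for each file.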

bduggan commented 9 years ago

On Wednesday, June 17, Andrew Buddenberg wrote:

> Perhaps this data should be collected in a more structured manner?

Yes, agreed, with notions of input files, output files, programs, URLs, versions (commits?) etc.

The existing representation is much much better than nothing but is still basically a narrative blob; it is a decent first cut.

rewolfe commented 9 years ago

+1: Narrative blob

Let's leave this at a more general level and not try to capture too much detail. We already have trouble collecting activity information and don't need to make it more difficult.

zednis commented 9 years ago

In that case I would recommend a new datatype property in the gcis ontology such as gcis:descriptionOfOutputFiles. Make it clear this is narrative and a literal value that describes a collection of files.
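
A minimal sketch of what such a property could look like follows. Only the name gcis:descriptionOfOutputFiles comes from this comment; the label, comment text, domain, and range are assumptions made for illustration.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix gcis: <http://data.globalchange.gov/gcis.owl#> .

# Proposed term; the domain and range below are assumptions.
gcis:descriptionOfOutputFiles
   a owl:DatatypeProperty ;
   rdfs:label "description of output files"@en ;
   rdfs:comment "A free-form narrative describing the collection of files produced by an activity. The value is not a structured list."@en ;
   rdfs:domain prov:Activity ;
   rdfs:range xsd:string .

# Usage: the narrative field is carried over as a single literal, unsplit.
<http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process>
   a prov:Activity ;
   gcis:descriptionOfOutputFiles "getprcpextremes99perc_May03_2013.f95\r\nprcpextremes99perc_1901_2012_May03_2013.txt"^^xsd:string .
```

Keeping the value as one literal avoids having to decide where one filename ends and the next begins.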

zednis commented 9 years ago

For computing environment, I found the property dbpedia-owl:computingPlatform that I think is close to what we want to say - but it is an object property so we would need to use it with resources and not string values. Also, the domain of the property is dbpedia-owl:Software, so we would want to associate it not with the activity but with the software agent associated with the activity.

http://dbpedia.org/ontology/computingPlatform

edit: I think it would end up looking like this

<http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process>
   dcterms:identifier "0158fa86-nca3-ghcn-daily-r201305-process";
   dcterms:description "..."^^xsd:string;

   prov:qualifiedAssociation [
      a prov:Association ;
      prov:agent [ 
         a prov:SoftwareAgent, gcis:Software, dbpedia-owl:Software ;
         rdfs:label "IDL (version 8.0)"^^xsd:string ;
         dbpedia-owl:computingPlatform [ rdfs:label "Mac OS X (darwin x86_64 m64)"^^xsd:string; ] ;
      ] ;
      prov:hadPlan [
         a prov:Plan, meth:Methodology ;
         rdf:value "..."^^xsd:string;
      ] ;
   ] ;

   a prov:Activity .

Note: I removed the Fortran and Linux values from the agent and computing platform descriptions.

I think to reliably capture multiple agents and their associated computing platforms we will have to collect this information in a more structured manner.
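
If the information were collected per agent, the result might look like the sketch below, with one qualified association per software agent. The pairing of each agent with a platform is an assumption: the current free-form fields do not say which software ran on which system.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix gcis: <http://data.globalchange.gov/gcis.owl#> .

<http://data.globalchange.gov/activity/0158fa86-nca3-ghcn-daily-r201305-process>
   a prov:Activity ;
   prov:qualifiedAssociation [
      a prov:Association ;
      prov:agent [
         a prov:SoftwareAgent, gcis:Software, dbpedia-owl:Software ;
         rdfs:label "lf95 compiler (Lahey/Fujitsu Linux64 Fortran compiler release L8.10b)"^^xsd:string ;
         # Assumed pairing: the source data does not link this compiler to a platform.
         dbpedia-owl:computingPlatform [ rdfs:label "Linux (CentOS release 6.4)"^^xsd:string ] ;
      ] ;
   ] ,
   [
      a prov:Association ;
      prov:agent [
         a prov:SoftwareAgent, gcis:Software, dbpedia-owl:Software ;
         rdfs:label "IDL (version 8.0)"^^xsd:string ;
         # Pairing taken from the single-agent example in this comment.
         dbpedia-owl:computingPlatform [ rdfs:label "Mac OS X (darwin x86_64 m64)"^^xsd:string ] ;
      ] ;
   ] .
```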

bduggan commented 9 years ago

On Wednesday, June 17, Stephan Zednik wrote:

> In that case I would recommend a new datatype property in the gcis ontology such as gcis:descriptionOfOutputFiles. Make it clear this is narrative and a literal value that describes a collection of files.

Sounds good.

justgo129 commented 9 years ago

Closed #7 due to the merges listed above.

USGCRP / gcis-ontology
