USGCRP / gcis-ontology

Ontology for the Global Change Information System

Generate report comparing launch dates in GCIS vs dbpedia for same platforms #117

Closed zednis closed 9 years ago

zednis commented 9 years ago

Compare the launch dates of platforms in GCIS (i.e. from CEOS) to launch dates from dbpedia.

zednis commented 9 years ago

Query to get launch dates for platforms from GCIS. Next I will update the query to compare with launch dates from dbpedia.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX vivo: <http://vivoweb.org/ontology/core#>
PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?s ?launchDate ?deactivated
FROM <http://data.globalchange.gov>
WHERE {
  ?s a gcis:Platform .
  OPTIONAL { ?s dbpprop:deactivated ?deactivated }
  OPTIONAL { ?s dbpprop:launchDate ?launchDate }
} ORDER BY ?launchDate
bduggan commented 9 years ago

Great, thanks, I'm adding this to the (newly created) gcis-sparql repo:

https://github.com/USGCRP/gcis-sparql

I'll send an email (outside this ticket) about this repo.

Brian

zednis commented 9 years ago

I have updated my query to select the dbpedia URI for the matching instance. I will then be able to use a federated query to retrieve the launch date for the platform from dbpedia.

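PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbp: <http://dbpedia.org/property/>

SELECT ?s ?match ?launchDate ?deactivated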
WHERE {
  SERVICE <https://data.globalchange.gov/sparql> {
    ?s a gcis:Platform .
    ?s skos:exactMatch ?match .
    ?match skos:inScheme <http://data.globalchange.gov/lexicon/dbpedia> .
    OPTIONAL { ?s dbp:deactivated ?deactivated }
    OPTIONAL { ?s dbp:launchDate ?launchDate }
  }
} ORDER BY ?launchDate

I have run into an issue: I am unable to retrieve the skos:inScheme value for the lexicon concept from the triplestore (the query above returns 0 results).

see http://data.globalchange.gov/lexicon/dbpedia.thtml for example in REST API.

This statement is generated by the representation.ttl.tut template.

@bduggan is it possible RDF from this template is not being included in the triplestore load?

zednis commented 9 years ago

I have updated my query to use owl:sameAs and a regex filter.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX vivo: <http://vivoweb.org/ontology/core#>
PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX db: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT
?platform_gcis
?platform_dbpedia 
?launchDate_gcis
?launchDate_dbpedia 
?cospar_dbpedia
WHERE {
  FILTER(str(?match) = str(?platform_dbpedia))
  SERVICE <https://data.globalchange.gov/sparql> {
    ?platform_gcis a gcis:Platform .
    ?platform_gcis owl:sameAs ?match .
    ?platform_gcis dbp:launchDate ?launchDate_gcis .
    # str() is needed because regex() expects a string, and ?match is an IRI
    FILTER regex(str(?match), "dbpedia.org", "i")
  }
  SERVICE <http://dbpedia.org/sparql> {
    ?platform_dbpedia dbp:launchDate ?launchDate_dbpedia .
    OPTIONAL { ?platform_dbpedia dbp:cosparId ?cospar_dbpedia }
  }
} LIMIT 20

Without a limit on the result set size, the query times out.
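One possible way around the timeout, offered as an assumption and not something tried in this ticket, is to drop the federation: fetch the owl:sameAs matches from GCIS first, then query dbpedia directly in small batches using a SPARQL 1.1 VALUES clause. The helper names `build_dbpedia_query` and `batches`, and the batch size, are hypothetical.

```python
def build_dbpedia_query(uris):
    """Build a dbpedia query restricted to the given resource URIs via VALUES."""
    values = " ".join(f"<{u}>" for u in uris)
    return (
        "PREFIX dbp: <http://dbpedia.org/property/>\n"
        "SELECT ?platform_dbpedia ?launchDate_dbpedia ?cospar_dbpedia\n"
        "WHERE {\n"
        f"  VALUES ?platform_dbpedia {{ {values} }}\n"
        "  ?platform_dbpedia dbp:launchDate ?launchDate_dbpedia .\n"
        "  OPTIONAL { ?platform_dbpedia dbp:cosparId ?cospar_dbpedia }\n"
        "}"
    )


def batches(items, size=25):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# URIs taken from the owl:sameAs matches returned by the GCIS endpoint.
uris = [
    "http://dbpedia.org/resource/Landsat_7",
    "http://dbpedia.org/resource/GOES_2",
]
for batch in batches(uris, size=25):
    print(build_dbpedia_query(batch))
```

Each generated query can then be sent to http://dbpedia.org/sparql on its own, and the rows joined back to the GCIS results on the dbpedia URI.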

The launch dates in dbpedia generally seem to be missing the year component. Looking at example RDF on dbpedia, the dbp:cosparId property is often (but not always) used to contain the launch year.

results:

"platform_gcis" "platform_dbpedia" "launchDate_gcis" "launchDate_dbpedia" "cospar_dbpedia"
"http://data.globalchange.gov/platform/aqua" "http://dbpedia.org/resource/Aqua_(satellite)" "2002-05-04T00:00:00-06:00" "--05-04" "2002"
"http://data.globalchange.gov/platform/suomi-national-polar-orbiting-partnership" "http://dbpedia.org/resource/Suomi_NPP" "2011-10-28T00:00:00-06:00" "--10-28" "2011"
"http://data.globalchange.gov/platform/geostationary-operational-environmental-satellite-2" "http://dbpedia.org/resource/GOES_2" "1984-02-26T00:00:00-06:00" "--06-16" "1977"
"http://data.globalchange.gov/platform/uk-disaster-monitoring-constellation-2" "http://dbpedia.org/resource/UK-DMC_2" "2009-07-29T00:00:00-06:00" "--07-29" "2009"
"http://data.globalchange.gov/platform/vnredsat-1" "http://dbpedia.org/resource/VNREDSat_1A" "2013-05-07T00:00:00-06:00" "--05-07" "2013"
"http://data.globalchange.gov/platform/quick-scatterometer" "http://dbpedia.org/resource/QuikSCAT" "1999-06-19T00:00:00-06:00" "--06-19" "1999"
"http://data.globalchange.gov/platform/communication-oceanographic-meteorological-satellite" "http://dbpedia.org/resource/Chollian" "2010-06-26T00:00:00-06:00" "--06-26" "2010"
"http://data.globalchange.gov/platform/automatic-identification-system-satellite-1" "http://dbpedia.org/resource/AISSat-1" "2010-07-12T00:00:00-06:00" "--07-12" "2010"
"http://data.globalchange.gov/platform/resource-satellite-2" "http://dbpedia.org/resource/Resourcesat-2" "2011-04-20T00:00:00-06:00" "2011-04-20" "2011"
"http://data.globalchange.gov/platform/geostationary-operational-environmental-satellite-11" "http://dbpedia.org/resource/GOES_11" "2000-05-03T00:00:00-06:00" "--05-03" "2000"
"http://data.globalchange.gov/platform/geostationary-operational-environmental-satellite-6" "http://dbpedia.org/resource/GOES_6" "1984-02-26T00:00:00-06:00" "--04-28" "1983"
"http://data.globalchange.gov/platform/bilsat-research-satellite" "http://dbpedia.org/resource/BILSAT-1" "2003-09-01T00:00:00-06:00" "--09-27" "2003"
"http://data.globalchange.gov/platform/advanced-land-observing-satellite" "http://dbpedia.org/resource/Advanced_Land_Observation_Satellite" "2006-01-24T00:00:00-06:00" "--01-24" "2006"
"http://data.globalchange.gov/platform/landsat-7" "http://dbpedia.org/resource/Landsat_7" "1999-04-15T00:00:00-06:00" "--04-15" "1999"
"http://data.globalchange.gov/platform/oceansat-1" "http://dbpedia.org/resource/Oceansat-1" "1999-05-26T00:00:00-06:00" "1999-05-26"
"http://data.globalchange.gov/platform/odin" "http://dbpedia.org/resource/Odin_(satellite)" "2001-02-20T00:00:00-06:00" "--02-20" "2001"
"http://data.globalchange.gov/platform/earths-magnetic-field-and-environment-explorers" "http://dbpedia.org/resource/Swarm_(spacecraft)" "2013-11-22T00:00:00-06:00" "--11-22" "SWARM A: 2013-067B"
"http://data.globalchange.gov/platform/earths-magnetic-field-and-environment-explorers" "http://dbpedia.org/resource/Swarm_(spacecraft)" "2013-11-22T00:00:00-06:00" "--11-22" "SWARM B: 2013-067A"
"http://data.globalchange.gov/platform/earths-magnetic-field-and-environment-explorers" "http://dbpedia.org/resource/Swarm_(spacecraft)" "2013-11-22T00:00:00-06:00" "--11-22" "SWARM C: 2013-067C"
"http://data.globalchange.gov/platform/earths-magnetic-field-and-environment-explorers" "http://dbpedia.org/resource/Swarm_(spacecraft)" "2013-11-22T00:00:00-06:00" "43349.0" "SWARM A: 2013-067B"
bduggan commented 9 years ago

On Tuesday, August 25, Stephan Zednik wrote:

see http://data.globalchange.gov/lexicon/dbpedia.thtml for example in REST API.

This statement is generated by the representation.ttl.tut template.

@bduggan is it possible RDF from this template is not being included in the triplestore load?

Yes, fixed.

I'm re-running the import; it should be updated in 30 minutes or so.

Brian

bduggan commented 9 years ago

A nice refinement would be to show only entries for which the dates differ. For example, why is the GOES-2 launch listed as 1984 in one system and 1977 in another?

zednis commented 9 years ago

@bduggan that might be a bit hard to do in the query, since we would have to potentially (but not always) combine and reformat the ?launchDate_dbpedia and ?cospar_dbpedia variables into a date. I think it would probably be easier to do that analysis in a spreadsheet, where you can use some simple parsing logic to attempt to process dbpedia's inconsistent dates.
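That kind of simple parsing logic could equally be a few lines of Python. A minimal sketch, assuming the values look like the results listed earlier in this ticket; the function name `dates_differ` is hypothetical, and it compares only the components dbpedia actually supplies (month and day for `--MM-DD` values, the full date otherwise).

```python
import re


def dates_differ(gcis_value, dbpedia_value):
    """Return True or False when comparable, None when the dbpedia value can't be parsed."""
    gcis_date = gcis_value[:10]  # "2002-05-04T00:00:00-06:00" -> "2002-05-04"
    if re.fullmatch(r"--\d{2}-\d{2}", dbpedia_value):
        # xsd:gMonthDay: only month and day are available to compare
        return gcis_date[5:] != dbpedia_value[2:]
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", dbpedia_value):
        return gcis_date != dbpedia_value
    return None


rows = [
    ("aqua", "2002-05-04T00:00:00-06:00", "--05-04"),
    ("geostationary-operational-environmental-satellite-2", "1984-02-26T00:00:00-06:00", "--06-16"),
    ("resource-satellite-2", "2011-04-20T00:00:00-06:00", "2011-04-20"),
]
for name, gcis_value, dbpedia_value in rows:
    # Keep rows that differ, and also rows that could not be compared at all.
    if dates_differ(gcis_value, dbpedia_value) is not False:
        print(name, gcis_value, dbpedia_value)
```

With these three sample rows only the GOES-2 entry is printed. The year is not compared for `--MM-DD` values, since it would have to come from ?cospar_dbpedia.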

Also, I am attempting to update the gcis-sparql files for this report but the federated query is frequently timing out.

zednis commented 9 years ago

@justgo129 @bduggan query added to gcis-sparql. Is this ticket ready to be closed?

justgo129 commented 9 years ago

I just took a look. Is there a way to standardize the date formatting in the output?

zednis commented 9 years ago

It would be far easier to apply some post-processing to the query results to fix the dates than to add that logic to the query. The dbpedia RDF uses inconsistent literal types for the launch date values, and updating the query to standardize the formatting would make the query much more complicated and probably make the timeout issue worse. Additionally, dbpedia frequently splits the year out of the launch date and encodes the month and day of the launch using xsd:gMonthDay (which I have never seen used in RDF before). Standardizing the dates in the query would therefore mean extracting the appropriate date components from ?launchDate_dbpedia and ?cospar_dbpedia (with checks because of the data inconsistency) and building a new date serialization using string concatenation.

This could perhaps be done in the query, but it would make it very ugly and probably slower.

It would be much easier to do this as post-processing on the CSV using Perl or Python.
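As a rough illustration of that post-processing, the sketch below rebuilds a full date from a dbpedia launch date and a launch year. It assumes the year has already been obtained separately (for example from ?cospar_dbpedia); the function name `to_iso_date` is hypothetical, and anything it cannot interpret is returned as None instead of being guessed.

```python
import re
from datetime import date


def to_iso_date(launch_date, year=None):
    """Return YYYY-MM-DD, or None if the inputs don't form a valid date."""
    full = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", launch_date)
    if full:
        # Already a complete date; the separately supplied year is ignored.
        y, m, d = (int(part) for part in full.groups())
    else:
        partial = re.fullmatch(r"--(\d{2})-(\d{2})", launch_date)  # xsd:gMonthDay
        if partial is None or year is None:
            return None
        y = year
        m, d = (int(part) for part in partial.groups())
    try:
        return date(y, m, d).isoformat()
    except ValueError:  # e.g. month 13, or day 31 in a 30-day month
        return None


print(to_iso_date("--05-04", 2002))   # 2002-05-04
print(to_iso_date("2011-04-20"))      # 2011-04-20
print(to_iso_date("43349.0", 2013))   # None
```

Returning None for unrecognized values such as "43349.0" keeps those rows visible for manual review.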

justgo129 commented 9 years ago

Works for me. Could we at least get rid of the "cospar_dbpedia" entries beginning with "SWARM A", or would that also be a post-processing candidate? I'd think we could at least strip out the text within the SPARQL query. After that, feel free to repush and close.

zednis commented 9 years ago

I am not sure we should strip out values from ?cospar_dbpedia. That property is not explicitly for the year of the launch but for the COSPAR ID. It seems that the year is often (but not always) part of the COSPAR ID. I would keep the COSPAR ID intact and leave the logic of parsing it and extracting relevant year information (if any) to post-processing.
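A minimal sketch of that extraction step, assuming the values look like the ones in the results earlier in this ticket (a bare year such as "2002", or text containing an identifier such as "SWARM A: 2013-067B"). The function name `cospar_year` is hypothetical, and the original string is left untouched.

```python
import re

# A four-digit year starting with 19 or 20, not embedded in a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")


def cospar_year(cospar_id):
    """Return the first plausible launch year found in the value, or None."""
    if not cospar_id:
        return None
    match = YEAR_PATTERN.search(cospar_id)
    return int(match.group(1)) if match else None


print(cospar_year("2002"))                # 2002
print(cospar_year("SWARM A: 2013-067B"))  # 2013
print(cospar_year(""))                    # None
```

Because the function only reads the value, the report can keep the full COSPAR ID column alongside any derived year.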

justgo129 commented 9 years ago

@rewolfe are you all right with the suggestion of @zednis?

rewolfe commented 9 years ago

@justgo129 - yes, I think that post-processing is the best approach. It looks like it is pretty difficult to parse dates in SPARQL.

On Wed, Sep 2, 2015 at 9:38 PM, justgo129 notifications@github.com wrote:

@rewolfe are you all right with the suggestion of @zednis?

Robert Wolfe, NASA GSFC @ USGCRP, o: 202-419-3470, m: 301-257-6966

justgo129 commented 9 years ago

Thanks, @rewolfe. As the query has been added to gcis-sparql, I declare #117 to be closed.