USGCRP / gcis-ontology

Ontology for the Global Change Information System

generate author report #113

Closed zednis closed 8 years ago

zednis commented 9 years ago

"Generate a list of all NCA3 authors, sortable by chapter and organization. Each author can only be mentioned once but nonetheless all chapters authored be stated and captured."

This is the request that Bryce [NCO Colleague] had received a while back.

zednis commented 9 years ago

@justgo129 Should I include the author's role? (editor, lead author, author)

justgo129 commented 9 years ago

That would be great. So technically we'd be asking for "contributors" rather than "authors."

zednis commented 9 years ago

@bduggan @justgo129 This should be doable in a SPARQL query, but the Virtuoso instance that appears to be running the endpoint does not seem to fully support GROUP_CONCAT().

From the footer on http://data.globalchange.gov/sparql, it appears we are running Virtuoso version 06.01.3127 (from 2011?).

Would it be possible to explore upgrading the version of virtuoso we are using as our endpoint? If not, I can run a query that does not do the desired grouping to generate a CSV and then write a script to post-process the CSV.

justgo129 commented 9 years ago

@zednis would upgrading the version of Virtuoso also provide the ability to query a class and get instances of its subclasses? For instance, a query for "a prov:Entity" doesn't return platforms, instruments, etc., which are defined as subclasses of prov:Entity within the gcis ontology. In short, I'm wondering whether upgrading would allow us to "kill two birds with one stone."

zednis commented 9 years ago

To support the subclass query you describe we need to utilize RDFS or OWL inference.

We should look at virtuoso to see if there is an option to enable query-time (e.g. backward-chaining) RDFS inference. That may be available on the version of virtuoso you are running or a newer version.

First, let's confirm the version of virtuoso we are using and then we can see where we stand on these two features.

edit - if no versions of virtuoso support RDFS inference (I have not checked yet) we could always use jena or pellet to run the inference during the ingest process before it is imported into virtuoso. Then we would be able to answer the subclass query you mention.
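As a rough illustration of what ingest-time (forward-chaining) subclass inference would do, here is a minimal Python sketch. It is not Jena or Pellet, and it uses plain tuples instead of a real RDF library. The class names `gcis:Instrument` and `gcis:Platform` and the `ex:` instances are assumptions made up for the example.

```python
def materialize_types(subclass_of, types):
    """Return the type assertions plus every superclass type implied by subClassOf."""
    # Map each class to its direct superclasses.
    parents = {}
    for child, parent in subclass_of:
        parents.setdefault(child, set()).add(parent)

    def ancestors(cls):
        # Walk up the hierarchy; the seen set also guards against cycles.
        seen = set()
        stack = [cls]
        while stack:
            current = stack.pop()
            for parent in parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    inferred = set(types)
    for instance, cls in types:
        for ancestor in ancestors(cls):
            inferred.add((instance, ancestor))
    return inferred


# Hypothetical ontology fragment and instance data (names are assumptions).
SUBCLASS_OF = [
    ("gcis:Instrument", "prov:Entity"),
    ("gcis:Platform", "prov:Entity"),
]
TYPES = [
    ("ex:instrument-1", "gcis:Instrument"),
    ("ex:platform-1", "gcis:Platform"),
]

for instance, cls in sorted(materialize_types(SUBCLASS_OF, TYPES)):
    print(instance, "a", cls)
```

Once the inferred `prov:Entity` assertions are loaded alongside the original ones, a plain "a prov:Entity" query would match the instruments and platforms without any query-time reasoning.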

zednis commented 9 years ago

@justgo129 It looks like virtuoso supports rdfs:subClassOf and rdfs:subPropertyOf inferences (and a few others). We should be able to enable it with some configuration changes.

http://docs.openlinksw.com/virtuoso/rdfsparqlrule.html

bduggan commented 9 years ago

On Wednesday, August 5, Stephan Zednik wrote:

Would it be possible to explore upgrading the version of virtuoso we are using as our endpoint?

Yes, but probably not for a while. Another issue with this version (or possibly just the configuration) is that federated queries don't work. I find virtuoso to be very cumbersome to maintain and configure and would be fine moving to another triple store. I have heard good things about blazegraph (for instance, that they are being used by wikidata) so maybe that is an option.

If not, I can run a query that does not do the desired grouping to generate a CSV and then write a script to post-process the CSV.

If you could just help write sparql to get the data I think that's probably enough -- post processing could even just be done in excel.

Brian

zednis commented 9 years ago

ok, I think it would make sense to create a new ticket or email thread around SPARQL endpoint issues so we can keep track of functionality we are having trouble getting to work and discussions of possible solutions (upgrading, change endpoint, etc)

zednis commented 9 years ago

Here is a query that gets basic NCA3 chapter contributor information. It does not group the chapters for each contributor into a single value because I was unable to get SPARQL's GROUP_CONCAT to work correctly with the endpoint. (I was able to get a weird non-standard form of sql:GROUP_CONCAT to somewhat work, but it included duplicates)

Also, I am currently returning role information as well. We may want to consider how role information will affect the original request of mentioning each author/contributor only once.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
PREFIX cito: <http://purl.org/spar/cito/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>

# One row per contributor, role, and chapter; chapters are not concatenated.
SELECT DISTINCT
    (?author AS ?ContributorID)
    (str(?gn) AS ?GivenName)
    (str(?ln) AS ?LastName)
    ?role
    (str(?cht) AS ?ChapterName)
FROM <http://data.globalchange.gov>
WHERE {
  <http://data.globalchange.gov/report/nca3> gcis:hasChapter ?chapter .
  ?chapter dcterms:title ?cht .
  ?chapter prov:qualifiedAttribution [ prov:hadRole ?role ; prov:agent ?author ] .
  ?author foaf:givenName ?gn .
  ?author foaf:lastName ?ln
}
ORDER BY ?author

justgo129 commented 9 years ago

Thanks, @zednis. I pasted the query into the GCIS (https://data-stage.globalchange.gov/sparql) but got no outputs other than column names though. Do you know why?

zednis commented 9 years ago

@justgo129 I do not.

Try it in http://data.globalchange.gov/sparql or with yasgui.org

bduggan commented 9 years ago

On Thursday, August 6, justgo129 wrote:

Thanks, @zednis. I pasted the query into the GCIS (https://data-stage.globalchange.gov/sparql) but got no outputs other than column names though. Do you know why?

That non-public endpoint has not been updated for some time.

Brian

zednis commented 9 years ago

What are the requirements for closing this ticket?

justgo129 commented 9 years ago

The generation of the query will enable the closing of this ticket. The outputs look good, but I nonetheless see a few author duplicates (e.g. Paul Fleming), even though the rows are from different chapters. I'd be happy to spin this off into another issue because I still see the value in the output produced by the code above.

On Tue, Aug 18, 2015 at 1:46 PM, Stephan Zednik notifications@github.com wrote:

What are the requirements for closing this ticket?

bduggan commented 9 years ago

I would like to see a pull request that puts a sparql query like this into the test suite.

zednis commented 9 years ago

@justgo129 I don't think we can provide a report with no author duplicates unless we combine chapters into a single delimited value in cases where authors have contributor relationships with more than 1 chapter.

Alternatively, we could provide the output as JSON or a similar data structure.
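To make the "single delimited value" option concrete, here is a minimal Python sketch of post-processing the CSV returned by the contributor query. It assumes the CSV header uses the query's column names (`ContributorID`, `GivenName`, `LastName`, `role`, `ChapterName`). The sample rows, the `example.org` URIs, and the `group_chapters` helper are hypothetical. The role column is dropped here, which is one possible answer to the question of how roles interact with the "mention each contributor once" requirement, not a settled decision.

```python
import csv
import io


def group_chapters(csv_text, delimiter="; "):
    """Collapse one-row-per-(contributor, chapter) CSV into one row per contributor."""
    reader = csv.DictReader(io.StringIO(csv_text))
    grouped = {}  # insertion-ordered, keyed by contributor URI
    for row in reader:
        key = row["ContributorID"]
        entry = grouped.setdefault(key, {
            "ContributorID": key,
            "GivenName": row["GivenName"],
            "LastName": row["LastName"],
            "chapters": [],
        })
        # Skip repeats so a contributor with two roles on one chapter lists it once.
        if row["ChapterName"] not in entry["chapters"]:
            entry["chapters"].append(row["ChapterName"])
    return [
        {
            "ContributorID": e["ContributorID"],
            "GivenName": e["GivenName"],
            "LastName": e["LastName"],
            "Chapters": delimiter.join(e["chapters"]),
        }
        for e in grouped.values()
    ]


# Hypothetical sample data, not real query output.
SAMPLE = """ContributorID,GivenName,LastName,role,ChapterName
http://example.org/person/1,Ada,Example,http://example.org/role/author,Chapter A
http://example.org/person/1,Ada,Example,http://example.org/role/lead_author,Chapter B
http://example.org/person/2,Bo,Sample,http://example.org/role/author,Chapter A
http://example.org/person/1,Ada,Example,http://example.org/role/editor,Chapter A
"""

for row in group_chapters(SAMPLE):
    print(row["GivenName"], row["LastName"], "->", row["Chapters"])
```

The returned list of dictionaries can also be passed to `json.dumps` if a JSON report is preferred over a delimited column.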

justgo129 commented 9 years ago

Great, @zednis. How about you or @xgmachina prepare the pull request to place this entry into the test suite, after which I'll close the ticket.

zednis commented 9 years ago

@justgo129 I have added this query to the test suite in gcis-sparql. Is this ticket ready to be closed?

justgo129 commented 9 years ago

Assuming it works and provides the correct output (I haven't had a chance to test), yes.

bduggan commented 9 years ago

There are three test suites with SPARQL queries:

  1. acceptance tests (gcis-sparql)
  2. ontology unit tests (in this repository)
  3. gcis unit tests with sparql (t/011_sparql.t)

Adding to the acceptance tests (1) is great, and these may become examples for end users. These tests may have external dependencies (e.g. dbpedia), so may fail sometimes. Also the results may vary depending on the data.

The other two are run automatically by travis-ci -- adding to them is helpful because these give us regression tests, and guaranteed functionality.

At least some of these SPARQL queries, or versions of them, should be added to (2) and (3).

[edit] added sentence about data

zednis commented 9 years ago

@bduggan do you think that federated queries should be excluded from (2) and (3) since they have external dependencies?

bduggan commented 9 years ago

On Tuesday, September 8, Stephan Zednik wrote:

@bduggan do you think that federated queries should be excluded from (2) and (3) since they have external dependencies?

Yes.

Brian

justgo129 commented 8 years ago

@zednis I tested the code found at https://github.com/USGCRP/gcis-sparql/blob/master/ticket-113.sparql at data.globalchange.gov/SPARQL. That returned the following error: "Virtuoso 37000 Error SP031: SPARQL compiler: Variable 'Name' is used in the query result set but not assigned"

As the query does seem to work in yasgui, should I ignore that error? http://yasgui.org/short/NJh3AfcZe

zednis commented 8 years ago

@justgo129 interesting. Take the ?Name out of the group by clause and the query should work.

justgo129 commented 8 years ago

It sure does. The updated query is at: http://yasgui.org/short/EkCG5Gobg. @zednis how would I order the ChapterNumber values to go in order of 1, 2, 3, etc. instead of 1, 10, 11, ...2,21, ...?

zednis commented 8 years ago

Don't convert chapter number to a string.
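The ordering difference is the usual string-versus-number comparison: strings are compared character by character, so "10" sorts before "2". A small Python sketch with made-up chapter numbers shows the same effect outside SPARQL:

```python
# Hypothetical chapter numbers as they look after being cast to strings.
chapter_numbers = ["1", "10", "11", "2", "21", "3"]

# Lexicographic order: compares character by character.
as_strings = sorted(chapter_numbers)

# Numeric order: compares the integer values.
as_integers = sorted(chapter_numbers, key=int)

print(as_strings)
print(as_integers)
```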

justgo129 commented 8 years ago

Worked, and added to: https://github.com/USGCRP/gcis-sparql/ https://github.com/USGCRP/gcis-ontology/tree/justgo129-patch-1/t/results (will merge the latter soon)

justgo129 commented 8 years ago

@zednis do you agree that this should go into the test suite?

zednis commented 8 years ago

I don't think it needs to be in the gcis-ontology tests; it would be OK as a test in gcis (to test RDF templates) or gcis-sparql (as an example).

zednis commented 8 years ago

The reason I don't think it should be in GCIS-ontology is that the only class or property referenced in the query is gcis:hasChapter, so it is really a query on how we construct instance data using primarily non-GCIS properties.

If we want a test covering gcis:hasChapter in GCIS-ontology tests we should go with something much simpler.

justgo129 commented 8 years ago

Sounds good; I'll just close #113 since this has been added to gcis-sparql. @rewolfe will that disrupt any of your ongoing work?

justgo129 commented 8 years ago

Closed #113.