USGCRP / gcis-ontology

Ontology for the Global Change Information System

generate report on NCA author ORCID prevalence #118

Closed zednis closed 8 years ago

zednis commented 9 years ago

"Find the most frequent authors of articles cited by the NCA3. Is a mashup possible using ORCIDs? What other endpoints have ORCIDs for people?"

justgo129 commented 9 years ago

As an aside, several people who share a name with an NCA3 author have ORCIDs but are not the same person as that author. See e.g. http://orcid.org/0000-0003-2869-9426, which refers to a Tom Karl who is not the NCA3 report editor. We may need to inspect these manually, but if you can think of a way to check for this automatically, that would be really cool. For now, though, I'd focus solely on the issue as described above.
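
One way to narrow down the manual inspection could look like the Python sketch below. It is only an illustration under assumptions: the `(name, orcid)` pair structure, the helper `flag_ambiguous_names`, and the identifiers are all made up for the example and are not GCIS or ORCID data. The idea is to group candidate records by normalized name and flag only the names that map to more than one identifier.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace so 'Tom  Karl' and 'tom karl' compare equal."""
    return " ".join(name.lower().split())


def flag_ambiguous_names(records):
    """Return {normalized name: sorted ORCIDs} for names tied to more than one ORCID.

    `records` is an iterable of (name, orcid) pairs. This structure is an
    assumption made for the sketch, not the GCIS data model.
    """
    by_name = defaultdict(set)
    for name, orcid in records:
        by_name[normalize(name)].add(orcid)
    return {n: sorted(ids) for n, ids in by_name.items() if len(ids) > 1}


# Placeholder identifiers, not real ORCIDs.
candidates = [
    ("Tom Karl", "0000-0000-0000-0001"),
    ("tom  karl", "0000-0000-0000-0002"),
    ("Jane Doe", "0000-0000-0000-0003"),
]
ambiguous = flag_ambiguous_names(candidates)
print(ambiguous)
```

Names that come back from such a check would still need a human decision; the sketch only reduces how many records have to be looked at.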

congruili commented 9 years ago

The following query is not related to "ORCID" yet:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
PREFIX cito: <http://purl.org/spar/cito/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>

# Count the distinct NCA3-cited articles attributed to each author.
SELECT
    (?author AS ?ContributorID)
    (str(?gn) AS ?GivenName)
    (str(?ln) AS ?LastName)
    (COUNT(DISTINCT ?article) AS ?Frequency)
FROM <http://data.globalchange.gov>
WHERE {
  <http://data.globalchange.gov/report/nca3> cito:cites ?article .
  ?article prov:qualifiedAttribution [ prov:agent ?author ] .
  ?author foaf:givenName ?gn .
  ?author foaf:lastName ?ln .
}
GROUP BY ?author ?gn ?ln
ORDER BY ?author

congruili commented 9 years ago

By the way, as was said to @zednis earlier, since no instances could be found using the property gcis:cites, it would be worth discussing whether or not we still keep it.

congruili commented 9 years ago

Would someone explicate how ORCIDs are currently being used in GCIS?

zednis commented 9 years ago

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX vivo: <http://vivoweb.org/ontology/core#>

SELECT * WHERE {
  ?person vivo:orcidId ?orcid
}

justgo129 commented 9 years ago

@lic10 when you get a chance could you please inform as to the status of resolving this ticket?

congruili commented 9 years ago

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
PREFIX cito: <http://purl.org/spar/cito/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX vivo: <http://vivoweb.org/ontology/core#>

# Same count as before, restricted to authors that have an ORCID.
SELECT
    (?author AS ?ContributorID)
    (str(?gn) AS ?GivenName)
    (str(?ln) AS ?LastName)
    ?orcid
    (COUNT(DISTINCT ?article) AS ?Frequency)
FROM <http://data.globalchange.gov>
WHERE {
  <http://data.globalchange.gov/report/nca3> cito:cites ?article .
  ?article prov:qualifiedAttribution [ prov:agent ?author ] .
  ?author foaf:givenName ?gn .
  ?author foaf:lastName ?ln .
  ?author vivo:orcidId ?orcid .
}
GROUP BY ?author ?gn ?ln ?orcid

I could find ORCIDs for only 4 authors using the above query.

zednis commented 8 years ago

This query returns 177 results. http://yasgui.org/short/EJfyyHIgx

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
PREFIX cito: <http://purl.org/spar/cito/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX vivo: <http://vivoweb.org/ontology/core#>

# Count articles cited by the report itself or by any of its chapters.
SELECT
    (?author AS ?ContributorID)
    (str(?gn) AS ?GivenName)
    (str(?ln) AS ?LastName)
    ?orcid
    (COUNT(DISTINCT ?article) AS ?Frequency)
FROM <http://data.globalchange.gov>
WHERE {
  { <http://data.globalchange.gov/report/nca3> cito:cites ?article . }
  UNION
  { <http://data.globalchange.gov/report/nca3> gcis:hasChapter ?chapter .
    ?chapter cito:cites ?article . }
  ?article prov:qualifiedAttribution [ prov:agent ?author ] .
  ?author vivo:orcidId ?orcid .
  OPTIONAL { ?author foaf:givenName ?gn . }
  OPTIONAL { ?author foaf:lastName ?ln . }
}
GROUP BY ?author ?gn ?ln ?orcid

rewolfe commented 8 years ago

Looks good!

Robert Wolfe, NASA GSFC @ USGCRP, o: 202-419-3470, m: 301-257-6966

justgo129 commented 8 years ago

:+1:

justgo129 commented 8 years ago

Job well done. Let's add this to the test suite / gcis-sparql repo, after which I will close this issue.

justgo129 commented 8 years ago

Closed #163 due to merged #163 , https://github.com/USGCRP/gcis-sparql/blob/master/ticket-118.csv, and https://github.com/USGCRP/gcis-sparql/blob/master/ticket-118.sparql.

justgo129 commented 8 years ago

I'm reopening #118 given the incorrect query results. The query and results can be found further up in this ticket and at: https://github.com/USGCRP/gcis-sparql/blob/master/118-ORCiD.sparql and https://github.com/USGCRP/gcis-sparql/blob/master/118-ORCiD.csv respectively.

The frequency numbers are incorrect in the .csv file. For instance, Dennis Lettenmaier is an author on far more than two articles cited in the NCA3. @zednis do you know why the query produces these outputs? Thanks.

justgo129 commented 8 years ago

On second thought, @zednis @rewolfe could this come back to the fact that line 1 in the turtle template for reports maxes out at 20, and as such not every item cited by the NCA3 gets entered into the triplestore? That would result in a vast undercount of articles associated with authors having ORCiDs.
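
To illustrate the suspected effect, here is a small Python sketch with synthetic data. It is not the GCIS template code: the function name, the `max_children` argument, and the article and author identifiers are assumptions chosen for the example. It counts distinct articles per author twice, once over the full citation list and once over a list truncated to the first 20 entries.

```python
from collections import defaultdict


def count_articles_per_author(citations, max_children=None):
    """Count distinct cited articles per author.

    `citations` is a list of (article_id, [author, ...]) pairs. When
    `max_children` is set, only the first `max_children` citations are
    kept, mimicking a template that truncates its output.
    """
    kept = citations if max_children is None else citations[:max_children]
    articles_by_author = defaultdict(set)
    for article_id, authors in kept:
        for author in authors:
            articles_by_author[author].add(article_id)
    return {author: len(ids) for author, ids in articles_by_author.items()}


# Synthetic data: 50 articles; "author-a" is on every fifth one.
citations = [
    (f"article-{i}", ["author-a"] if i % 5 == 0 else ["author-b"])
    for i in range(50)
]

full = count_articles_per_author(citations)
capped = count_articles_per_author(citations, max_children=20)
print(full["author-a"], capped["author-a"])
```

With the cap in place, every author's frequency is computed from the truncated list only, so the totals come out lower than the true counts.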

zednis commented 8 years ago

@justgo129 this is an issue with gcis-sparql query example, correct? Not a test in gcis-ontology?

justgo129 commented 8 years ago

Correct.

zednis commented 8 years ago

It's possible. If the turtle templates are being used to generate RDF imported into virtuoso then there should be no maxes in the templates.

justgo129 commented 8 years ago

I agree, but we don't want 3395 references to show up at: https://data.globalchange.gov/report/nca3.thtml

@rewolfe would removing the max returns and having 3395 outputs show up in the turtle for the NCA3 report adversely impact performance?

zednis commented 8 years ago

I think the templates need to have a parameter that tells them whether to use maxes or not when they are run. During RDF generation for the triplestore we do not use any maxes. For THTML generation for the website we use maxes.

justgo129 commented 8 years ago

@zednis I'm all for that. How easy would it be to create the parameter and test the code?

zednis commented 8 years ago

@rewolfe might be able to answer that question best. If not, I will take a look at it, but I am still pretty unfamiliar with some of the template infrastructure.

rewolfe commented 8 years ago

@justgo129 - They should all definitely be in the Turtle output when loading Virtuoso (I agree with Stephan). However, listing them all might cause a usability issue when someone is using a browser or API. Is there a convention for producing a reduced list vs. a complete list in Turtle? We already have the "?all=" flag for many lists. Maybe we need something similar for Turtle output (and/or the other representations).

rewolfe commented 8 years ago

Justin, This needs to be on our list. I guess that for now our Virtuoso store is incomplete.

zednis commented 8 years ago

@rewolfe I am not sure what you mean by a reduced list vs complete list in turtle; my guess though is no. If a list item is not in the turtle it is not in the RDF.

justgo129 commented 8 years ago

btw @zednis the turtle templates can be found inside many of the folders at: https://github.com/USGCRP/gcis/tree/master/lib/Tuba/files/templates . e.g. the one for book can be found in the "book" folder, that for chapters is in the chapter folder, and so on. The file names are always called object.ttl.tut.

zednis commented 8 years ago

@justgo129 I know where the templates are, I am just not familiar with what runs the templates and how parameters can be passed to them.

justgo129 commented 8 years ago

ah, got it

rewolfe commented 8 years ago

@zednis - I was hoping that there was a standard way of saying "...", "etc" or "this is not the complete list". That is, there are more items of the preceding type, but they are not included (for whatever reason).

zednis commented 8 years ago

@rewolfe nope, RDF is a data format. It does not include things like "..." that are for presentation purposes. RDF is also open-world so there is no need to say a list is complete / incomplete, as it is not assumed the list is complete.

rewolfe commented 8 years ago

@zednis - Got it, so I think using the "?all=1" flag is a good approach. I'll look at the code to see how easy this is to implement across the various rendered formats.

zednis commented 8 years ago

That might make sense in the THTML resource, but is definitely to be avoided in the actual TTL resource.

On that note, I think the tags at the bottom should be updated. The representation tag "Turtle" does not go to a TTL resource, but to an HTML page. It is a presentation of Turtle embedded in HTML and designed for people, but it is not a TTL file.

I think the "Turtle" tag should reference the .ttl file and we should add a "Turtle in HTML" tag (or similar) for the THTML. Or perhaps we get rid of THTML altogether?

justgo129 commented 8 years ago

@rewolfe I'll go ahead and edit the code at: https://github.com/USGCRP/gcis/edit/master/lib/Tuba/files/templates/prov.ttl.tut to remove the max_children. On the code at the aforementioned URL, what change should I make to line 12, as it invokes "max_children"?

Regarding the bottom tags, I'll defer to @rewolfe. I'd be fine with including tabs labeled "Turtle (hyperlinks)" or "Turtle (Raw)". We actually have both functionalities, as clicking on "raw" on a thtml page should bring one to the raw turtle file. @rewolfe and I, as did Brian, really like the thtml.

rewolfe commented 8 years ago

@justgo129 - we also need to make the "?all=1" flag change

justgo129 commented 8 years ago

@rewolfe, great, I committed the change to a new branch: https://github.com/USGCRP/gcis/commit/4896f5a9d0940e92aaaa9751b2ed3da73047e143 No pull request yet. Does my change make sense?

zednis commented 8 years ago

Does the ?all=1 query param get passed to the template? I believe that for this idea to work, a different set of turtle will need to be generated depending on whether it is THTML or TTL.

Since THTML is HTML with a <pre> (or similar) around the RDF, the absence of ?all=1 would have to cause a change in the embedded RDF, either by being passed to the template or by invoking a JS method that would change the RDF content.

zednis commented 8 years ago

Would it be too complicated to change the flag so that it specifies the cap? Instead of "?all=1", generating everything would be the default, and "?max=" would specify the maximum number of children.
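
A rough sketch of that behaviour, written in Python rather than in the Perl used by GCIS, is below. Everything in it is hypothetical: the helper names, the example.org URLs, and the choice to treat an empty, non-numeric or negative "max" as "no cap" are assumptions for illustration, not the existing implementation.

```python
from urllib.parse import parse_qs, urlsplit


def max_children_from_url(url):
    """Return the cap requested via '?max=N', or None when no cap is given.

    None means "emit everything", the proposed default. Empty, non-numeric
    or negative values are treated as "no cap" in this sketch.
    """
    values = parse_qs(urlsplit(url).query).get("max")
    if not values:
        return None
    try:
        cap = int(values[0])
    except ValueError:
        return None
    return cap if cap >= 0 else None


def select_children(children, cap):
    """Return all children when cap is None, otherwise the first `cap`."""
    items = list(children)
    return items if cap is None else items[:cap]


refs = [f"reference-{i}" for i in range(3395)]

# No parameter: everything is generated (what the triplestore load needs).
ttl_cap = max_children_from_url("https://example.org/report/nca3.ttl")
# Explicit cap: a shortened list for a human-facing page.
thtml_cap = max_children_from_url("https://example.org/report/nca3.thtml?max=20")

print(len(select_children(refs, ttl_cap)), len(select_children(refs, thtml_cap)))
```

Making "everything" the default means a forgotten parameter can only produce too much output, never an incomplete triplestore.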

justgo129 commented 8 years ago

@rewolfe what do you think?

rewolfe commented 8 years ago

I need to look at the details of this. Let's discuss this more after AGU.

justgo129 commented 8 years ago

Sure thing. @zednis is there a stop-gap solution in order to ensure a complete Virtuoso in time for ESIP?

justgo129 commented 8 years ago

Stupid me, I forgot we could just set max_contributors to exceed the number of references. Such hard-coding isn't a best practice but could be a nice placeholder in the meantime for the purpose of fully populating Virtuoso for accurate SPARQL query results.

justgo129 commented 8 years ago

I guess I should retract that statement. See lines 229-234 of the build in the pull request.

t/004_report.t ......... 16/? 
#   Failed test 'exact match for JSON Pointer ""'
#   at t/004_report.t line 42.
#     Structures begin differing at:
#          $got->{title} = 'Chapter one � two'
#     $expected->{title} = 'Chapter one ± two'

zednis commented 8 years ago

I will be honest, I am not sure how to interpret that error message. The title of a chapter is wrong?

justgo129 commented 8 years ago

Looks so, @zednis. I also changed max_children from 20 to 21 just to confirm that the issue isn't with an overloading of Virtuoso. The same error message appeared (see #234).

The test hasn't been updated since May, so an external dependency is probably responsible.

@zednis do you know which keystroke produces the � icon?

As such, it appears that we'll need to resolve this before pushing any code @rewolfe.

justgo129 commented 8 years ago

I spotted an ampersand embedded within a report title and replaced it with "and." It didn't help though.

zednis commented 8 years ago

Perhaps the template should ensure all text values in the RDF are UTF-8?

zednis commented 8 years ago

http://search.cpan.org/~jhi/perl-5.8.1/ext/Encode/Encode.pm

Let's update the script to encode the title string as UTF-8 and see if this error still occurs.
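
For reference, the symptom in the failing title comparison above can be reproduced with a charset mismatch. The sketch below uses Python rather than Perl's Encode module, and it only shows one mechanism that yields the same garbled character. Whether this is the actual cause of the failure in t/004_report.t has not been established in this thread.

```python
title = "Chapter one \u00b1 two"  # U+00B1 is the plus-minus sign

latin1_bytes = title.encode("latin-1")  # plus-minus becomes the single byte 0xB1
utf8_bytes = title.encode("utf-8")      # plus-minus becomes the two bytes 0xC2 0xB1

# Decoding Latin-1 bytes as UTF-8: a lone 0xB1 is not valid UTF-8,
# so it is replaced with U+FFFD, the replacement character.
garbled = latin1_bytes.decode("utf-8", errors="replace")

# Encoding and decoding with the same charset round-trips cleanly.
round_trip = utf8_bytes.decode("utf-8")

print(garbled == "Chapter one \ufffd two")
print(round_trip == title)
```

If the mismatch is the cause, making both the producer and the consumer of the title use UTF-8 explicitly should make the two strings compare equal again.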

rewolfe commented 8 years ago

It looks like it is failing because of some change in a dependency. I viewed the Travis logs and found the main configuration difference below. I'm not sure how this caused the problem.

I'd like to keep the non-UTF-8 string in the test (t/004_report.t) until we understand why this bug is occurring.


From test 1051.1 - failed, https://travis-ci.org/USGCRP/gcis/jobs/97720546, line 181:

$ cpan-outdated | cpanm

From test 1043.1 - passed, https://travis-ci.org/USGCRP/gcis/jobs/95423236, lines 183 to 194:

$ cpan-outdated | cpanm
--> Working on E/ET/ETHER/libwww-perl-6.15.tar.gz
Fetching http://www.cpan.org/authors/id/E/ET/ETHER/libwww-perl-6.15.tar.gz ... OK
Configuring libwww-perl-6.15 ... OK
Building libwww-perl-6.15 ... OK
Successfully installed libwww-perl-6.15
--> Working on S/SR/SRI/Mojolicious-6.35.tar.gz
Fetching http://www.cpan.org/authors/id/S/SR/SRI/Mojolicious-6.35.tar.gz ... OK
Configuring Mojolicious-6.35 ... OK
Building Mojolicious-6.35 ... OK
Successfully installed Mojolicious-6.35
2 distributions installed


justgo129 commented 8 years ago

@rewolfe sounds good. Which next steps should we take in rectifying the bug?

zednis commented 8 years ago

If I am reading the spec correctly, the charset of turtle should always be UTF-8

http://www.w3.org/TR/turtle/#sec-mediaReg

rewolfe commented 8 years ago

This is weird. I just looked at "t/003_lists.t" (https://github.com/USGCRP/gcis/blob/master/t/003_lists.t) and it also uses the "+/-" character and doesn't have a problem.
