USGCRP / gcis-ontology

Ontology for the Global Change Information System

Article->AcademicArticle #181

Closed justgo129 closed 8 years ago

justgo129 commented 8 years ago

Per #180.

zednis commented 8 years ago

:+1: I think we should verify that all instances of gcis:Article have been updated to gcis:AcademicArticle in the tuba templates in gcis as well.

justgo129 commented 8 years ago

Excellent idea, @zednis. I have just verified this. Merged #181.

justgo129 commented 8 years ago

This will hopefully solve the issue. @zednis why would the initial SPARQL query (with "AcademicArticle") still return 34 results? Ditto with fabio:Article?

rewolfe commented 8 years ago

Also, why doesn't the triple-store load script complain?

zednis commented 8 years ago

@rewolfe open world assumption. There is no constraint that a class URI has to be known before it is used.

rewolfe commented 8 years ago

@zednis - Understood. Is there a way to check whether all of the class URIs are valid after everything is loaded?

zednis commented 8 years ago

The question is how to determine if something is "valid"; Any URI is structurally valid as long as it is a valid URI - even if it is a typo and there is no pre-existing class with that URI.

We could check on the type of the URI: ensure that for each class used there is additional known information about that URI, and especially that things we want to use as classes have type owl:Class or rdfs:Class. That will not work with external vocabularies unless we have imported them into the triplestore.

edit - this is similar to the issue we had originally where instances and classes were being used as properties. One problem with our initial check of whether a class was being used as a property was that not all external vocabularies were loaded into the triplestore so the queries did not return the full account of where the issue was occurring.
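A minimal sketch of the check described above, written in Python over triples held as plain tuples. The tuple representation, the function name, and the sample resources under example.org are assumptions made for illustration; in practice the same comparison would run against the triplestore. Restricting the check to the gcis namespace sidesteps the problem of external vocabularies that have not been loaded.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CLASS_TYPES = {
    "http://www.w3.org/2002/07/owl#Class",
    "http://www.w3.org/2000/01/rdf-schema#Class",
}
GCIS = "http://data.globalchange.gov/gcis.owl#"


def undeclared_classes(triples, namespace=GCIS):
    """Return class URIs in `namespace` that are used as the object of
    rdf:type but are never themselves typed as owl:Class or rdfs:Class."""
    declared = {s for s, p, o in triples if p == RDF_TYPE and o in CLASS_TYPES}
    used = {
        o for s, p, o in triples
        if p == RDF_TYPE and o not in CLASS_TYPES and o.startswith(namespace)
    }
    return sorted(used - declared)


# Hypothetical sample data: the ontology declares AcademicArticle only,
# while one instance still carries the stale gcis:Article type.
sample = [
    (GCIS + "AcademicArticle", RDF_TYPE, "http://www.w3.org/2002/07/owl#Class"),
    ("http://example.org/article/1", RDF_TYPE, GCIS + "AcademicArticle"),
    ("http://example.org/article/2", RDF_TYPE, GCIS + "Article"),
]

print(undeclared_classes(sample))
```

With the sample data, only the stale gcis:Article URI is reported, since it is used as a type but never declared as a class.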

rewolfe commented 8 years ago

I think that we are first of all concerned with our own vocabulary (gcis).

I think that Brian uses "riot" to validate the vocabulary. See: https://github.com/USGCRP/gcis-ontology/blob/master/run-tests.sh#L44

Robert Wolfe, NASA GSFC @ USGCRP, o: 202-419-3470, m: 301-257-6966

zednis commented 8 years ago

RIOT is a jena command line utility for Input/Output of RDF and can be used to check the syntactic validity of an RDF input.

https://jena.apache.org/documentation/io/

riot --validate is the same as riot --strict --sink --check=true: the command uses strict IRI and literal syntax, checks for valid URIs and literals in the RDF (and stops processing if there is an error), and does not print a result RDF stream, just information on whether the input stream passed the check.

RIOT is very useful for checking that an RDF file has correct syntax and valid/legal URIs and literal values, but it will not check semantics or OWL profiles. It also does not perform reasoning, so it will not determine whether an input RDF file is consistent according to a chosen OWL profile.

justgo129 commented 8 years ago

To update the initial issue, the latest version of the Ontology (with an entry gcis:AcademicArticle) appears on dev, stage, and prod. However, even after reloading virtuoso on dev, test, and prod, the problem persists. On prod, gcis:Article returns the correct number of triples, even though it no longer exists in the ontology. gcis:AcademicArticle returns very few.

zednis commented 8 years ago

That sounds like a problem with the ingest. Have you checked that gcis:Article is no longer being used by any of the templates? Have you confirmed that the ingest process is running a current master of the ingest templates?

justgo129 commented 8 years ago

Yep, I sure have confirmed that gcis:Article is no longer being used. @rewolfe I've looked over the ingest templates file but can't locate the position of the ingest templates master.

https://github.com/USGCRP/gcis-rdf/blob/master/load_rdf_sources.pl

rewolfe commented 8 years ago

@justgo129 - I think the issue is that "a gcis:Article" is still being used in the Person template. See:

http://data-stage.globalchange.gov/person/2679.thtml

That is why only articles that do not have an author are correct.

justgo129 commented 8 years ago

Excellent find, I can't believe I missed that. In the aforementioned webpage, the contributors were ingested using line 56 of this template: https://github.com/USGCRP/gcis/blob/master/lib/Tuba/files/templates/prov.ttl.tut

It seems that we'll need to change the publication_type_identifier. However, changing that may change how articles are fed into gcis, i.e., we'll have a category called "AcademicArticle" instead of "Article" in the dropdown menus, searches, etc. I'm rather hesitant to go that route. What do you think?

Other places where publication_type_identifier can be found are listed here.

rewolfe commented 8 years ago

Do we need to specify the resource type here?

justgo129 commented 8 years ago

I'm fine with leaving it out. @zednis what do you think? We would lose the <subject> a <object> type statement there, but I haven't seen anything indicating that it's actually required.

zednis commented 8 years ago

I think it will be ok to remove line 56 from prov.ttl.tut. The type information for the publication resource should be declared elsewhere, correct?

justgo129 commented 8 years ago

@zednis, correct. In this specific example, the resource type is declared here

zednis commented 8 years ago

OK, I say comment that line out and re-run ingest.

rewolfe commented 8 years ago

@justgo129 I think you meant this template:

https://github.com/USGCRP/gcis/blob/master/lib/Tuba/files/templates/person/object.ttl.tut

this template also needs to be changed:

https://github.com/USGCRP/gcis/blob/master/lib/Tuba/files/templates/organization/contributors.ttl.tut

see:

https://github.com/USGCRP/gcis/search?utf8=%E2%9C%93&q=gcis+ucfirst+obj&type=Code

rewolfe commented 8 years ago

Okay, there are three instances that need to be changed. The one Justin identified and the two that I found.

rewolfe commented 8 years ago

This may be a 4th instance:

https://github.com/USGCRP/gcis/blob/master/lib/Tuba/files/templates/activity/object.ttl.tut

justgo129 commented 8 years ago

Sounds good. I've opened a pull request which is running Travis: https://github.com/USGCRP/gcis/pull/259

@zednis could you please take a look at the commits there and let me know whether commenting out the lines would create a gap in the turtle output? E.g., instead of:

<> a gcis:Article; something

would we get the following?

<>

something

justgo129 commented 8 years ago

Then, I'll rerun that pull request using %# in lieu of ## to denote comments.

justgo129 commented 8 years ago

Running the effects of #259 on dev, triplestore rebuilding. More info later.

rewolfe commented 8 years ago

@justgo129 - One last change to contributors (line 28) is needed.

https://github.com/USGCRP/gcis/blob/cc6205b3434aba8fbb69b6629cfd268345d44b41/lib/Tuba/files/templates/organization/contributors.ttl.tut#L28

justgo129 commented 8 years ago

Done in #260. Thanks for catching.

justgo129 commented 8 years ago

I just pushed the code (with the correct comment symbols) to dev, test, and prod, and subsequently did a content push and virtuoso rebuild on prod. The results of our SPARQL query on prod have not changed, though. Still only 27 returns for "AcademicArticle."

zednis commented 8 years ago

what about when you search for gcis:Article?

justgo129 commented 8 years ago

@zednis - 0 results.

zednis commented 8 years ago

ok, that's good. Now we just need to make sure all

example: http://data.globalchange.gov/article/10.1002/2014EF000255.thtml

<http://data.globalchange.gov/article/10.1002/2014EF000255>   
   dcterms:identifier "10.1002/2014EF000255";
   dcterms:title "Urbanization and the carbon cycle: Current capabilities and research outlook from the natural sciences perspective"^^xsd:string;
   dcterms:isPartOf <http://data.globalchange.gov/journal/earths-future>;
   bibo:volume "2";
   bibo:pages "473-495";
   dbpprop:pubYear "2014"^^xsd:gYear;
   bibo:doi "10.1002/2014EF000255";

   a gcis:AcademicArticle, fabio:Article .

# more ...

The following query should generate a result that includes all of the above information (and probably more from assertions made about this resource from other templates)

describe <http://data.globalchange.gov/article/10.1002/2014EF000255>

results:

@prefix ns0:    <http://purl.org/dc/terms/> .
@prefix ns1:    <http://data.globalchange.gov/journal/> .
ns1:earths-future   ns0:hasPart <http://data.globalchange.gov/article/10.1002/2014EF000255> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix ns3:    <http://data.globalchange.gov/report/> .
ns3:usgcrp-ocpfy2015    prov:wasDerivedFrom <http://data.globalchange.gov/article/10.1002/2014EF000255> .
<http://data.globalchange.gov/article/10.1002/2014EF000255> ns0:identifier  "10.1002/2014EF000255" .
@prefix ns4:    <http://data.globalchange.gov/report/usgcrp-ocpfy2015/chapter/> .
<http://data.globalchange.gov/article/10.1002/2014EF000255> prov:wasDerivedFrom ns4:federal-investments-in-global-change-research ;
    prov:qualifiedAttribution   _:vb2629926 ,
        _:vb2628704 ,
        _:vb2629927 ,
        _:vb2624040 ,
        _:vb2625918 ,
        _:vb2607182 ,
        _:vb2617456 ,
        _:vb2629919 ,
        _:vb2617457 ,
        _:vb2628945 ,
        _:vb2629439 ,
        _:vb2611856 ,
        _:vb2611857 ,
        _:vb2629437 ,
        _:vb2617454 ,
        _:vb2617452 ,
        _:vb2617453 ,
        _:vb2617948 ,
        _:vb2604751 ,
        _:vb2604749 ,
        _:vb2629922 ,
        _:vb2628986 .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
<http://data.globalchange.gov/article/10.1002/2014EF000255> ns0:title   "Urbanization and the carbon cycle: Current capabilities and research outlook from the natural sciences perspective"^^xsd:string ;
    ns0:isPartOf    ns1:earths-future .
@prefix bibo:   <http://purl.org/ontology/bibo/> .
<http://data.globalchange.gov/article/10.1002/2014EF000255> bibo:volume "2" .

It is missing some of the information: pages, pubYear, bibo:doi, type, ...

edit: so in summary, what is showing up in the THTML is not all ending up in the triplestore.

I really do not think what we see in the THTML is all being loaded into the triplestore during the ingest process.
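The comparison above can be sketched as a set difference in Python. The two sets are transcribed by hand from the Turtle view and the describe output in this comment, with the ns0 prefix normalized to dcterms and the Turtle keyword "a" written as rdf:type; the function name is an illustrative assumption.

```python
# Predicates emitted for the article in the THTML/Turtle view.
expected = {
    "dcterms:identifier", "dcterms:title", "dcterms:isPartOf",
    "bibo:volume", "bibo:pages", "dbpprop:pubYear", "bibo:doi", "rdf:type",
}

# Predicates the DESCRIBE result shows with the article as subject.
stored = {
    "dcterms:identifier", "dcterms:title", "dcterms:isPartOf",
    "bibo:volume", "prov:wasDerivedFrom", "prov:qualifiedAttribution",
}


def missing_predicates(expected, stored):
    """Predicates present in the template output but absent from the store."""
    return sorted(expected - stored)


print(missing_predicates(expected, stored))
```

For this article the difference is bibo:doi, bibo:pages, dbpprop:pubYear, and rdf:type, which matches the list of missing information noted above.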

justgo129 commented 8 years ago

@zednis that makes a lot of sense. @rewolfe do you see anything in the triplestore ingest file that may be worth revisiting? I don't see anything at first glance.

justgo129 commented 8 years ago

@zednis @rewolfe Could the answer be with revising line 16 at: https://github.com/USGCRP/gcis-rdf/blob/master/dropgraphs

to replace gcis-dev-front...nca3draft with data.globalchange.gov ?

justgo129 commented 8 years ago

Also, I just ran the "describe" command on one of the articles which pops up when querying gcis:AcademicArticle. From where does rdf:type ns4:Article come? Where is it being stored?

describe <http://data.globalchange.gov/article/10.3368/le.78.4.465>

@prefix cito:   <http://purl.org/spar/cito/> .
@prefix ns1:    <http://data.globalchange.gov/report/> .
ns1:nca3    cito:cites  <http://data.globalchange.gov/article/10.3368/le.78.4.465> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix gcis:   <http://data.globalchange.gov/gcis.owl#> .
<http://data.globalchange.gov/article/10.3368/le.78.4.465>  rdf:type    gcis:AcademicArticle .
@prefix ns4:    <http://purl.org/spar/fabio/> .
<http://data.globalchange.gov/article/10.3368/le.78.4.465>  rdf:type    ns4:Article .
@prefix ns5:    <http://data.globalchange.gov/report/nca3/chapter/> .
<http://data.globalchange.gov/article/10.3368/le.78.4.465>  cito:isCitedBy  ns5:decision-support ,
        ns1:nca3 .
@prefix ns6:    <http://purl.org/dc/terms/> .
<http://data.globalchange.gov/article/10.3368/le.78.4.465>  ns6:identifier  "10.3368/le.78.4.465" .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix dbpprop:    <http://dbpedia.org/property/> .
<http://data.globalchange.gov/article/10.3368/le.78.4.465>  dbpprop:pubYear "2002-01-01T00:00:00-07:00"^^xsd:gYear ;
    ns6:title   "The effects of open space on residential property values"^^xsd:string .
@prefix bibo:   <http://purl.org/ontology/bibo/> .
<http://data.globalchange.gov/article/10.3368/le.78.4.465>  bibo:doi    "10.3368/le.78.4.465" .
@prefix ns10:   <http://data.globalchange.gov/journal/> .
<http://data.globalchange.gov/article/10.3368/le.78.4.465>  ns6:isPartOf    ns10:land-economics .
@prefix biro:   <http://purl.org/spar/biro/> .
<http://data.globalchange.gov/article/10.3368/le.78.4.465>  biro:isReferencedBy     <http://data.globalchange.gov/reference/5c08b57a-767e-4fb2-825c-d335902c5a5e> .
ns10:land-economics ns6:hasPart <http://data.globalchange.gov/article/10.3368/le.78.4.465> .
ns5:decision-support    cito:cites  <http://data.globalchange.gov/article/10.3368/le.78.4.465> .
<http://data.globalchange.gov/reference/5c08b57a-767e-4fb2-825c-d335902c5a5e>       biro:references <http://data.globalchange.gov/article/10.3368/le.78.4.465> .

zednis commented 8 years ago

@justgo129 the prefix ns4 is for fabio, so that is the same as fabio:Article. The issue is that the binding of the prefix 'fabio' is not set for the query.

@prefix ns4:    <http://purl.org/spar/fabio/> .

zednis commented 8 years ago

@justgo129 that describe looks better than what I got with my earlier example. Have you made any changes to the ingest?

justgo129 commented 8 years ago

No changes whatsoever.

Regarding fabio, this results from:

PREFIX fabio: <http://purl.org/spar/fabio> 
describe <http://data.globalchange.gov/article/10.3368/le.78.4.465>

@prefix cito:   <http://purl.org/spar/cito/> .
@prefix ns1:    <http://data.globalchange.gov/report/> .
 ns1:nca3   cito:cites  <http://data.globalchange.gov/article/10.3368/le.78.4.465> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix gcis:   <http://data.globalchange.gov/gcis.owl#> .
<http://data.globalchange.gov/article/10.3368/le.78.4.465>  rdf:type    gcis:AcademicArticle .
@prefix ns4:    <http://purl.org/spar/fabio/> .
 <http://data.globalchange.gov/article/10.3368/le.78.4.465> rdf:type    ns4:Article .

zednis commented 8 years ago

@justgo129 you need to add the trailing slash to the fabio prefix binding

It should be PREFIX fabio: <http://purl.org/spar/fabio/>
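A small Python sketch of why the slash matters: a prefixed name is expanded by concatenating the namespace IRI with the local part, so a binding without the trailing slash names a different IRI than the one stored. The `expand` helper is a hypothetical illustration, not part of any SPARQL library.

```python
def expand(prefixed_name, prefixes):
    """Expand a prefixed name by concatenating the namespace IRI and the local part."""
    prefix, local = prefixed_name.split(":", 1)
    return prefixes[prefix] + local


# The type IRI actually stored for the article, per the describe output.
stored_type = "http://purl.org/spar/fabio/Article"

without_slash = expand("fabio:Article", {"fabio": "http://purl.org/spar/fabio"})
with_slash = expand("fabio:Article", {"fabio": "http://purl.org/spar/fabio/"})

print(without_slash)              # the IRI the slash-less binding produces
print(with_slash == stored_type)  # whether the corrected binding matches
```

Without the slash the expansion is http://purl.org/spar/fabioArticle, which does not match the stored type, so the declared prefix cannot be used to abbreviate it and the output falls back to a generated prefix such as ns4.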

justgo129 commented 8 years ago

d'oh!

justgo129 commented 8 years ago

so gcis:AcademicArticle appears here where contributors are listed but not here where there are no contributors listed.

See line 22 of the article turtle template. If contributors aren't listed, i.e., nothing for contributors.ttl.tut, then gcis:AcademicArticle in the aforementioned article turtle template disappears. Does anything look out of place on the contributors.ttl.tut template?

zednis commented 8 years ago

@justgo129 So are we picking up the type information from the template that asserts the contributors?

justgo129 commented 8 years ago

Maybe. Could you please rephrase the question? I don't see "type" in either template.

justgo129 commented 8 years ago

By the way, I'm not sure the issue is with the "contributors" template, since SPARQL queries are returning the correct number of reports, and the report template also contains a "%include = 'contributors'" line.

zednis commented 8 years ago

@justgo129 the rdf:type assertions. Why do some publications have them, and others do not?

select * FROM <http://data.globalchange.gov> where { <http://data.globalchange.gov/article/10.1002/crq.3890180403> a ?type }

shows 0 results
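To repeat this check for other articles, the query can be generated per resource. The sketch below, in Python, only builds the query text and a GET URL in the style of the SPARQL Protocol (query passed as the `query` parameter); the endpoint URL is a placeholder assumption and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: a SPARQL endpoint reachable over HTTP GET; this URL is a placeholder.
ENDPOINT = "http://example.org/sparql"
GRAPH = "http://data.globalchange.gov"


def type_query(resource, graph=GRAPH):
    """Return a SELECT query listing every rdf:type asserted for `resource`."""
    return f"select ?type FROM <{graph}> where {{ <{resource}> a ?type }}"


def request_url(resource, endpoint=ENDPOINT):
    """Encode the query as the `query` parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": type_query(resource)})


article = "http://data.globalchange.gov/article/10.1002/crq.3890180403"
url = request_url(article)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

An empty result for a given article means no rdf:type statement for it reached the graph, which is the symptom described here.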

justgo129 commented 8 years ago

Hmm, where would I find the encoding of ?type ? https://github.com/justgo129/gcis/search?utf8=%E2%9C%93&q=rdf%3Atype&type=Code

zednis commented 8 years ago

https://github.com/USGCRP/gcis/blob/master/lib/Tuba/files/templates/article/object.ttl.tut

it seems like this template was not called for <http://data.globalchange.gov/article/10.1002/crq.3890180403>

justgo129 commented 8 years ago

Sorry, @zednis , I'm a little confused. http://data.globalchange.gov/article/10.1002/crq.3890180403 , like every other article, should be invoked using the template https://github.com/USGCRP/gcis/blob/master/lib/Tuba/files/templates/article/object.ttl.tut i.e., all the articles should be invoked by the same template. How would I find out which template is invoking the other articles?

zednis commented 8 years ago

That template should be called, but something is going wrong since RDF statements that template should be creating are not ending up in the triplestore for some articles.

The template

% layout 'default', namespaces => [qw/dcterms xsd bibo dbpprop gcis fabio cito biro/];
%= filter_lines_with empty_predicate() => begin
%#
<<%= current_resource %>>   
   dcterms:identifier "<%= $article->identifier %>";
   dcterms:title "<%= $article->title %>"^^xsd:string;
   dcterms:isPartOf <<%= uri($article->journal) %>>;
   bibo:volume "<%= $article->journal_vol %>";
   bibo:pages "<%= $article->journal_pages %>";
   dbpprop:pubYear "<%= $article->year %>"^^xsd:gYear;
% if ($article->doi) {
   bibo:doi "<%= $article->doi %>";
% } else {
   gcis:hasURL "<%= $article->url %>"^^xsd:anyURI;
% }

   a gcis:AcademicArticle, fabio:Article . # <-- HERE
% end

should establish that each article has the following as values for rdf:type: fabio:Article and gcis:AcademicArticle and yet, for the URI I linked above, there are 0 known rdf:type values for the resource.

So something must be going wrong during the ingest.

justgo129 commented 8 years ago

Great. @rewolfe is there a way to check the ingest log, or the code for ingests? Is it any of the files called earlier in this thread or are there others?

rewolfe commented 8 years ago

In the Virtuoso load:

https://github.com/USGCRP/gcis-rdf/blob/master/Virtuoso.pm

the function DB.DBA.TTLP_MT

http://docs.openlinksw.com/virtuoso/fn_ttlp_mt.html

is called. Notice that the flags argument is set to 255, which means that it ignores most (all?) load errors.
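Since the flags value is a bitmask, a quick Python sketch shows what 255 amounts to. The helper name is an assumption for illustration, and the sketch makes no claim about what each individual bit relaxes; that should be checked against the Virtuoso documentation for the function.

```python
def set_bits(flags):
    """Return the individual power-of-two flag values that are set in `flags`."""
    return [1 << i for i in range(flags.bit_length()) if flags & (1 << i)]


print(set_bits(255))
print(bin(255))
```

The value 255 has all eight low-order bits set, so every relaxation controlled by those bits is switched on at once. Trying a load with the flags set to 0 on dev might be one way to surface the errors that are currently being skipped.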