emit triples only against Branded URLs

VladimirAlexiev commented 7 years ago

When you download http://data.crystalbridges.org/exhibition/10 or http://data.americanartcollaborative.org/data/cbm/exhibition/10, you get this:

<http://data.crystalbridges.org/exhibition/10> rdf:type crm:E5_Event ;
      crm:P1_is_identified_by <http://data.crystalbridges.org/exhibition/10/appellation> .

<http://data.americanartcollaborative.org/data/cbm/exhibition/10>
      rdfs:label "RDF description of " ;
      foaf:primaryTopic <http://data.crystalbridges.org/exhibition/10> .

Paraphrasing, this says: there's a business entity at data.crystalbridges.org, which is described by a document at data.americanartcollaborative.org.

Small problems:

add rdf:type foaf:Document to the doc
add a useful rdfs:label, or remove it altogether

The bigger problem is that the server redirects the business URL to the document URL. If you trace curl -ILH accept:application/rdf+xml http://data.crystalbridges.org/exhibition/10 or the more visual traceback in #3, you'll see the redirects, finishing with a sort of loop at the document URL. So the server treats the two URLs as the same thing.
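To make the redirect chain explicit, here is a minimal Python sketch that walks Location headers and stops when a URL repeats. It is only an illustration: `trace_redirects`, `get_location`, and the `FAKE_REDIRECTS` table are hypothetical names, and the table stands in for real HTTP responses rather than reproducing what the server actually returns.

```python
from typing import Callable, List, Optional
from urllib.parse import urljoin


def trace_redirects(
    start_url: str,
    get_location: Callable[[str], Optional[str]],
    max_hops: int = 10,
) -> List[str]:
    """Follow redirects from start_url and return every URL visited.

    get_location(url) returns the redirect target for url, or None when
    the response is not a redirect. The walk stops on a non-redirect, on
    a URL that was already visited (a loop), or after max_hops redirects.
    """
    chain = [start_url]
    seen = {start_url}
    current = start_url
    for _ in range(max_hops):
        target = get_location(current)
        if target is None:
            break
        # A Location header may be relative, so resolve it first.
        target = urljoin(current, target)
        chain.append(target)
        if target in seen:
            break  # the chain came back to a URL it already visited
        seen.add(target)
        current = target
    return chain


# Hypothetical redirect table (assumption, not captured server output).
BUSINESS = "http://data.crystalbridges.org/exhibition/10"
DOCUMENT = "http://data.americanartcollaborative.org/data/cbm/exhibition/10"
FAKE_REDIRECTS = {BUSINESS: DOCUMENT, DOCUMENT: DOCUMENT}

chain = trace_redirects(BUSINESS, FAKE_REDIRECTS.get)
looped = len(chain) != len(set(chain))
```

With a real `get_location` that issues HEAD requests, a repeated entry at the end of the chain would indicate the kind of loop described above.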

It's also a Branding issue:

Statements are made against museum-specific URLs, but are displayed as AAC URLs
I think the museums will be happier if the browser address bar shows their branded URL
Use the Apache proxy_http module (ProxyRequests ProxyPass ProxyPassReverse) to fix this. Eg see https://github.com/AKSW/Sparqlify#configuration

VladimirAlexiev commented 7 years ago

Looking at other entities (http://data.americanartcollaborative.org/page/cbm/object/197, http://data.americanartcollaborative.org/page/cbm/object/197/group_title, even http://data.americanartcollaborative.org/page/cbm/exhibition/11), I don't see a doc with foaf:primaryTopic. So I guess there are some leftover triples in http://data.crystalbridges.org/exhibition/10, left over from an abandoned design.

Nevertheless, the Branding issue remains

caknoblock commented 7 years ago

We are not doing anything with the exhibitions or bibliography data right now. We need to clean up the old models/triples.

VladimirAlexiev commented 7 years ago

What do exhibitions have to do with this issue?

VladimirAlexiev commented 7 years ago

The semantic data is sometimes recorded against straight (branded) URLs, eg

http://data.crystalbridges.org/object/108

and other times against AAC-ified URLs, eg

This is extremely confusing and makes the task of validation through http://review.americanartcollaborative.org very hard. @workergnome, how do you deal with this?

Permanent URLs should be well-designed and follow the same policy.

VladimirAlexiev commented 7 years ago

The last comment is related to but not the same as https://github.com/american-art/PUAM/issues/30

workergnome commented 7 years ago

I've been treating the URLs as opaque, and starting from a select ?id where {?id a crm:E22_Man-Made_Object} to get my initial list.

I didn't define URLs, both because I figured that those patterns should be defined by the museums and because I'm not sure what the limitations of Karma are. Again, I am ambivalent: I believe the URLs should be opaque, so I've been treating them like that.

(which is not to say that they're meaningless—I agree completely that there are problems with the URLs chosen, and entities that are the same should share URLs.)

VladimirAlexiev commented 7 years ago

URLs should be opaque in SPARQL, no doubt about it (eg no slicing of URLs should ever be needed).

But by not defining them, you've allowed students to make bad mistakes

"entities that are the same should share URLs"

Right: having separate title type for each instance of first name is crazy.

And also: entities that are different must have different URLs.

having the same URL for a title like "Flower" will mix title types of different objects

workergnome commented 7 years ago

See my comments in https://github.com/american-art/aac_mappings/issues/48: I agree these are problems, but I'm not sure I was (or am) the right person to specify URL patterns.

VladimirAlexiev commented 7 years ago

@caknoblock @workergnome About the "DNS/redirect" issue that's so heavily discussed right now:

Use the Apache proxy_http module (ProxyRequests ProxyPass ProxyPassReverse) to fix this. Eg see https://github.com/AKSW/Sparqlify#configuration

VladimirAlexiev commented 7 years ago

The above requires a front-end Apache. This http://wifo5-03.informatik.uni-mannheim.de/pubby/ mentions “when running Pubby behind an Apache proxy” so that should be possible.
Pubby runs on Tomcat, and from this page https://tomcat.apache.org/tomcat-6.0-doc/proxy-howto.html it seems that Tomcat can run behind an Apache proxy, so the same directives should be applicable.

VladimirAlexiev commented 7 years ago

Currently http://data.crystalbridges.org/object/108 redirects to http://data.americanartcollaborative.org/cbm/object/108 (you can see this with curl -Iv http://data.crystalbridges.org/object/108). This makes it difficult for museums to deploy since they need to mess with a web server.

My basic idea is as follows. I'm not even sure that playing with Apache proxy will be needed:

Each museum registers 54.69.252.89 (the IP of data.americanartcollaborative.org) in their DNS server, eg data.crystalbridges.org -> 54.69.252.89. Registering a DNS record is much easier than deploying a web server.
Only branded URLs are used in semantic data
When someone makes a request for a branded URL (eg http://data.crystalbridges.org/object/108), the request path is transmitted as /object/108 but there is also Host: data.crystalbridges.org so Pubby knows the full URL
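The idea in the last step can be sketched in Python as follows. This is an illustration of the principle only, not Pubby code: `branded_uri` and `select_dataset` are hypothetical helpers, and whether Pubby itself looks at the Host header is an assumption that would need checking.

```python
from typing import Iterable, Optional


def branded_uri(host_header: str, request_path: str, scheme: str = "http") -> str:
    """Rebuild the URI the client asked for from the Host header and path."""
    host = host_header.strip().lower()
    # Drop the default HTTP port so the URI matches the one in the data.
    if scheme == "http" and host.endswith(":80"):
        host = host[: -len(":80")]
    if not request_path.startswith("/"):
        request_path = "/" + request_path
    return f"{scheme}://{host}{request_path}"


def select_dataset(uri: str, dataset_bases: Iterable[str]) -> Optional[str]:
    """Return the longest dataset base that is a prefix of uri, if any."""
    matches = [base for base in dataset_bases if uri.startswith(base)]
    return max(matches, key=len) if matches else None


# Dataset bases as they would appear in conf:datasetBase.
BASES = ["http://data.crystalbridges.org/", "http://data.autry.edu/"]

uri = branded_uri("data.crystalbridges.org", "/object/108")
base = select_dataset(uri, BASES)
```

Because the same path /object/108 could exist at several museums, the hostname is the only thing that tells the datasets apart in this scheme.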

Put this in the Pubby config (see example). There must already be such a file; we're just adding multiple conf:dataset entries:

<> a conf:Configuration;
      conf:webBase <server_base_uri>;
      conf:dataset
            [conf:datasetBase <http://data.crystalbridges.org/>; conf:sparqlEndpoint <http://data.crystalbridges.org/sparql>],
            [conf:datasetBase <http://data.autry.edu/>;          conf:sparqlEndpoint <http://data.autry.edu/sparql>].

(or we could put the same http://data.americanartcollaborative.org/sparql in all conf:sparqlEndpoint, I don't think that'll make any difference)

cbutcosk commented 7 years ago

That specific configuration won't work out of the box because Pubby does not consider the hostname when it constructs the request URI AFAICT, so if object/40 is present in both datasets it will only return data from one. ISI's version switches on a reponame to finesse the dataset/redirection, so it knows cbm/object/40 should be queried as <http://data.crystalbridges.org/object/40>--@VladimirAlexiev's configuration will work with it, but will still require URL rewriting/proxying.
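A rough Python sketch of the repo-name switch, to show why the prefix removes the ambiguity. This is not ISI's actual code: `resolve_repo_path` is a hypothetical helper, and the "autry" repo name is an assumption added only so that two datasets share the path object/40.

```python
from typing import Dict, Optional


def resolve_repo_path(path: str, repo_to_base: Dict[str, str]) -> Optional[str]:
    """Map '<repo>/<rest>' to the branded URI '<base><rest>'.

    Returns None when the first path segment is not a known repo name.
    """
    repo, _, rest = path.lstrip("/").partition("/")
    base = repo_to_base.get(repo)
    if base is None:
        return None
    return base + rest


REPO_TO_BASE = {
    "cbm": "http://data.crystalbridges.org/",
    "autry": "http://data.autry.edu/",  # hypothetical repo name
}

cbm_uri = resolve_repo_path("cbm/object/40", REPO_TO_BASE)
autry_uri = resolve_repo_path("autry/object/40", REPO_TO_BASE)
```

The same local path object/40 resolves to two different branded URIs because the repo name carries the information that the bare path lacks.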

My proposal would be to use ISI's pubby version with a similar configuration, but move URL rewriting and redirection up into the ISI instance. Without ISI's pubby, you would need 14 instances (pubby is that naive). Likewise if we can't avoid having some Apache/nginx instance doing routing, it may as well be owned and maintained in the hosting environment.

Adding a "Thar be dragons" to the apache config header would be optional.

(EFC)

cbutcosk commented 7 years ago

I had some time to work up a spec configuration for Apache in https://github.com/ColbyMuseum/aac-url-rewrite. It's pretty lightweight, using an apache module for just this use case: a simple inbound hostname to proxy destination mapping from a text file.

Hostnames would have to be added there and in the Pubby instance's configuration after an institution registers the DNS of their branded hostname, but otherwise deployment and custom configuration are minimal.
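For readers who want to see the lookup logic without reading the Apache config, here is a small Python sketch of a hostname-to-destination map loaded from text. It assumes a simple "hostname destination" line format with # comment lines; the actual file format in the repo may differ, and `parse_host_map`, `proxy_target`, and the sample destination are illustrative only.

```python
from typing import Dict, Optional


def parse_host_map(text: str) -> Dict[str, str]:
    """Parse 'hostname destination' lines, skipping blanks and # comments."""
    mapping: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"expected 'hostname destination', got: {raw!r}")
        host, destination = parts
        mapping[host.lower()] = destination.rstrip("/")
    return mapping


def proxy_target(host: str, path: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the internal URL to proxy to, or None for an unknown host."""
    destination = mapping.get(host.lower())
    if destination is None:
        return None
    return destination + path


# Sample map (assumed format; the destination is illustrative).
SAMPLE_MAP = """
# branded hostname      proxy destination
data.crystalbridges.org http://data.americanartcollaborative.org/cbm
"""

host_map = parse_host_map(SAMPLE_MAP)
target = proxy_target("data.crystalbridges.org", "/object/108", host_map)
```

Adding an institution then means one new line in the map, which matches the "minimal custom configuration" goal.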

VladimirAlexiev commented 7 years ago

hi @cbutcosk great work! In that repo you mention "Each instituion still needs a multiURIMapping entry in the pubby configuration." Googled this and figured out it's ISI's addition to pubby:

https://github.com/american-art/pubby
https://github.com/american-art/aac-alignment/wiki/Workflow-setup-and-usage#configttl

The latter page also describes using proxy parameters. So looking at all this, my comments above are disjointed and imprecise... but I firmly believe it's possible to do it, without having 14 Pubby instances.

conf:multiURIMapping is in conf:dataset and Pubby supports multiple datasets, so it's a matter of passing the full request URL to it. And I think this is what your work does.

VladimirAlexiev commented 7 years ago

@caknoblock @cbutcosk @workergnome What's the status of this issue? Tested http://data.crystalbridges.org/object/108 and the URL (in the address bar) is still rewritten to non-branded. (This test value is from https://github.com/american-art/PUAM/issues/30)

american-art / semantic-hosting

emit triples only against Branded URLs #4