dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

User friendly Linked Data for HTTPS identifiers #718

Open namedgraph opened 3 years ago

namedgraph commented 3 years ago

Issue validity

Live data on dbpedia.org.

Error Description

There is an http:// / https:// mismatch between the requested URIs and the URIs in the data.

Details

Originally reported here: https://sourceforge.net/p/dbpedia/mailman/message/37362683/

The server forces https:// URLs:

$ curl -I -H "Accept: text/turtle" http://dbpedia.org/resource/Copenhagen
HTTP/1.1 303 See Other
Server: nginx/1.18.0
Date: Thu, 07 Oct 2021 09:11:29 GMT
Content-Type: text/html
Content-Length: 153
Connection: keep-alive
Location: https://dbpedia.org/resource/Copenhagen
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: HEAD, GET, POST, OPTIONS
Access-Control-Allow-Headers: Depth,DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding

But the returned RDF data contains http:// URIs:

$ curl -o - https://dbpedia.org/data/Copenhagen.ttl
@prefix dbo:    <http://dbpedia.org/ontology/> .
@prefix dbr:    <http://dbpedia.org/resource/> .
<http://dbpedia.org/resource/2011\u201312_West_Ham_United_F.C._season>
 dbo:wikiPageWikiLink    dbr:Copenhagen .
<http://dbpedia.org/resource/AEK_Athens_F.C._in_European_football>
 dbo:wikiPageWikiLink    dbr:Copenhagen .
dbr:Adform      dbo:wikiPageWikiLink    dbr:Copenhagen .
dbr:Helena_Paparizou    dbo:wikiPageWikiLink    dbr:Copenhagen .
dbr:MS_Jutlandia        dbo:wikiPageWikiLink    dbr:Copenhagen .

Another example, this time requesting https://:

$ curl -L -OJ -H "Accept: text/turtle" https://dbpedia.org/resource/Copenhagen
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   153  100   153    0     0    725      0 --:--:-- --:--:-- --:--:--   725
100  675k  100  675k    0     0  1139k      0 --:--:-- --:--:-- --:--:-- 3235k
curl: Saved to filename 'sparql_2021-10-29_10-31-22Z.ttl'

$ cat sparql_2021-10-29_10-31-22Z.ttl
@prefix dbo:    <http://dbpedia.org/ontology/> .
@prefix dbr:    <http://dbpedia.org/resource/> .
dbr:Vivi_Bach   dbo:birthPlace  dbr:Copenhagen .
...

JJ-Author commented 3 years ago

Hi @namedgraph (nice user name, by the way), can you please elaborate on why you think the DBpedia Linked Data interface is broken? I consider the HTTPS URL of a resource to be just a special "generic document" that describes the URI of the non-information resource (NIR), aka the "resource ID". See the image below from Cool URIs. In other words, our resource IDs are non-HTTPS. HTTPS is just used as a (mandatory, though this could be discussed) security layer.

image

namedgraph commented 3 years ago

Linked Data is about self-describing resources. If http://dbpedia.org/resource/Copenhagen is requested, RDF data with http://dbpedia.org/resource/Copenhagen in the subject position (and possibly additional resource descriptions) should be returned. If https://dbpedia.org/resource/Copenhagen is requested, RDF data about https://dbpedia.org/resource/Copenhagen should be returned. http://dbpedia.org/resource/Copenhagen and https://dbpedia.org/resource/Copenhagen are two distinct resources in RDF since their URIs differ.

As my examples show, when http:// is requested, the server redirects to https:// but then returns data about http:// anyway. When https:// is requested, the data is still about http://.
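The expectation above can be written down as a small check. This is only a sketch: the function names are hypothetical, it works on plain URI strings that a client has already extracted from a parsed response (no RDF library or network access is assumed), and it treats URIs as equal only when the strings are identical, which is how RDF compares them.

```python
from urllib.parse import urlsplit


def is_self_describing(requested_uri, terms):
    """True if the requested URI occurs verbatim among the URIs in the data."""
    return requested_uri in terms


def scheme_mismatches(requested_uri, terms):
    """Return the URIs that equal the requested URI except for the scheme."""
    req = urlsplit(requested_uri)
    found = []
    for term in terms:
        parts = urlsplit(term)
        # Compare everything after the scheme: host, path, query, fragment.
        same_rest = tuple(parts[1:]) == tuple(req[1:])
        if same_rest and parts.scheme != req.scheme:
            found.append(term)
    return found


# URIs as they appear in the Turtle returned for Copenhagen (prefixes expanded).
terms_in_data = [
    "http://dbpedia.org/resource/Copenhagen",
    "http://dbpedia.org/resource/Adform",
]
requested = "https://dbpedia.org/resource/Copenhagen"

print(is_self_describing(requested, terms_in_data))
print(scheme_mismatches(requested, terms_in_data))
```

With the https:// request, the first check fails and the second one reports the http:// variant, which is the mismatch described in this issue.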

See the email thread for more details.

JJ-Author commented 3 years ago

@namedgraph there still seems to be a lot of confusion here.

From an RDF perspective, https://dbpedia.org/resource/Berlin does not exist as a resource. It is only the URL of the generic document that delivers the description (of http://dbpedia.org/resource/Berlin). So far, we don't use https-based RDF resource identifiers, for the simple reason you mentioned (string identity in RDF). So again, http://dbpedia.org/resource/ is the RDF namespace and https://dbpedia.org/resource/ is not an RDF namespace (these https URIs should never occur in any kind of RDF data, and therefore should never be looked up directly by any Linked Data client!). To be clearer, let's look again at the Alice example from above, which translates to the following.

http://dbpedia.org/resource/Berlin ~ http://www.example.com/id/alice
https://dbpedia.org/resource/Berlin ~ http://www.example.com/doc/alice
https://dbpedia.org/data/Berlin.ttl ~ http://www.example.com/doc/alice.rdf

I see that this might not be so clear at first glance, since both namespaces look very similar and are not as explicitly different as in the Cool URIs example.

Based on your email conversation and this GitHub issue, I understood the following problems / requests. But in the end, we need you to show what actual problems you have: which particular client breaks, and why.

C: But when looking at the redirect chain, I think I identified an actual problem: a fallback to http, which does not make sense to me (?). @pkleef @kurzum maybe this is what actually breaks clients. (I remember that if you download files with native Java from the databus/collections using the databus file identifiers, which use https, you can have a problem with redirects that point to non-https download locations, so the download URL is not https: https://github.com/dbpedia/dbpedia-databus-collection-downloader/commit/609102199ab4ebc3217ae05a71a08a3d8fd267e1)

http://dbpedia.org/resource/Berlin --[303]--> https://dbpedia.org/resource/Berlin --[303]--> http://dbpedia.org/data/Berlin.ttl --[303]--> https://dbpedia.org/data/Berlin.ttl

Fix option 1: https not enforced
http://dbpedia.org/resource/Berlin --[303]--> http://dbpedia.org/data/Berlin.ttl

Fix option 2: https enforced
http://dbpedia.org/resource/Berlin --[303]--> https://dbpedia.org/resource/Berlin --[303]--> https://dbpedia.org/data/Berlin.ttl

See https://github.com/dbpedia/extraction-framework/issues/722
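The fallback in the redirect chain above can be detected mechanically. The following is a rough sketch, not something DBpedia or the databus downloader ships: `find_downgrades` is a hypothetical helper, and the chains are hard-coded from the URLs listed in this comment rather than fetched from the server.

```python
from urllib.parse import urlsplit


def find_downgrades(chain):
    """Return (source, target) pairs where a redirect leaves https for http."""
    downgrades = []
    # Walk the chain pairwise: each URL together with the one it redirects to.
    for src, dst in zip(chain, chain[1:]):
        if urlsplit(src).scheme == "https" and urlsplit(dst).scheme == "http":
            downgrades.append((src, dst))
    return downgrades


# Redirect chain as described in this comment.
observed = [
    "http://dbpedia.org/resource/Berlin",
    "https://dbpedia.org/resource/Berlin",
    "http://dbpedia.org/data/Berlin.ttl",
    "https://dbpedia.org/data/Berlin.ttl",
]

# Chain proposed as fix option 2 (https enforced).
fix_option_2 = [
    "http://dbpedia.org/resource/Berlin",
    "https://dbpedia.org/resource/Berlin",
    "https://dbpedia.org/data/Berlin.ttl",
]

print(find_downgrades(observed))
print(find_downgrades(fix_option_2))
```

The observed chain contains one https-to-http hop (from the resource URL to the data URL), while the chain of fix option 2 contains none.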

namedgraph commented 2 years ago

So essentially DBpedia's http:// identifiers are canonical, and https:// should not be used and should only occur behind the scenes during the redirects?

jaygray0919 commented 2 years ago

We have also encountered variations on this issue. Browsers increasingly look deep into a web transaction. If the browser detects an http:// resource, it might get flagged (or blocked). This was true when using SPARQLer (recently upgraded to https://). However, we've seen instances of http:// endpoints in SPARQL queries fail when fetched using http://.

namedgraph commented 2 years ago

I think the easiest way to encounter this issue is just to grab the URL from the browser's address bar, which after the redirects is the https:// URL, and then use it somewhere else, like in a Linked Data browser.

You can rationalize that "this is not the canonical URL", but people just expect it to work.
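Until that works, a client could map a pasted URL back to the identifier used in the data. This is only a client-side sketch under the assumption stated in this thread that the http:// namespace is canonical; `to_canonical_id` and the two constants are hypothetical names, and only the /resource/ namespace is handled.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions taken from this thread: host and namespace of the resource IDs.
CANONICAL_HOST = "dbpedia.org"
RESOURCE_PREFIX = "/resource/"


def to_canonical_id(url):
    """Rewrite https://dbpedia.org/resource/... to the http:// identifier."""
    parts = urlsplit(url)
    if (
        parts.scheme == "https"
        and parts.netloc == CANONICAL_HOST
        and parts.path.startswith(RESOURCE_PREFIX)
    ):
        # Swap only the scheme; host, path, query and fragment stay as they are.
        return urlunsplit(("http",) + tuple(parts[1:]))
    # Anything else is returned unchanged.
    return url


print(to_canonical_id("https://dbpedia.org/resource/Copenhagen"))
```

A workaround like this has to be repeated in every client, which is why serving data that matches the requested URL would be more convenient.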

JJ-Author commented 2 years ago

I agree that the Linked Data and Semantic Web practices and standards are quite old, not easy to understand, and not always very user friendly. IMO they were not designed to be consumed by humans or for use cases like your copy-and-paste browser usage. DBpedia has existed since 2007, and the feature you request has a lot of pitfalls: it can break a lot of things, or make the identifiers even more confusing or just wrong in the future. (If you copy it from the browser, you get the ID of the HTML page, not of the entity. Sorry, but that is a semantic difference that has been in place for a very long time; it is not DAU friendly, I totally see that.) If a project starts from scratch now, it can just go with HTTPS-only identifiers, and then all this trouble is not an issue.

I tried it with Wikidata, and what you request does not seem to work there either, neither via SPARQL nor via Linked Data. GitHub also has a separate "raw" namespace for downloading files and separates file content from the HTML presentation of the file.

To move forward, I split the issue into the "actual" bug I discovered (https://github.com/dbpedia/extraction-framework/issues/722) and your feature request.

namedgraph commented 2 years ago

I wouldn't blame the Semantic Web for this, as RDF doesn't really care about http:// or https:// :)

I would attribute this to legacy conventions/technical debt. As you mentioned, the issue would be solved by making https:// canonical.

namedgraph commented 2 years ago

@JJ-Author another problem with http:// as canonical URIs is that they cannot be requested from a secure page.