[HTML] Add Schema.org and other inline rdf support

danbri commented 2 years ago

A great many pages contain RDF data via Schema.org (in microdata, json-ld, rdfa). There are also other vocabularies which use those syntaxes. Does SPARQL Anything represent that data naturally, or could it be adapted to do so?
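
For readers unfamiliar with how this data is embedded, here is a minimal sketch (not part of SPARQL Anything) that pulls JSON-LD blocks out of an HTML string using only the Python standard library. The sample page and the names `JsonLdExtractor` and `extract_jsonld` are made up for illustration. It covers JSON-LD only, not microdata or RDFa, and it returns plain JSON objects rather than RDF triples.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed content of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script element opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text once the script element closes.
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._buffer)))


# Made-up page, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Movie", "name": "Dune"}
</script>
</head><body><p>Hello</p></body></html>
"""


def extract_jsonld(html):
    """Return a list with one parsed JSON value per JSON-LD script block."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


if __name__ == "__main__":
    print(extract_jsonld(SAMPLE_HTML))
```

Turning these objects into triples would still need a JSON-LD processor; the sketch only shows where the data sits in the page.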

enridaga commented 2 years ago

Currently, it is only generating an RDF-like view of the DOM tree.

In general, SA generates the main graph for the resource content (RDF-like view) and, in some cases, additional graphs for metadata (e.g. EXIF metadata for images).

In the case of HTML, SA could generate additional named graphs with extracted metadata. These should include:

RDFa
Microdata
Microformats
Others?

We could use http://any23.apache.org -- other ideas?

danbri commented 2 years ago

Thanks. You might look at https://github.com/wbsg-uni-mannheim/WDCFramework/blob/master/pom.xml since they extract these formats and seem to build upon any23.

Named graphs make sense to distinguish the different syntax sources.

UK Guardian newspaper pages are usually good if you want to find examples of json-ld and microdata in the same page. Or at least used to be.


luigi-asprino commented 2 years ago

This relates to #13

luigi-asprino commented 2 years ago

With dcc589e (https://github.com/SPARQL-Anything/sparql.anything/commit/dcc589e8cfffe681014ea883def4ab8b4b5481ab) SA is able to extract metadata from HTML pages. This feature relies on Any23. By default, Any23 extracts quads that use the URL of the page as the graph URI. Therefore, at the moment, the content extracted by SA and the content extracted by Any23 collapse into the same graph. The option to enable this feature is html.metadata=(true/false) (false by default). Of course, we can discuss the best way to serve the content extracted by Any23. This was just a tentative implementation of the feature.

danbri commented 2 years ago

That's fantastic - nice work!


enridaga commented 2 years ago

Graph names can be customized according to the running extractor. Will do a commit with partial work in this direction.
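
To make the idea concrete, here is a rough Python sketch of per-extractor graph naming. It is not how SA or Any23 implement it: the naming scheme (page URL plus a fragment), the extractor labels `jsonld` and `opengraph`, the placeholder URL, and the functions `graph_name` and `regroup` are all assumptions made for illustration.

```python
from collections import defaultdict

PAGE = "https://example.org/page"  # placeholder URL, not a real page

# (subject, predicate, object, graph, extractor).
# All quads share the page URL as graph, as described earlier in this thread.
# The extractor labels are invented for this sketch.
QUADS = [
    ("_:b0", "http://schema.org/alternateName", "Dune", PAGE, "jsonld"),
    ("_:b0", "http://schema.org/contentRating", "PG-13", PAGE, "jsonld"),
    (PAGE, "http://opengraphprotocol.org/schema/site_name", "IMDb", PAGE, "opengraph"),
]


def graph_name(page_url, extractor):
    """Hypothetical scheme: the page URL plus a fragment naming the extractor.

    This would need more care if the page URL already carries a fragment.
    """
    return f"{page_url}#{extractor}"


def regroup(quads):
    """Move each triple into a graph named after the extractor that produced it."""
    graphs = defaultdict(list)
    for s, p, o, g, extractor in quads:
        graphs[graph_name(g, extractor)].append((s, p, o))
    return dict(graphs)


if __name__ == "__main__":
    for name, triples in regroup(QUADS).items():
        print(name, len(triples))
```

With a scheme along these lines, a `GRAPH ?g` pattern would bind `?g` to a different name for each syntax found in the page.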

enridaga commented 2 years ago

Any23 should use the HTTP client of SA.

Any23.setHTTPClient

However, this means that we need to make a public method Triplifier.getHTTPClient, which we don't have at the moment.

enridaga commented 2 years ago

Any23 should use the HTTP client of SA.
Any23.setHTTPClient
However, this means that we need to make a public method Triplifier.getHTTPClient, which we don't have at the moment.

However, I would prefer to just pass an InputStream to Any23, really.

justin2004 commented 2 years ago

cool i do see the embedded json-ld (which uses schema.org) from IMDB now.

curl --silent 'http://localhost:3000/sparql.anything'  \
-H 'Accept: text/csv' \
--data-urlencode 'query=
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
select *
# construct {?s ?p ?o}
WHERE {
service <x-sparql-anything:>{
    fx:properties fx:location "https://www.imdb.com/title/tt1160419/" .
    fx:properties fx:media-type "text/html" .
    fx:properties fx:html.metadata "true" .
    graph ?g {?s ?p ?o .}
}
}'

yields:

s,p,o,g

...
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/url,https://www.imdb.com/title/tt1160419/,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/site_name,IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/title,Dune (2021) - IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/description,"Dune: Directed by Denis Villeneuve. With Timothée Chalamet, Rebecca Ferguson, Oscar Isaac, Jason Momoa. Feature adaptation of Frank Herbert's science fiction novel about the son of a noble family entrusted with the protection of the most valuable asset and most vital element in the galaxy.",https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/type,video.movie,https://www.imdb.com/title/tt1160419/
...

it would be nice if it was in a different named graph so i could easily tell if html had embedded RDF (by counting the number of distinct graphs).
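
A small Python sketch of that check, assuming the CSV result has the `s,p,o,g` header shown above. The function name `distinct_graphs` is made up, and the sample rows are copied from the output in this comment.

```python
import csv
import io


def distinct_graphs(csv_text):
    """Return the set of distinct values in the 'g' column of a CSV result."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["g"] for row in reader}


# Two rows taken from the output above; both share the page URL as graph.
SAMPLE = """s,p,o,g
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/site_name,IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/type,video.movie,https://www.imdb.com/title/tt1160419/
"""

if __name__ == "__main__":
    print(len(distinct_graphs(SAMPLE)))
```

With everything in a single graph the count stays at one for these rows, so the check only becomes informative once each extractor writes to its own named graph.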

justin2004 commented 2 years ago

oops i missed them in the snippet but they are there.

EDIT

here they are

s,p,o,g
_:b0,http://schema.org/actor,_:b1,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/actor,_:b2,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/actor,_:b3,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/aggregateRating,_:b4,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/alternateName,Dune,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/contentRating,PG-13,https://www.imdb.com/title/tt1160419/
...
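
While everything still lands in one graph, a possible workaround is to tally predicates by vocabulary namespace. The Python sketch below assumes the same `s,p,o,g` CSV layout. The names `namespace_of` and `predicate_namespaces` are made up, and splitting at the last `#` or `/` is only an approximation of a namespace.

```python
import csv
import io
from collections import Counter


def namespace_of(iri):
    """Approximate the vocabulary namespace by cutting after the last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1]


def predicate_namespaces(csv_text):
    """Count result rows per predicate namespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(namespace_of(row["p"]) for row in reader)


# Rows copied from the outputs shown in this thread.
SAMPLE = """s,p,o,g
_:b0,http://schema.org/actor,_:b1,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/alternateName,Dune,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/site_name,IMDb,https://www.imdb.com/title/tt1160419/
"""

if __name__ == "__main__":
    for namespace, count in predicate_namespaces(SAMPLE).items():
        print(namespace, count)
```

This tells vocabularies apart, not syntaxes: Schema.org terms expressed in microdata and in JSON-LD would fall into the same bucket.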

enridaga commented 2 years ago

it would be nice if it was in a different named graph so i could easily tell if html had embedded RDF (by counting the number of distinct graphs).

Yes, this is the plan

luigi-asprino commented 2 months ago

231fb35 includes a test for RDFa, which passes. 6de3a47 includes a test for microformats, which fails. I'm not familiar with microformats. So currently only microdata and RDFa are supported. Any23 should have extractors for microformats, but I couldn't get them to work.

SPARQL-Anything / sparql.anything

[HTML] Add Schema.org and other inline rdf support #164