Open danbri opened 2 years ago
Currently, it is only generating an RDF-like view of the DOM tree.
In general, SA generates the main graph for the resource content (RDF-like view) and, in some cases, additional graphs for metadata (e.g. EXIF metadata for images).
In the case of HTML, SA could generate additional named graphs with extracted metadata. These should include:
- RDFa
- Microdata
- Microformats
- Others?
We could use http://any23.apache.org -- other ideas?
Thanks. You might look at https://github.com/wbsg-uni-mannheim/WDCFramework/blob/master/pom.xml since they extract these formats and seem to build upon any23.
Named graphs make sense to distinguish the different syntax sources.
UK Guardian newspaper pages are usually good if you want to find examples of json-ld and microdata in the same page. Or at least used to be.
This relates to #13
With dcc589e (https://github.com/SPARQL-Anything/sparql.anything/commit/dcc589e8cfffe681014ea883def4ab8b4b5481ab) SA is able to extract metadata from HTML pages. This feature relies on Any23. By default, Any23 extracts quads having the URL of the page as graph URI. Therefore, at the moment, the content extracted by SA and the content extracted by Any23 end up in the same graph. The option to enable this feature is html.metadata=(true/false) (false by default). Of course, we can discuss which is the best way to serve Any23 extracted content. This was just a tentative implementation of the feature.
That's fantastic - nice work!
Graph names can be customized according to the running extractor. Will do a commit with partial work in this direction.
Any23 should use the HTTP client of SA (Any23.setHTTPClient). However, this means that we need to make a public method Triplifier.getHTTPClient, which we don't have at the moment.
That said, I would prefer to just pass an InputStream to Any23, really.
cool i do see the embedded json-ld (which uses schema.org) from IMDB now.
curl --silent 'http://localhost:3000/sparql.anything' \
  -H 'Accept: text/csv' \
  --data-urlencode 'query=
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
select *
# construct {?s ?p ?o}
WHERE {
  service <x-sparql-anything:> {
    fx:properties fx:location "https://www.imdb.com/title/tt1160419/" .
    fx:properties fx:media-type "text/html" .
    fx:properties fx:html.metadata "true" .
    graph ?g { ?s ?p ?o . }
  }
}'
yields:
s,p,o,g
...
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/url,https://www.imdb.com/title/tt1160419/,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/site_name,IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/title,Dune (2021) - IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/description,"Dune: Directed by Denis Villeneuve. With Timothée Chalamet, Rebecca Ferguson, Oscar Isaac, Jason Momoa. Feature adaptation of Frank Herbert's science fiction novel about the son of a noble family entrusted with the protection of the most valuable asset and most vital element in the galaxy.",https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/type,video.movie,https://www.imdb.com/title/tt1160419/
...
it would be nice if it was in a different named graph so i could easily tell if html had embedded RDF (by counting the number of distinct graphs).
Oops, I missed the schema.org triples in the snippet, but they are there.
EDIT: here they are:
s,p,o,g
_:b0,http://schema.org/actor,_:b1,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/actor,_:b2,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/actor,_:b3,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/aggregateRating,_:b4,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/alternateName,Dune,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/contentRating,PG-13,https://www.imdb.com/title/tt1160419/
...
Yes, putting the extracted metadata in a different named graph is the plan.
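For reference, the "count the distinct graphs" check mentioned above could be sketched roughly as follows. This is only an illustration in Python, assuming the CSV response has the s,p,o,g columns shown in the output above; the helper name distinct_graphs is made up for this example and is not part of SA. The two sample rows are copied from the results earlier in the thread.

```python
import csv
import io
from typing import Set


def distinct_graphs(csv_text: str) -> Set[str]:
    """Collect the distinct values of the `g` column from a CSV result set.

    Hypothetical helper: assumes a header row containing a `g` column,
    as in the `s,p,o,g` output shown in this thread.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["g"] for row in reader if row.get("g")}


# Two rows taken from the results above: one Open Graph triple and one
# schema.org triple, both currently reported under the page URL as graph.
sample = """s,p,o,g
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/site_name,IMDb,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/contentRating,PG-13,https://www.imdb.com/title/tt1160419/
"""

graphs = distinct_graphs(sample)
print(len(graphs))  # prints 1: everything sits in a single graph for now
```

With the current behaviour the count stays at 1 whether or not the page embeds RDF, which is why separate named graphs per source would make this check useful.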
A great many pages contain RDF data via Schema.org (in microdata, json-ld, rdfa). There are also other vocabularies which use those syntaxes. Does SPARQL Anything represent that data naturally, or could it be adapted to do so?