Using HDT (and other 'hybrid' data) on a hybrid Pod

j-steinbach commented 1 year ago

Pitch

The "What's in a pod" vision interprets a Solid pod as a hybrid knowledge graph (KG).
It is possible to store both raw data/documents and RDF triples.
But what if a document is both? E.g. an HDT file is compressed, serialized RDF data.
How should this data be used? Does it belong to the KG by default? How do we read and traverse it?
- If it is part of the pod KG, then how do we extend/add more triples to it?
- If it is not, then how do we store and use large amounts of data/triples on a pod?
- Querying an HDT file with Comunica is faster and more efficient than querying the equivalent Turtle file. Depending on the machine and its memory, Comunica often cannot query large collections at all and fails with an OOM error (around 10 million triples, for example DBnary).
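
For context on why HDT is the more compact option: HDT (Header, Dictionary, Triples) stores each distinct term once in a dictionary and represents triples as integer IDs. The sketch below is only a toy illustration of that idea. It is not the real HDT layout, which uses separate dictionary sections and compressed structures, and the `ex:` terms are made up.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
EncodedTriple = Tuple[int, int, int]


def encode(triples: List[Triple]) -> Tuple[List[str], List[EncodedTriple]]:
    """Replace every term by an integer ID; each distinct term is stored once."""
    ids: Dict[str, int] = {}
    terms: List[str] = []

    def term_id(term: str) -> int:
        # Assign the next free ID the first time a term is seen.
        if term not in ids:
            ids[term] = len(terms)
            terms.append(term)
        return ids[term]

    encoded = [(term_id(s), term_id(p), term_id(o)) for s, p, o in triples]
    return terms, encoded


def decode(terms: List[str], encoded: List[EncodedTriple]) -> List[Triple]:
    """Look the IDs up again to recover the original triples."""
    return [(terms[s], terms[p], terms[o]) for s, p, o in encoded]


# Hypothetical dictionary-style data, loosely modelled on the use case.
data = [
    ("ex:cat", "ex:partOfSpeech", "ex:noun"),
    ("ex:dog", "ex:partOfSpeech", "ex:noun"),
    ("ex:cat", "ex:relatedTo", "ex:dog"),
]
terms, encoded = encode(data)
```

Repeated terms such as `ex:partOfSpeech` cost one dictionary entry plus a small integer per use, which is where much of the size reduction over a plain-text serialization comes from.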

Desired solution

Be able to put an HDT file (or similar) on a pod, query it, and extend it as if it were part of the KG.
Have it interoperate with Turtle/Quads/... (query them at the same time)
Also be able to use it as a 'regular' file. (E: This should also work the other way around. Can we use/edit/display e.g. Turtle files as regular text files?)

Acceptance criteria

Have an HDT file on a pod together with some non-HDT triples (a Turtle file)
Have both files interoperate (both get traversed/queried together)
Extend the HDT file (how?)
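
One possible answer to the "how?" is an overlay: keep the HDT file read-only (HDT itself does not support updates) and store new triples in a separate writable resource, then query the union of both. The sketch below is a minimal in-memory illustration of that idea, not an implementation for CSS or Comunica. `OverlayGraph` and the `ex:` terms are hypothetical, a frozen set stands in for the HDT file, and a plain set stands in for the Turtle file.

```python
from typing import FrozenSet, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class OverlayGraph:
    """Read-only base (stand-in for HDT) plus writable additions (stand-in for Turtle)."""

    def __init__(self, base: FrozenSet[Triple]) -> None:
        self.base = base
        self.additions: Set[Triple] = set()

    def add(self, triple: Triple) -> None:
        # New triples never touch the base; they only go to the overlay.
        if triple not in self.base:
            self.additions.add(triple)

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples from both sources; None acts as a wildcard."""
        pattern = (s, p, o)
        for source in (self.base, self.additions):
            for triple in source:
                if all(want is None or want == got for got, want in zip(triple, pattern)):
                    yield triple


graph = OverlayGraph(frozenset({("ex:cat", "ex:partOfSpeech", "ex:noun")}))
graph.add(("ex:cat", "ex:example", '"The cat sat on the mat."'))
cat_triples = sorted(graph.match(s="ex:cat"))
```

In this sketch the caller sees one graph, while the base file stays byte-for-byte unchanged. Deleting triples that live in the base would need extra bookkeeping that is not shown here.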

Pointers

This might be a CSS issue
This could also be relevant in the context of 'plug & play' RDF data: people can 'extend' their pod with a hashed, signed HDT file in cases where the remote LOD server or aggregator is not trustworthy. This also gives the pod owner more control over the data (in case the remote LOD server gets shut down or acquired).
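
The "hashed" part of that idea can be sketched with the Python standard library. The example below assumes the publisher distributes a SHA-256 digest alongside the HDT file; the choice of SHA-256 and the function names are assumptions, and verifying a signature over that digest would need a cryptography library, which is not shown.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large HDT files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_untampered(path: str, expected_hex: str) -> bool:
    """Compare the local file's digest with the published one."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


# Hypothetical usage, with a made-up file name and a digest obtained from the publisher:
# is_untampered("dictionary.hdt", published_digest)
```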

Scenarios

Use-Case / Origin

I want to put the Wiktionary data on a pod and then be able to re-create dictionary entries from the RDF data. I also want to be able to extend/annotate the dictionary entries (add new triples: my own example sentences, related words, ...) and export the data.

[The data is available as .ttl and .hdt. Comunica fails to read/query the Turtle data because it goes OOM (locally on the CLI, 16 GB RAM). The HDT however works.]

rubensworks commented 1 year ago

HDT would definitely be a good match as back-end for certain Solid use cases (mostly for non-write-intensive cases, since HDT doesn't support updates).

Related to this is the need to expose a query interface at pod level (or container level) that could be backed by triple stores such as HDT (https://github.com/SolidLabResearch/Challenges/issues/43). This would remove the requirement for the client to understand HDT (which can be quite tricky); the client would only have to interact with the query API.
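
To illustrate what "only interacting with the query API" could look like from the client side, here is a minimal sketch that assumes the pod exposes a SPARQL 1.1 Protocol endpoint. The endpoint URL is hypothetical, and whether the query interface would be SPARQL or something else (e.g. Triple Pattern Fragments) is an open question in this issue.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol 'query via GET' request.

    Assumes `endpoint` has no query string of its own.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Hypothetical pod-level endpoint; the client never sees the HDT file behind it.
request = build_sparql_request(
    "https://pod.example/sparql",
    "SELECT * WHERE { ?s ?p ?o } LIMIT 10",
)
# Sending it would be: urllib.request.urlopen(request)
```

The client code is the same whether the server answers from HDT, Turtle documents, or a mix of both.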

Related work:

Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., ... & Colpaert, P. (2016). Triple pattern fragments: a low-cost knowledge graph interface for the web. Journal of Web Semantics, 37, 184-206.
Azzam, A., et al. (2020). SMART-KG: hybrid shipping for SPARQL querying on the web. Proceedings of The Web Conference 2020.
Azzam, A., et al. (2021). WiseKG: Balanced access to web knowledge graphs. Proceedings of the Web Conference 2021.

j-steinbach commented 1 year ago

(Unrelated, but maybe also interesting: Is it possible to export parts of the KG? Maybe as HDT :))

E: Similar to how we select tables in SQL and then export them. Create a view > export.

rubensworks commented 1 year ago

(Unrelated, but maybe also interesting: Is it possible to export parts of the KG? Maybe as HDT :)) E: Similar to how we select tables in SQL and then export them. Create a view > export.

Certainly, such materialized views are really interesting for query optimization.
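
As a rough illustration of the "create a view > export" idea, the sketch below selects a subset of triples and serializes it as N-Triples lines. It assumes the terms are already valid N-Triples terms, and the `http://example.org/` IRIs are made up. Producing an actual HDT file from the result would need an HDT toolchain, which is not shown.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def select_view(triples: List[Triple], predicate: str) -> List[Triple]:
    """Materialize the 'view': every triple that uses the given predicate."""
    return [t for t in triples if t[1] == predicate]


def export_ntriples(triples: List[Triple]) -> str:
    """Serialize triples whose terms are already in N-Triples syntax."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


kg = [
    ("<http://example.org/cat>", "<http://example.org/relatedTo>", "<http://example.org/dog>"),
    ("<http://example.org/cat>", "<http://example.org/example>", '"The cat sat on the mat."'),
]
exported = export_ntriples(select_view(kg, "<http://example.org/relatedTo>"))
```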

pheyvaer commented 1 year ago

[ ] The acceptance criteria have to be more concrete. It has to be a list of steps that the user should be able to complete once the solution is provided.
[ ] Scenarios need to be in a separate issue. There is a template for scenarios that you need to use.
