
IPFS-backed Semantic Web / Distributed SPARQL event log and distributed compute #1566

Closed andrewzhurov closed 1 year ago

andrewzhurov commented 1 year ago

Open Grant Proposal: IPFS-backed Semantic Web / Distributed SPARQL event log and distributed compute

Project Name: Web3Query (temporary name, a work in progress)

Proposal Category: Developer and Data Tooling & Integrations (please advise which is best suited)

Individual or Entity Name: Individual

Proposer: andrewzhurov

Do you agree to open source all work you do on behalf of this RFP under the MIT/Apache-2 dual-license?: Yes

Project Overview

Web3Query architecture diagram sketch

Project Summary

The goal of the project is to provide web developers with a framework for building Semantic Web applications on top of distributed personal event logs. The architecture resembles that of Solid, except that a Solid Pod would contain not an RDF Store but an event log, out of which an RDF Store is derived (much like databases that capture transactions as a log and derive an SQL/NoSQL view from it).

The log is content-addressable, which gives us immutability and spares the need to depend on a location as a way of addressing: anyone can download the exact version of the log by its hash, eliminating link rot. Because the log is stored on IPFS, there is no need to keep a dedicated server for it and no need to trust whoever hosts it. The log is replicated between peers that wish to collaborate on it (be it across devices of one user or across multiple users), and can be persisted to Web3.Storage to increase availability when no peers are online.

The log is implemented on top of OrbitDB and is a CRDT, which allows peers to contribute to their local replicas (offline-first) and sync up with others, achieving eventual consistency: peers eventually end up with the same version of the log and derive the same RDF Store from it. And because the log captures events rather than data, it has transactional guarantees: transactions issued in parallel get serialized (executed as though in sequence), eliminating conflicts.

Deriving the RDF Store resembles a pure function of the event log and the event handlers: deriveRDFStore(eventLog, eventHandlers). To increase performance, a dependency graph of transactions can be derived and evaluation parallelized whenever possible. In addition, evaluation results can be cached and made publicly available, allowing peers to adopt the evaluation results of others rather than computing them themselves (which becomes increasingly valuable as logs grow and more peers work on them). This opens up the possibility of distributed compute: peers may orchestrate evaluation of an event log and incentivize third parties to contribute, e.g., delegating expensive computation to parties tailored for it (say, providers of analytical computation on GPUs).
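To make the derivation step concrete, here is a minimal sketch; the event shapes, handler signatures, and the deriveRDFStore name are illustrative assumptions, not a finalized API:

// Minimal sketch: derive a store of triples from an event log by folding
// event handlers over it. Each handler is a pure function (event) -> triples.
function deriveRDFStore(eventLog, eventHandlers) {
  const store = [];                                  // stand-in for a real RDF store
  for (const event of eventLog) {
    const handler = eventHandlers[event.type];
    if (!handler) throw new Error(`No handler for event type: ${event.type}`);
    store.push(...handler(event));                   // append the derived triples
  }
  return store;
}

// Hypothetical usage:
const eventHandlers = {
  befriend: (e) => [{ s: e.author, p: ':friends-with', o: e.target }],
};
const store = deriveRDFStore(
  [{ type: 'befriend', author: ':Alice', target: ':Bob' }],
  eventHandlers
);

Because the function is pure, the same log and handlers yield the same store on every peer, which is what makes caching and sharing of evaluation results safe.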

One of the goals is to make this framework as easy as possible for developers to use. For that, they are provided with the LDflex interface, a dot-notation interface for assembling SPARQL (a kind of jQuery for SPARQL). But rather than executing LDflex expressions on the client, they are captured as data (JSON) and dispatched to an event log, to be executed later client-side by each peer. Compared to capturing exact SPARQL transactions in a log, this allows us to: 1) more easily validate what gets accreted to an event log; 2) supply custom event handlers that derive the exact SPARQL, allowing for upgradeability of the domain model and decomplecting the user's intent (an event) from the domain model, so developers do not hardcode a notion of a domain model into their applications; 3) align domain models across logs over time, increasing the potential for interoperability (there are many ontologies that describe the same thing, and developers are free to switch between them as they see fit). Increased interoperability is valuable in the world of linked data.
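As an illustration, an LDflex-style call could be captured as a plain JSON event and only later turned into SPARQL by a handler; the event shape and the befriendToSparql handler below are hypothetical:

// Instead of executing `await bob.befriend()` directly, capture the intent as data:
const befriendEvent = {
  type: 'befriend',
  author: ':Alice',
  target: ':Bob',
  issuedAt: '2023-06-01',
};

// A handler (chosen by the app, swappable without rewriting stored events)
// derives the concrete SPARQL UPDATE for the ontology currently in use.
// Prefixed names like :Alice are abbreviated for illustration.
function befriendToSparql(e) {
  return `PREFIX foaf: <http://xmlns.com/foaf/0.1/>
          INSERT DATA { ${e.author} foaf:knows ${e.target} . }`;
}

Swapping befriendToSparql for a handler that targets a different ontology changes the derived store without touching the stored events.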

Finally, since logs are immutable, links between logs inherit the same trait: we know the exact data that has been linked to. That is a very important trait for a knowledge system - if we refer to something, we want it to stay the same - a robust foundation.

In addition, it becomes possible to issue as-of queries, querying the log at a particular point in time. On top of that, log entries are signed by their author, giving us greater provenance - knowing where data comes from. As an additional feature we may allow log entries to be supplemented with a validTime, specifying when an event is meant to take place. This allows adding events to the past or the future and overriding events, achieving bitemporality, much like it's done in XTDB.

For a better UX the framework allows for reactive data flow: whenever an event occurs, the store gets recalculated, client-side queries get recalculated, and views are supplied with the new value. This is akin to Redux (or re-frame) - the Flux architecture.
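A rough sketch of that loop, reusing the deriveRDFStore sketch above; the log's add/all methods and the subscription API are assumptions:

// Hypothetical reactive loop: dispatching an event appends it to the log,
// the store is re-derived, and subscribed queries re-run to feed the views.
const subscribers = new Set();

function subscribe(runQuery, onResult) {
  subscribers.add(async (currentStore) => onResult(await runQuery(currentStore)));
}

async function dispatch(log, event, eventHandlers) {
  await log.add(event);                                // append to the shared event log
  const currentStore = deriveRDFStore(await log.all(), eventHandlers);
  for (const rerun of subscribers) await rerun(currentStore); // recompute client-side queries
}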

Overall, this project aims to give developers an easy interface for working with the Semantic Web on top of an immutable foundation, embracing p2p collaboration and a back-end-less app architecture. For users it offers the values of Solid - 1) data sovereignty, escaping walled gardens; 2) application interoperability; 3) interlinking data - as well as the values of p2p collaboration: no need to keep their data in a specific location, a more robust data foundation (without link rot), increased provenance, and the capture of the user's intent, which allows for dynamic data models.

Impact

The goal of the project is to provide an easy-to-use framework for building next-gen Web apps with the Semantic Web, backed by OrbitDB and IPFS.

At the risk of repeating myself:

Values from using RDF

RDF is a graph data model that allows for great flexibility in modeling and interlinking. IPFS developers will be empowered with graph-based tools that let them model their application data with greater expressivity. In addition, applications are not constrained to work on the personal data of one user: thanks to interlinking, it is possible to refer to the data of others, creating one global graph out of personal graphs. RDF is a mature standard that keeps improving (e.g., the latest work of the RDF-star working group aims to make RDF* part of the RDF spec, allowing for greater expressivity), and we can expect the existing tools we'll provide to developers (LDflex and Comunica) to add support for it and other future enhancements.

Values from using Comunica

Comunica is a federated query framework that allows running SPARQL queries over multiple data sources, which again means we are not constrained to the data source of one user. Comunica executes SPARQL queries on the client, right from the browser, allowing applications to be back-end-less and complementing the p2p design of apps. Comunica can stream query results, which may be of value for heavy queries, allowing for a snappier UX, as users don't need to wait until the query finishes before seeing some results.
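For reference, a minimal federated query with Comunica's query API might look like the sketch below; the two source URLs are placeholders for personal data sources:

import { QueryEngine } from '@comunica/query-sparql';

// One query, several sources; results are streamed as they arrive.
const engine = new QueryEngine();
const bindingsStream = await engine.queryBindings(`
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE { ?person foaf:name ?name } LIMIT 10
`, {
  sources: [
    'https://alice.example/profile',   // placeholder source
    'https://bob.example/profile',     // placeholder source
  ],
});
bindingsStream.on('data', (binding) => console.log(binding.get('name').value));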

Values from using LDflex

LDflex aims to be an easy interface to RDF for web developers, giving them the familiar dot notation to query and transact RDF. It is well suited for simple use-cases (which cover most use-cases found in applications), and it can be powered by Comunica, giving us the same federated querying. This may well be the easiest interface for IPFS developers to build their apps with. In addition, it decouples the app from the domain model, as mentioned before.
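Following the pattern in the LDflex documentation, querying a person's data could look roughly like this; the profile URL and the JSON-LD context are illustrative:

import { PathFactory } from 'ldflex';
import ComunicaEngine from '@ldflex/comunica';
import { namedNode } from '@rdfjs/data-model';

// JSON-LD context that maps dot-notation properties onto RDF predicates
const context = { '@context': { '@vocab': 'http://xmlns.com/foaf/0.1/', friends: 'knows' } };
const queryEngine = new ComunicaEngine('https://alice.example/profile'); // placeholder source
const paths = new PathFactory({ context, queryEngine });

const alice = paths.create({ subject: namedNode('https://alice.example/profile#me') });
console.log(await alice.name);                                  // resolves foaf:name via SPARQL
for await (const name of alice.friends.name) console.log(name); // names of Alice's friends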

Values from using distributed event log as CRDT

An event log captures the user's intent rather than a data representation derived from it. This allows deriving different data representations as seen fit, be it to adopt the same ontology, increasing app interoperability, or to derive a data model tailored to a use-case (e.g., a model tailored for efficient querying on GPUs, such as the Apache Arrow tabular format). In addition, events are easier to reason about (including programmatically), which may be of use, for example, to constrain what kinds of events are allowed in a store. It also gives us transactional guarantees: since the event log is a CRDT and gets serialized (executed as though in sequential order), and SPARQL transactions are derived out of events, we end up with a transaction log that is eventually consistent across peers. This allows issuing (potentially conflicting) events in parallel and having them conflict-resolved without any effort required from developers.

Having an eventually consistent log allows for offline-first apps, making life easier for developers and giving users a responsive UX.
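As a rough sketch (API names follow the classic OrbitDB event log store and js-ipfs; exact calls differ between OrbitDB versions), appending to and reading the replicated log could look like this:

import { create } from 'ipfs-core';
import OrbitDB from 'orbit-db';

const ipfs = await create();                              // local IPFS node
const orbitdb = await OrbitDB.createInstance(ipfs);
const log = await orbitdb.eventlog('web3query.alice');    // append-only, replicated CRDT log

// Peers append intents locally (offline-first) and replicas converge:
await log.add({ type: 'befriend', author: ':Alice', target: ':Bob' });

// Every peer derives the same RDF store from the same serialized log:
const events = log.iterator({ limit: -1 }).collect().map((entry) => entry.payload.value);
const derivedStore = deriveRDFStore(events, eventHandlers); // reuses the sketch above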

The event log gives us provenance: where data comes from and who its author is. Each data item (RDF triple) captures, in its metadata (via RDF*), which event it was derived from, and each event contains its author and is signed by them. So we can trace back where each data item comes from. E.g., Alice made friends with Bob on 1st June 2023 by issuing

await bob.befriend()
<<:Alice :friends-with :Bob>> :derived-from-event "hash of event"

Immutable event logs let us have immutability at the data level without the burden of managing immutable data directly, as each data item that references a node carries, in its metadata, the hash of that node as of the time of reference. E.g., the exact state of the referenced :Bob node is captured by a hash, added to the event, and available in RDF:

<<:Alice :friends-with :Bob>> :content-hash "hash of Bob's entry"

Then Alice will always be able to get the exact content she referenced, if she so wishes, and trace back to where this content came from. This absence of link rot is, as it seems to me, a must for a robust knowledge foundation.

Support for metadata in RDF has just been made official by the recently published RDF 1.2 draft, and it is already implemented in Comunica. I believe adding support to LDflex is a matter of time, and I may even contribute to it.

In addition, immutable event logs allow as-of txTime queries. Speaking of temporality,

Value from having bitemporality of events

Adding support for a validTime on events will allow inserting them into the past or the future and superseding existing events, giving control over what the serialized log should look like. The benefits of bitemporality are excellently described by XTDB. Bitemporality allows as-of validTime queries. In addition, temporal information is kept as metadata on each data item. E.g.,

<<:Alice :friends-with :Bob>> :valid-from "2023-06-01"

This allows executing temporal queries, with access to temporal data from within the queries! E.g., Alice can find all friends she made during June 2023, returning Bob.
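As a sketch, such a temporal query could be phrased in SPARQL-star over the per-triple :valid-from metadata; the prefixes and property names are illustrative and match the examples above:

// Hypothetical SPARQL-star query: friends Alice made during June 2023.
const friendsMadeInJune = `
  PREFIX :    <https://example.org/>
  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
  SELECT ?friend WHERE {
    << :Alice :friends-with ?friend >> :valid-from ?from .
    FILTER("2023-06-01"^^xsd:date <= ?from && ?from < "2023-07-01"^^xsd:date)
  }`;

Such a query string could be run with the same client-side Comunica engine sketched earlier.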

Removing a data item adds :valid-to metadata, so the information is not removed and can still be inspected in historical queries (e.g., the query above will still return Bob), but it won't appear in as-of-now queries. E.g., Alice on 1st July 2023 issues

await bob.unfriend()

It'll be captured as metadata

<<:Alice :friends-with :Bob>> :valid-to "2023-07-01"

Then querying for friends of Alice as of validTime 1st July 2023 won't return Bob.

There are many use-cases for bitemporality, both complex and simple, such as

await alice.friends.bob.validFrom

returning "2023-06-01". This can be used on any fact out there. Or

 await alice.posts.latest.history

returning all versions of the latest post that Alice made.

Summing up, bitemporality is a powerful tool; having it as a first-class citizen in our db should make it a robust foundation for knowledge management, open up interesting use-cases, and be a pleasure to work with.

Values from having distributed compute of event logs

Allowing users to orchestrate computations may open new use-cases for applications. For example, a user may run a computation on their laptop and use the results on their mobile, increasing speed and reducing battery consumption; or users may collaborate on expensive computations.

Values from having Compute as a Service

Some expensive computations could be delegated to dedicated compute providers (that, say, run them on GPUs), drastically increasing speed. This allows for new use-cases, such as researchers working on big data. It creates an ecosystem, attracting more actors and funds into the IPFS ecosystem and increasing its growth.

Other values

This framework can be adopted by Semantic Web developers, attracting more fine engineers to the IPFS ecosystem. It also seems relatively easy to migrate existing Semantic Web apps that use LDflex over to an IPFS backend. And we can expect more users to be attracted to managing their data with IPFS.

Risks of not getting it right

Shipping a half-baked solution may create a bad reputation, discouraging developers from using the framework and users from using IPFS- and Semantic-Web-backed applications. We need to nail it.

Distributed compute may be maliciously used to distribute incorrect results. Designing a proper trust system for results will be crucial.

What impact will this project have in a specific vertical, market, or ecosystem?

Over time, users will be pulled towards the ecosystem of personal stores (the vision of Solid), moving away from the silos of Web2 apps. This framework seems to provide a robust data foundation (immutable, without link rot, and with provenance) that will be of use for building humanity's knowledge on top of. Perhaps over time Web2 will migrate over to this data layer.

What does success look like?

Our project is an open-source framework meant to be used for the development of next-gen web apps. The primary actors we need to engage are web developers. The project needs to provide them with value at as little (educational) cost as possible. Success would mean wide adoption by developers, apps built with it, and subsequently more users.

Secondary actors are compute providers and those who need intensive computations to be made (e.g., researchers operating on big data).

Overall, the ecosystem closely resembles that of the Solid project - create motivational gravity for developers and users to adopt this new way of building web apps and managing data.

Outcomes

Our project is an open-source framework meant to be used for the development of next-gen web apps. The primary actors we need to engage are web developers. The project needs to provide them with value at as little (educational) cost as possible. Success would mean wide adoption by developers, apps built with it, and subsequently more users.

Secondary actors are compute providers and those who need expensive computations to be made (e.g., researchers operating on big data).

Overall, the ecosystem closely resembles that of the Solid project - create motivational gravity for developers and users to adopt this new way of building web apps and managing data.

Adoption, Reach, and Growth Strategies

The primary actors are web developers. To engage them we'll need to demonstrate the value of the framework, which can be done in the form of demo applications, video demos, and video and text tutorials.

Once demo material is available, it can be distributed through any channel, be it YouTube, course platforms, conferences, or social groups.

Development Roadmap

Milestone 1: Design and setup (2 months, June 1 - July 31, 2023)

In this phase, the core infrastructure will be laid down, and the overall design will be finalized, taking into account future scalability and performance.

Expected outcome: Detailed design document, setup of the development environment, infrastructure, and version control. A basic prototype to validate the architecture and core concepts.

Roles and responsibilities:

Funding required: $12000

Milestone 2: Core functionality development (4 months, August 1 - November 30, 2023)

In this phase, the team will implement the main functions of the system, such as the event log mechanism, RDF Store derivation, dependency graph derivation, etc.

Expected outcome: A functioning software with core features implemented, allowing basic use-cases to be achieved.

Roles and responsibilities:

Funding required: $24000

Milestone 3: Extended functionality and optimization (4 months, December 1, 2023 - March 31, 2024)

The focus in this phase will be on enhancing the system's capabilities and optimizing its performance. Distributed computation and caching will be implemented.

Expected outcome: The system is capable of handling more complex use-cases and performs well under load.

Roles and responsibilities:

Funding required: $24000

Milestone 4: Final testing, documentation, release, and tutorial recording (2 months, April 1 - May 31, 2024)

This phase is dedicated to conducting comprehensive testing, writing documentation, preparing for the software's release, and recording instructional tutorials.

Expected outcome: The system is thoroughly tested, well-documented, ready for release, and equipped with tutorial materials.

Roles and responsibilities:

Funding required: $12000

Total Budget Requested

| Milestone # | Description | Funding |
| --- | --- | --- |
| 1 | Design and setup | $12000 |
| 2 | Core functionality development | $24000 |
| 3 | Extended functionality and optimization | $24000 |
| 4 | Final testing, documentation, release, and tutorial recording | $12000 |
| - | Hardware Budget | $4000 |
| | Total | $76000 |

We request a total budget of $76000 for the successful completion of this project.

Maintenance and Upgrade Plans

Post-release, the team will focus on maintaining the software, providing support to users, and implementing upgrades based on user feedback and technology advancements. The project will be open source, and we will encourage community contributions to enhance the system.

Team

Team Members

Relevant Experience

Andrew Zhurov, the Lead Developer, is a software developer experienced in writing full-stack web applications with Clojure and ClojureScript, functional programming languages that embrace immutability. Andrew has worked with Datomic, a graph database akin to RDF with a Datalog interface (similar to SPARQL) that allows as-of queries. He has been part of an excellent team at a Cambridge-based company.

Andrew has made small contributions to Comunica - a client-side Semantic Web graph querying framework - and is looking to contribute to LDflex. He has contributed ideas and a bit of code to Logseq, a graph-fashioned personal knowledge management application.

He is passionate about reimagining what web development can be; to that end, he has researched and developed an alternative application model as part of the brawl-haus pet project, which embraces an event log as the source of truth and offers reactive dataflow down to clients' views.

For the past 2+ years, on personal funding, he has focused on the search for a more robust foundation for the Web. He believes that the Semantic Web, paired with the immutability offered by projects such as IPFS, CRDTs, p2p collaboration on event logs, and reactive dataflow, could be the best solution. He would like to carry on with full-focus R&D on the topic and build a prototype of the framework within the next 12 months.

Team code repositories

Andrew's previous work can be found at: github.com/andrewzhurov

Additional Information

I learned about the Open Grants Program through online research while looking for funding opportunities for open-source projects. For further discussions and general next steps, please feel free to contact me via email at zhurov.andrew@gmail.com or via IPFS discord @andrewzhurov.

ErinOCon commented 1 year ago

Hi @andrewzhurov, thank you for your proposal and for your patience with our review. We will not be proceeding with a grant at this time, but would be happy to evaluate a new proposal after the official orbit db release with the fixes and integration completed. A user growth plan would also be helpful in the future!

Congrats on all the work accomplished so far. Wishing you the best as you continue to build!