datafoodconsortium / connector-codegen

An Acceleo project to generate the source code of the DFC connector into all the supported programming languages.
MIT License

DFC Connector Data Capture Feature & Store #24

Open RaggedStaff opened 5 months ago

RaggedStaff commented 5 months ago

Discussed in https://github.com/orgs/datafoodconsortium/discussions/30

Originally posted by **jgaehring** March 25, 2024

## Objective

Enable remote data capture functionality in the DFC connector, as requested by the FDC Governance Circle, so that data may be captured within the DFC Network and relayed to an independent triple store that will act as a Data Commons.

## Proposal

While we discussed the possible necessity of incorporating the data capture mechanism into the code generator's templates, I've realized that may not ever be necessary or even desirable. In all three implementations of the connector, the core request/response logic can be found within the main `Connector` class or its modules (such as the `JsonldStream` importer and exporter in the case of the TypeScript implementation), which are all contained within the static code directories and not produced through code generation. Because these import/export methods are indirectly invoked by all semantic object subclasses' getters, setters, adders and removers, they would be the ideal place to inject optional hooks that could extend the import/export behavior.

A good model for this kind of extension might be the axios library's [interceptor] pattern:

```js
// Add a request interceptor
axios.interceptors.request.use(function (config) {
  // Do something before request is sent
  return config;
}, function (error) {
  // Do something with request error
  return Promise.reject(error);
});

// Add a response interceptor
axios.interceptors.response.use(function (response) {
  // Any status code that lies within the range of 2xx causes this function to trigger
  // Do something with response data
  return response;
}, function (error) {
  // Any status code that falls outside the range of 2xx causes this function to trigger
  // Do something with response error
  return Promise.reject(error);
});
```

Internally the axios interceptors are [private members] of the [`InterceptorManager`], with a separate instantiation for the request and response cycles. The interceptors can also be "ejected":

```js
const myInterceptor = axios.interceptors.request.use(function () {/*...*/});
axios.interceptors.request.eject(myInterceptor);
```

[interceptor]: https://axios-http.com/docs/interceptors
[private members]: https://github.com/axios/axios/blob/d844227411263fab39d447442879112f8b0c8de5/lib/core/Axios.js#L24-L28
[`InterceptorManager`]: https://github.com/axios/axios/blob/d844227411263fab39d447442879112f8b0c8de5/lib/core/InterceptorManager.js

Some consideration should be given to the API for the connector and the corresponding getters and setters that will actually invoke the capturing logic. The getters and setters can differ in behavior, with some being synchronous and others asynchronous, while the capturing behavior will always be asynchronous. But we could generally take an approach such as the following:

```ts
const loggerRef = connector.interceptors.import.use(logger);
```

Where `logger` could be a function (or two functions, to handle both success and error results), or an instance of a `Logger` class with a wider variety of configurable options, or both.

As for the triple store, where logs will be sent, there are a lot of options. To begin, the DFC prototype could be used for running integration tests in the local development environment. If that achieves much of the desired outcomes, a fork of it could be prepared for deployment. A more customized solution could be built with [SemApps], but might require more development. Another extenuating factor is the degree to which OFN's stakeholders would like this store to be integrated with OFN's core software and regional server instances, as opposed to a totally independent server that core OFN knows nothing about.

[SemApps]: https://semapps.org/

It may be difficult to judge with much accuracy the cost and time required to stand up a maintainable instance of the triple store until those decisions are made and a more detailed conversation takes place. In any case, however, the proposed logging interceptor should work just the same, since the only parameter it will strictly require should be a location to send the logs to. Different logging interceptors can be adapted to different behaviors as desired, and even combined, since the pattern allows multiple interceptors to be registered. The flexibility of the interceptor pattern may in fact allow for more incremental development of the triple store and how it is deployed to production.

## Requirements

- Implement the `.import.use()` and `.export.use()` methods, a general interface for the function or `Interceptor` class they would each accept as arguments, and the implementations of those functions or classes as the actual `ImportLogger` and `ExportLogger`. Obviously, the names for all these classes and methods can be decided upon later. These will first be implemented in TypeScript.
- Write appropriate unit tests for these interceptors and the data capture implementation(s), extending the existing TypeScript connector tests as appropriate. These will only mock the intended triple store behavior.
- Pending further discussion, develop integration tests that can run against a local instance of a triple store, possibly based on the DFC prototype or SemApps, that can receive and store JSON-LD logs. Preferably this local instance will be containerized so it's easy to replicate on a staging server, or perhaps as the basis for a store that can eventually go into production for the data commons.

## Milestones

1. TypeScript connector's `import.use()` and `export.use()` methods, interfaces, classes, and corresponding unit tests.
2. Local triple store and integration tests of the connector's interceptor API and the data capture interceptors specifically.
3. Staging server and/or production deployment of the triple store.

## Estimated Time and Cost

Milestones 1 and 2 will each require roughly 15 hours of development time, and their order is more or less interchangeable. Depending on decisions about how best to develop, test, and deploy the triple store, milestone 3 could vary widely, potentially as little as 6-12 development hours, or over 30 dev hours if more customization is required beyond simply running an off-the-shelf solution. Similarly, milestone 4 is difficult to assess at this time, but would require at least the same amount of dev hours, possibly more.

| # | Description | Dev Hrs | Est. Cost | Duration |
| :---: | :----------------------------- | ------: | ------------: | :-------: |
| 1 | Connector features | 24 - 30 | $2520 - $3150 | 1 - 2 wks |
| 2 | Integration testing | 6 - 30 | $630 - $3150 | 1 - 3 wks |
| 3 | Staging/production deployments | 12 - 60 | $1260 - $6300 | 2 - 6 wks |

The contingencies in milestones 2 and 3 make this a very imprecise estimation, __costing anywhere from $4,410 to $12,600 and taking 1 to 3 months to complete__. We can speak in further detail on the expectations for the triple store as we go ahead with the connector features, or wait until a clearer set of requirements can be determined for all 3 milestones.
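To make the proposed `use()`/`eject()` API concrete, here is a minimal sketch of an interceptor manager in the style described above. Only the method names follow the proposal; the class shape, the `Interceptor` type, and the success/error handler pair are illustrative assumptions, not the eventual connector implementation:

```typescript
// Sketch of an axios-style interceptor manager for the connector's
// import/export cycle. Names and types are illustrative only.

type OnFulfilled<T> = (value: T) => T | Promise<T>;
type OnRejected = (error: unknown) => void | Promise<void>;

interface Interceptor<T> {
  onFulfilled: OnFulfilled<T>;
  onRejected?: OnRejected;
}

class InterceptorManager<T> {
  // Ejected interceptors are nulled out so ids stay stable, as in axios.
  private handlers: Array<Interceptor<T> | null> = [];

  // Register success (and optional error) handlers; returns an id for eject().
  use(onFulfilled: OnFulfilled<T>, onRejected?: OnRejected): number {
    this.handlers.push({ onFulfilled, onRejected });
    return this.handlers.length - 1;
  }

  // Remove a previously registered interceptor.
  eject(id: number): void {
    if (this.handlers[id]) this.handlers[id] = null;
  }

  // Run every active interceptor over a value, in registration order.
  async run(value: T): Promise<T> {
    let result = value;
    for (const h of this.handlers) {
      if (!h) continue;
      try {
        result = await h.onFulfilled(result);
      } catch (error) {
        if (h.onRejected) await h.onRejected(error);
        throw error;
      }
    }
    return result;
  }
}
```

A `connector.interceptors.import.use(logger)` call would then boil down to `use()` on an instance of this manager, with a separate instance for the export cycle, mirroring axios's request/response split.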

Further discussions have highlighted that the Semantizer libraries are having functionality upgraded to support mixins. This is a dependency for this work: the Data Capture functionality will be included as a mixin.

jgaehring commented 4 months ago

I think to move forward there are two main blockers for now.

  1. I need to consult with Maxime to understand better how mixins work in assemblee-virtuelle/semantizer-typescript, so that something compatible can be included into the TS connector.
  2. An understanding of the production requirements for FDC Governance Circle, such as, where the triple store should be hosted, to what extent should it be integrated with the OFN UK instance, etc. This will help to narrow down the estimates for time and cost on Milestones 2 & 3 listed in the table above.
RaggedStaff commented 4 months ago
> 2. An understanding of the production requirements for FDC Governance Circle, such as, where the triple store should be hosted, to what extent should it be integrated with the OFN UK instance, etc. This will help to narrow down the estimates for time and cost on Milestones 2 & 3 listed in the table above.

@jgaehring The triple store will be separate from all participating platforms. I'd have a preference to stand something up on the Infomaniak Jelastic Cloud instance we're using to host the Shopify apps.

At this stage we aren't trying to integrate with anything... just (securely) store the data somewhere, so it can be managed by the members as their data commons in the future.

Let's have a quick chat about what might work... are you around tomorrow? I'm free 1-2pm or 4-4:30 (UK).

On the other blocker - @lecoqlibre is on vacation this week, but I think he's around next week... maybe we should all talk together next week?

jgaehring commented 3 months ago
> 1. I need to consult with Maxime to understand better how mixins work in assemblee-virtuelle/semantizer-typescript, so that something compatible can be included into the TS connector.

For my own sake, I'm just noting the snapshot of the semantizer's mixin implementation as it stands right now, although it is considered unstable:

https://github.com/assemblee-virtuelle/semantizer-typescript/blob/61c5ddbcde51fbc7469ac315169ac8b42a74d194/src/test/src/index.ts
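For readers unfamiliar with the pattern, TypeScript mixins are usually written as class factories over a generic constructor type. The sketch below shows the general idiom only; it is not semantizer's actual implementation (which is linked above), and the `Capturable` name is hypothetical:

```typescript
// Generic TypeScript mixin idiom: a mixin is a function that takes a base
// class and returns a subclass adding behavior. All names are illustrative.

type Constructor<T = {}> = new (...args: any[]) => T;

// Hypothetical mixin adding data-capture hooks to any base class.
function Capturable<TBase extends Constructor>(Base: TBase) {
  return class extends Base {
    captured: string[] = [];
    capture(data: string): void {
      this.captured.push(data);
    }
  };
}

class Semanticable {
  export(): string {
    return "{}"; // stand-in for real JSON-LD serialization
  }
}

// Compose the mixin onto the base class.
class CapturableObject extends Capturable(Semanticable) {}
```

Because mixins compose at class-definition time, the data capture behavior could be layered onto generated semantic object classes without touching the code generator's templates, which is presumably why the mixin upgrade is a dependency for this work.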

RaggedStaff commented 2 months ago

@jgaehring Notes from our call:

We agreed to modify the export function(s) in the static area of connector-codegen (for TS, Ruby & PHP) to check a parameter and, if TRUE, POST the exported JSON-LD to our triple store.

We'll start with the PHP version (Big Barn), then TS, then Ruby.

jgaehring commented 2 months ago

As discussed in today's tech call, this is the relevant part of the TypeScript codegen implementation (pending merge of PR #20) where the call to semantizer's .export() method will be wrapped with the Data Capture logic, which basically just needs to call .export() again with the new destination:

https://github.com/datafoodconsortium/connector-codegen/blob/2c8507af85919862669ef8a989bb2679f553dc78/src/org/datafoodconsortium/connector/codegen/typescript/static/src/Connector.ts#L182-L188

That "wrapper" can be moved lower down the stack to the internals of the semantizer once it reaches its next stable release, and that later change shouldn't require breaking changes to either the connector's or the semantizer's APIs. Therefore, I believe there should be no problem implementing the data capture feature with the existing alpha version of the semantizer (costs prohibit upgrading it in the near future regardless) without incurring significant tech debt once the stable release becomes available.
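As a rough illustration of what such a wrapper might look like: the sketch below decorates an export function and relays the result to a triple store. The `dataCaptureUrl` option, the `ExportFn` shape, and the fetch-based POST are all assumptions for illustration; the real method lives in the `Connector.ts` linked above:

```typescript
// Hypothetical wrapper that relays exported JSON-LD to a triple store.
// Names and the POST mechanics are illustrative, not the actual connector API.

type ExportFn = (objects: unknown[]) => Promise<string>;

function withDataCapture(exportFn: ExportFn, dataCaptureUrl?: string): ExportFn {
  return async (objects: unknown[]): Promise<string> => {
    const jsonld = await exportFn(objects);
    if (dataCaptureUrl) {
      // Fire-and-forget POST; a capture failure must not break the export itself.
      fetch(dataCaptureUrl, {
        method: "POST",
        headers: { "Content-Type": "application/ld+json" },
        body: jsonld,
      }).catch(() => { /* log and ignore */ });
    }
    return jsonld;
  };
}
```

Keeping the capture step outside the serialization path like this is what should let it migrate into the semantizer's internals later without an API break.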

lecoqlibre commented 1 month ago

What do you think about using the observer pattern to decouple the data-capture feature from the connector itself?

We would have a method to register a new observer for the export method, like `connector.registerCallbackForExport(callback: (exported: string) => void)`.

Each time the `connector.export()` method is called, every registered callback will be triggered. This mechanism can be used for any other export-related feature.

In the client code you want to capture data from, you will just have to register a handler of your choice (which can be implemented in a separate package, even a DFC-related one if you want, like `@datafoodconsortium/connector-data-capture`).
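A minimal sketch of that observer hook, assuming the method name suggested above; the class shape and serialization are placeholders, not the connector's actual code:

```typescript
// Observer-pattern sketch: the connector keeps a list of export callbacks
// and notifies each one after every export. Only registerCallbackForExport
// follows the suggestion above; everything else is illustrative.

type ExportCallback = (exported: string) => void;

class ObservableConnector {
  private exportCallbacks: ExportCallback[] = [];

  registerCallbackForExport(callback: ExportCallback): void {
    this.exportCallbacks.push(callback);
  }

  async export(objects: unknown[]): Promise<string> {
    const exported = JSON.stringify(objects); // stand-in for JSON-LD serialization
    // Notify every registered observer, e.g. a data-capture handler.
    for (const cb of this.exportCallbacks) cb(exported);
    return exported;
  }
}
```

This keeps the connector entirely unaware of the data-capture feature: the capture package only ever sees the exported string.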

You can also export a pre-configured Connector class from this package, so your clients can just import it without configuring it:

```ts
import { Connector } from "@datafoodconsortium/connector-data-capture";

const connector = new Connector();

connector.export(...); // this will trigger the data-capture handler
```

@jgaehring @RaggedStaff