atomicdata-dev / atomic-data-docs

Atomic Data is a specification to make it easier to exchange data.
https://docs.atomicdata.dev

Towards Atomizer: tooling for producing atomic data - convert / import / extract data #89

joepio opened this issue 2 years ago

joepio commented 2 years ago

If we want Atomic Data to be successful, we need easy, accessible methods for people to convert their existing data to Atomic Data.

I call this project: Atomizer.

We need to consider a couple of things:

Use cases and examples

Process

What needs to happen to convert a piece of data to Atomic Data?

Map the properties to Properties and convert the values

This is probably the hardest, most expensive step when dealing with arbitrary new data. The converter (person doing the atomization) will need to either find or create a Property.

We can choose to automatically generate Properties for unknown fields, or we can ask the user for input.
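
As a rough sketch of that decision point, in Rust (all names here are hypothetical - this is not an existing atomic_lib API):

```rust
/// Hypothetical outcome of mapping one input field to an Atomic Property.
enum PropertyMapping {
    /// An existing Property was found (e.g. on the server or in a shared registry).
    Existing(String), // the Property's URL
    /// No match was found, so a new Property is generated for this import.
    Generated(String),
}

/// Sketch: map a raw field name to a Property URL.
fn map_field(field_name: &str, known: &[(&str, &str)]) -> PropertyMapping {
    // 1. Look for an existing Property whose shortname matches the field name.
    if let Some((_, url)) = known.iter().find(|(name, _)| *name == field_name) {
        return PropertyMapping::Existing(url.to_string());
    }
    // 2. Fall back to generating one; an interactive importer would instead
    //    ask the user to pick or create a Property at this point.
    PropertyMapping::Generated(format!("https://example.com/properties/{field_name}"))
}
```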

Prevent duplicate imports

Create URL for resource

We first need to know the origin and path for the URL.

Various strategies exist:
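
For illustration, a sketch of one plausible strategy - deriving the path from a human-readable name (a content hash or a random UUID would be alternatives); this is my example, not necessarily one of the strategies meant here:

```rust
/// Sketch: derive a resource URL from the server origin and a readable name.
/// Readable and stable, but can collide; a content hash would be deterministic
/// across re-imports, and a random UUID would avoid collisions entirely.
fn slug_url(server_origin: &str, name: &str) -> String {
    let slug: String = name
        .to_lowercase()
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { '-' })
        .collect();
    format!("{server_origin}/{slug}")
}
```

For example, `slug_url("https://myserver.example", "My Bookmarks")` would yield `https://myserver.example/my-bookmarks`.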

Set a Parent / define the hierarchy

Parents help to give structure to data, and to set authorization / ownership. They are optional, but highly recommended. I think the Atomizer should set Parents for all created resources, so the data is owned by something. As a fallback, we could always create an import resource which is the parent of all the imported resources, and create an imports collection which is the parent of all imports. But in many cases, we should be able to find more intuitive hierarchies.

Many common existing data formats are nested, such as XML documents. In these, we often know what the parent is. But we still need some parent above each imported instance.

In the case of importing folders, it might make sense to create Folder resources and set these as parents to the resources inside.
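
A rough JSON-AD sketch of that folder idea (the instance URLs are made up; parent and name are real Atomic Data Properties):

```json
{
  "@id": "https://myserver.example/imports/photos/cat.jpg",
  "https://atomicdata.dev/properties/name": "cat.jpg",
  "https://atomicdata.dev/properties/parent": "https://myserver.example/imports/photos"
}
```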

Host the data somewhere on the web

This is something that Atomic-Server makes easy, but would otherwise be kind of a hassle.

Architecture

Some considerations:

Approaches:

atomizer Rust libraries for importers, powered by atomic_lib

I'll create a new atomizer repo, which has a CLI binary (atomizer-cli) and a whole lot of importers as library modules (e.g. atomizer-bookmarks, atomizer-vcard, etc.). Since all the importers are written as re-usable modules, we can later re-use them somewhere else. I expect to use some of these in a WASM runtime (see issue), or possibly in an atomic-server context to provide a drag-and-drop UI.
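
A minimal sketch of the shared shape such importer crates could have (the trait and type names are my assumptions, not existing atomizer code):

```rust
/// Sketch of a resource-to-be: property/value pairs, before the resource
/// gets a URL and a parent on the server.
pub struct DraftResource {
    pub property_values: Vec<(String, String)>, // (Property URL, serialized value)
}

/// Hypothetical interface each importer crate (atomizer-bookmarks,
/// atomizer-vcard, ...) would implement, so the CLI, a WASM runtime, and
/// atomic-server could all reuse the same importers.
pub trait Importer {
    /// MIME types (or file extensions) this importer understands.
    fn supported_types(&self) -> &'static [&'static str];
    /// Convert raw input into draft resources.
    fn import(&self, input: &[u8]) -> Result<Vec<DraftResource>, String>;
}
```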

atomizer JS libraries, powered by @tomic/lib

Similar to the approach mentioned above, but written in JS (probably TypeScript): an atomizer repo with a lot of libraries.

jonassmedegaard commented 2 years ago

Seems you stopped the above post in the middle of a senten

jonassmedegaard commented 2 years ago

I think this can be separated into two distinct tasks:

Definitions involve deeper knowledge about the data being modelled, often require knowledge about related fields of knowledge for the created model to be most interoperable, and often involve programming - e.g. to express the model as an ontology (ideally reusable for RDF outside of Atomic Data as well) and to code an extension library for atomizer.

Using involves knowledge of the concrete dataset, to sensibly classify especially vaguely hinted data (is "Elvis" a musician or a text editor, or perhaps a personal teddy bear, in the context of the dataset currently being classified?).

jonassmedegaard commented 2 years ago

For the design of "atoms" (i.e. defining atomic data classes), I recommend reading the GRDDL Primer for inspiration.

jonassmedegaard commented 2 years ago

For "splitting into atoms" (i.e. using atomic data classes), I very broadly expect these "modes" of use:

¹A "sidecar" is a hint file refining how to classify this kind of dataset. See darktable sidecars for inspiration of the concept of sidecars (but instead of tied only to one file, atomizer sidecars can be tied to the whole dataset or a subset (e.g. for a dataset consisting of a filesystem directory, a subset could be a subdir or a tarball, or for a dataset consisting of a website, a subset could be an upper path or a page). For the format of a sidecar, see RDF-EASE for a syntax tailored for web and XML content reusing CSS for semantic hinting. Possibly a more modern and popular approach for a sidecar format might involve SHACL (but that's just a stray idea, not thought through yet!)

jonassmedegaard commented 2 years ago

Ideas for types of atoms:

jonassmedegaard commented 2 years ago

For coding the atomizer engine, I recommend doing it as a library usable both as a standalone tool and directly integrated with Atomic Server, and abstracting away all knowledge of specific "atoms" into topic-specific libraries, each also usable either standalone or integrated with atomizer.

By having the general library/tool load multiple *-atomizer helper libraries and extending the recursion depth, relations across objects might be detected - e.g. image files with embedded EXIF hints and addresses (vCard data) with matching identifiers.
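
Sketched in Rust, such an engine might look roughly like this (all names hypothetical):

```rust
/// A single semantic finding, e.g. an EXIF field or a vCard address.
pub struct Finding;

/// Hypothetical interface for topic-specific helper libraries.
pub trait Atomizer {
    /// Return findings, plus any embedded objects worth recursing into.
    fn scan(&self, data: &[u8]) -> (Vec<Finding>, Vec<Vec<u8>>);
}

/// The general engine: loads many helpers and recurses to a bounded depth.
pub struct Engine {
    helpers: Vec<Box<dyn Atomizer>>,
    max_depth: u32,
}

impl Engine {
    pub fn run(&self, data: &[u8], depth: u32) -> Vec<Finding> {
        let mut findings = Vec::new();
        if depth > self.max_depth {
            return findings;
        }
        for helper in &self.helpers {
            let (found, embedded) = helper.scan(data);
            findings.extend(found);
            // Recursion is what lets e.g. an EXIF helper fire on an image
            // that another helper found embedded inside a PDF.
            for inner in embedded {
                findings.extend(self.run(&inner, depth + 1));
            }
        }
        findings
    }
}
```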

jonassmedegaard commented 2 years ago

The process of "scanning" for data can (commonly, maybe always?) be divided into several steps:

  1. select
    • for simple uses without recursion, selection equals input; but when recursing through multiple objects, which to examine and which to skip entirely needs to be decided by rules - e.g. include/exclude regexes or globs, include/exclude MIME types, or conneg expressions
  2. decode
    • might be as simple as UTF-8 -> strings, but could be a legacy string encoding (maybe only for some objects - see recursion notes for select above), or could involve legacy double-encoded strings-in-strings (e.g. ISO 8859-1 data in otherwise UTF-8 data)
    • some data contains interleaved strings and binary parts - e.g. Postscript can contain comments as strings but also binary parts unparseable as UTF-8. Depending on the goal, those binary parts are either irrelevant (e.g. when scanning rights information), fatal (when scanning the images themselves), or important (when scanning rights information in EXIF data embedded in images embedded in encapsulated Postscript)
    • non-decodable characters might be skipped, replaced (e.g. transliterated when decoding as ASCII), or treated as a failure to parse the whole object
  3. parse
    • traverse according to rules of data format, noting both vague and strong hints of semantic describable data
  4. qualify
    • assess the collected hints and conclude what to "keep" based on format-specific qualifier rules
      • example: parsed emails may strongly qualify lots of addresses, but to exclude spam, only senders from trusted domains - and recipients of emails sent from a trusted domain or by an otherwise trusted sender - may qualify as contacts
      • example: copyright holders with the same name but different or missing email addresses may be treated as separate agents under a strict rule, but as the same agent under a loose rule - where a simple format-agnostic "threshold" level translates into applying either the loose or the strict rule
  5. structure
    • map onto output model
    • maybe normalize (e.g. specific order of semantically unordered items)
  6. serialize

NB! One object may contain other objects, not only by embedding (which can be handled with a variable "recursion" level), but also through multiple interleaved encodings - i.e. each object potentially needs to be decoded + parsed more than once. An SVG file may contain unstructured comments about rights, structured RDF data about rights, and strings within rendered SVG data about rights. All three might resolve to the same semantic information, which should then (except at the utmost strict threshold) be merged into one set of semantic data; but if they are ambiguous or conflicting, several sets of rules may offer varying conflict resolution.
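
As hypothetical function signatures, the six steps (plus the rules they share) might compose like this:

```rust
struct Rules;       // include/exclude globs, thresholds, qualifier rules, ...
struct Hint;        // a vague or strong semantic hint noted while parsing
struct Fact;        // a hint that survived qualification
struct OutputModel; // the structured, normalized result

fn select(objects: Vec<Vec<u8>>, _rules: &Rules) -> Vec<Vec<u8>> { objects }
fn decode(raw: &[u8]) -> Option<String> {
    // a real decoder may try legacy encodings, unpick double-encoded
    // strings-in-strings, or split strings from binary chunks
    String::from_utf8(raw.to_vec()).ok()
}
fn parse(_text: &str) -> Vec<Hint> { Vec::new() }   // format-specific traversal
fn qualify(_hints: Vec<Hint>, _rules: &Rules) -> Vec<Fact> { Vec::new() }
fn structure(_facts: Vec<Fact>) -> OutputModel { OutputModel }
fn serialize(_model: &OutputModel) -> String { String::new() }
```

Per the NB above, a single object may need to flow through decode and parse more than once - once per interleaved encoding.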

AlexMikhalev commented 2 years ago

The Notion API has a concept of a block (https://developers.notion.com/reference/block), common to both pages and databases. Notion.so is also trying to collect different types of data into their repository. I think the model could be something similar to a block, with an object type ("object": "block", or contact, or bookmark) and hierarchy support ("children" containing blocks of different types). I am not sure how to capture the source repository (i.e. Twitter / Google contacts); there are "collections" in the Notion API which may serve the same purpose.
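
A rough Rust rendering of that block idea (my own sketch of the shape, not the actual Notion API):

```rust
/// Sketch of a Notion-style block model: one generic wrapper with a type
/// tag and children of arbitrary types.
enum BlockType {
    Page,
    Contact,
    Bookmark,
    Collection, // could stand in for the source repository (Twitter, Google contacts, ...)
}

struct Block {
    object: &'static str,  // always "block", mirroring Notion's "object": "block"
    block_type: BlockType,
    children: Vec<Block>,  // hierarchy support, like Notion's "children" array
}
```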

jonassmedegaard commented 2 years ago

To me, the Notion list of objects looks useful as classes for detected semantic objects.

I don't see, however, how Notion is helpful in designing atomizer processes.

I think that, as atomizer input, a source repository is a piece of data that can potentially be decoded into multiple pieces of data. As atomizer output, a source repository is multiple semantic objects, depending on what is being scanned for (either by choice or by the limits of the scanners implemented):

Similarly, a PDF file is as atomizer input one thing, but as atomizer output it is potentially multiple things:

In my work on streamlining Licensecheck I found the notion of "intermediaries" quite helpful. The problem in Licensecheck is that it scans human text where (at least with my capabilities as a programmer) there is no finite grammar, so the result of scanning is not with certainty "copyright statements" and "license statements", but instead what I call "traits" - word compositions that factually exist in the text but only potentially hold the exact meaning that I am looking for.

A simpler example of the same kind of dilemma is scanning for URIs in plain text - if only humans would always include the protocol at the beginning and wrap them all in <...>, it would be a piece of cake to scan for them. Since humans are sloppy (leaving typos aside, even), some corner cases are ambiguous - e.g. whether trailing punctuation belongs to the URI or to the surrounding text.
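
A small Rust illustration of that corner case, using the regex crate:

```rust
use regex::Regex;

fn main() {
    // Naive pattern: take anything URL-ish up to the next whitespace.
    let re = Regex::new(r"https?://\S+").unwrap();
    let text = "See https://docs.atomicdata.dev/. Also neat (https://atomicdata.dev).";
    for m in re.find_iter(text) {
        // The raw match drags sentence punctuation along, e.g.
        // "https://docs.atomicdata.dev/." and "https://atomicdata.dev).".
        // Stripping trailing punctuation is a common heuristic - but only a
        // heuristic, since some URIs legitimately end in ')' or '.'.
        let trimmed = m.as_str().trim_end_matches(|c| matches!(c, '.' | ',' | ')' | ';'));
        println!("{} -> {}", m.as_str(), trimmed);
    }
}
```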

Essentially we want atomizer to resolve information contained in data. What I call "traits" above seems to be the same thing that the proposed DAEK pyramid does differently from the more commonly used DIKW pyramid: making the reasoning step between information and knowledge explicit.

AlexMikhalev commented 2 years ago

Really interesting answer; let me think about it and come back to you after reading the references. The immediate challenge I see in such behaviour for atomizer: how do you check whether the input changed or stayed the same (like an ETag in HTTP)? Converting and extracting data will require making sure duplicates are not imported - bookmarks, contacts, etc. I would approach atomizer differently: start with an ontology - what types of objects atomizer supports, and what the taxonomy between object types is - and then go into entity types. It seems to fit what you mean by "make the reasoning step between information and knowledge explicit"; that step requires a knowledge graph, which is an ontology/taxonomy/dictionary (thesaurus).
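
A minimal sketch of such a check - fingerprint the raw input and skip anything already seen (this only catches byte-identical inputs, analogous to an HTTP ETag comparison; names are hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Remembers fingerprints of inputs that were already imported, so a re-run
/// can skip inputs whose bytes did not change.
struct ImportLog {
    seen: HashSet<u64>,
}

impl ImportLog {
    fn should_import(&mut self, input: &[u8]) -> bool {
        let mut hasher = DefaultHasher::new();
        input.hash(&mut hasher);
        // insert() returns false if this fingerprint was already present.
        self.seen.insert(hasher.finish())
    }
}
```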

jonassmedegaard commented 2 years ago

By "a reasoning step" I don't mean formal semantics, and I think that is overkill here: What I mean is more casually some kind of internal qualifier rules (as described earlier).

I see now that my use of those pyramid models to describe how to get from atomizer input to atomizer output may be confusing: I take inspiration from formalized models of a larger world, but apply it to the internalized world of what happens inside atomizer - which I envision can be internalized even further, into specialized libraries for each output model. I think it is not necessary to design a formal language for how those internal processes behave.

...but, as I introduced my previous post, I think it might make sense to define formal language for the output models.

jonassmedegaard commented 2 years ago

The reason I want atomizer modularized yet not further formalized internally is that I expect that to allow covering a larger set of output models relatively quickly - and it allows for implementing competing libraries covering the same output models. E.g. I could write a dirty, crappy library to detect legal rights in source code, and you, @AlexMikhalev, could write a competing one applying formal reasoning and AI knowledge and whatever.

Or - as loosely discussed in a video meeting recently - @joepio might write some competing libraries that were more secure (executing within a sandboxed environment) but more difficult to compile or execute (involving bleeding-edge Rust code compiled into WASM binaries).

AlexMikhalev commented 2 years ago

I think one of the most important requirements for me is to prevent the double import of bookmarks and contacts. Also, I found https://github.com/vaimee/dasi-breaker - what is the difference, and can we learn from them?

jonassmedegaard commented 2 years ago

Atomizer targets Atomic Data, which is more generic than the Titanium JSON-LD targeted by DASI Breaker - i.e. this project is far more generic, as it targets all atoms, not only the relatively rare titanium.

Thanks for sharing DASI Breaker - seems they are also in an initial speculation phase with no concrete code yet. From a (far too) quick view it seems to be network-based - and I fear that it will be heavy (I have seen too many EU-sponsored projects be too huge and inaccessible for my liking). I will certainly have a closer look, and if nothing else it might help clarify how this project is different.

jonassmedegaard commented 2 years ago

> I think one of the most important requirements for me is to prevent the double import of bookmarks and contacts.

Great point. As an ideal I agree, but I don't expect it is possible to ensure - or more accurately, if we ensure it, we limit ourselves from processing certain types of sources.

I think that atomizer should not try to detect duplicates across scans at all, because that would require access to existing knowledge: it would change a simple "data in -> information out" flow with a write-only connection to an Atomic Server into a "data + pre-existing knowledge in -> new knowledge out" flow, which (apart from the complexity bloat within atomizer itself) would require a two-way dialogue with the Atomic Server backend over a read-write connection.

Instead, atomizer should try to collect relevant data points that allow consumers of the atomizer output (notably Atomic Server) to recognize duplicates and evaluate how to deal with them (e.g. through semantic reasoning).

Examples:

jonassmedegaard commented 2 years ago

> Also, I found https://github.com/vaimee/dasi-breaker

I was mistaken that DASI Breaker lacks code: they have a minimally viable product available in git submodules.

I guessed correctly that DASI Breaker is implemented as a service, and a relatively heavy one at that. But regardless, it is quite an interesting and inspiring case!

> what is the difference, and can we learn from them?

The difference is that DASI Breaker is multiple storage-backed and networked agents exchanging a multitude¹ of information, where atomizer is a single information exchange: one² constrained³ agent producing output consumable by an Atomic Server.

Interesting, I think, to try turning it around: what would it look like if the developers of DASI Breaker had prioritized resource constraints and built their framework around Atomic Server and Atomizer? How many of their intermediate agents might then be sensibly absorbed into either of those, and which would still remain?

I guess @joepio would excitedly want to implement SEPA into Atomic Server, and I would try to convince him that it should be a separate (or at least separable) component - with the reasoning that such a service might run under different agency (e.g. as a public service, where Atomic Server might be behind a firewall) and also might not need the same storage resources as a full-blown Atomic Server (where "full-blown" is noticeable within our measure of constraints, but dismissible for someone operating multiple dockerized Java virtual machines with PostGIS, Redis, and MySQL backends).

¹ Some DASI Breaker information exchanges use SPARQL, some SQL, and some NGSI-LD (which seems derived from but incompatible with JSON-LD - as a competing implementation describes it: "an extended subset of JSON-LD"). For comparison, Atomizer exchanges JSON-AD, which is a subset of JSON-LD.

² I encourage organizing atomizer code as multiple semi-reusable libraries, but that is an implementation detail.

³ I want it to be possible to compile atomizer as a host-integrated binary executed as a unix-style command-line shell tool; @joepio wants it to be possible to build atomizer as a WASM binary executed in a JavaScript sandbox.
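
Both constraints can plausibly live in one Rust codebase; a rough sketch (the wasm-bindgen usage is illustrative, and the file name is a placeholder):

```rust
/// Shared core, compiled into both the CLI and the WASM build.
pub fn atomize(input: &[u8]) -> String {
    // ...actual conversion to JSON-AD would happen here...
    String::new()
}

// Native build: a unix-style CLI wrapper (the bin target's main.rs).
#[cfg(not(target_arch = "wasm32"))]
fn main() {
    let input = std::fs::read("input.vcf").expect("failed to read input");
    println!("{}", atomize(&input));
}

// WASM build: export the same core to the JavaScript sandbox.
#[cfg(target_arch = "wasm32")]
use wasm_bindgen::prelude::*;

#[cfg(target_arch = "wasm32")]
#[wasm_bindgen]
pub fn atomize_js(input: &[u8]) -> String {
    atomize(input)
}
```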

jonassmedegaard commented 2 years ago

Related project on curating interoperable shapes: https://shaperepo.com/

jonassmedegaard commented 2 years ago

Related collection of foo-to-RDF tools: https://www.w3.org/wiki/ConverterToRdf

joepio commented 1 year ago

We now have a functioning Importer class and a new JSON-AD publishing spec, which definitely helps realize the rest of the ambitions.