SemanticMediaWiki / SemanticMediaWiki

🔗 Semantic MediaWiki turns MediaWiki into a knowledge management platform with query and export capabilities
https://www.semantic-mediawiki.org

Embed structured data using the `smw/data` namespace + `#data` parser function #4044

Open mwjames opened 5 years ago

mwjames commented 5 years ago

The idea is to create a new smw/data namespace (owned by SMW) that can host different structured formats such as JSON, XML, or CSV. The data is stored in its native format and can be embedded using a newly introduced #data parser function.

The parser function makes the data (all or only selected fields) available in pages that embed #data and at the same time provides an annotation mapping to the referenced data fields.

A data page itself will not hold any active annotation reference; only the embedded content will create annotations, thereby making them available to the storage and search backend.

Objective

Example

smw/data:Berlin.json

{
    "population": "3520061",
    "area": "891.85 km²",
    "coordinates": "52° 31' 0.00\" N, 13° 24 ' 0.00\" E",
    "viaf": 154702072
}

Berlin

{{#data: Berlin.json|?population}}
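
To make the Berlin example more concrete, here is a minimal Python sketch of what "stored in its native format, with selected fields made available" might look like. The function names `load_data_page` and `select_fields`, the dispatch on the title's extension, and the `Cities.csv` page are assumptions made for illustration only; nothing in this proposal fixes how #data would be implemented.

```python
import csv
import io
import json


def load_data_page(title, text):
    """Parse a data page's raw text according to the extension in its title (hypothetical helper)."""
    suffix = title.rsplit(".", 1)[-1].lower()
    if suffix == "json":
        return json.loads(text)
    if suffix == "csv":
        # Each CSV row becomes a dict keyed by the header line.
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported data format: {suffix}")


def select_fields(record, fields):
    """Return only the requested fields, mirroring a '?population'-style selection."""
    return {name: record[name] for name in fields}


# Content of smw/data:Berlin.json, shortened from the example above.
berlin = load_data_page(
    "Berlin.json",
    '{"population": "3520061", "area": "891.85 km²", "viaf": 154702072}',
)
selected = select_fields(berlin, ["population"])

# A hypothetical CSV data page, to show that the same entry point could serve several formats.
rows = load_data_page("Cities.csv", "name,population\nBerlin,3520061\n")
```

Here `selected` holds only the `population` field, which is the part an embedding page such as Berlin would see.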

jaideraf commented 5 years ago

I like the idea. Just to understand, would {{#data: Berlin.json|?population}} be equivalent to {{#set:Population=3520061}} on the Berlin page? Or equivalent to [[Population::3520061]]? Would this feature be similar to the External Data extension?

kghbln commented 5 years ago

The Rule namespace could be reused for this.

mwjames commented 5 years ago

Just to understand, would {{#data: Berlin.json|?population}} be equivalent to {{#set:Population=3520061}} on the Berlin page?

The specifics are up for discussion.

Would this feature be similar to the External Data (https://www.mediawiki.org/wiki/Extension:External_Data) extension?

The problem with that extension (as far as I recall) is that it retrieves data from an external service, and once the service goes away the data is unavailable. Also, you cannot track or diff any changes to the imported data.

The objective is to own the data and not just retrieve it. It is something like a datasheet that contains the raw data, where #data is responsible for disseminating the data to content that requires it.

Examples

[0] contains some data sets we would handle via the smw/data NS so that, instead of wikitext containing a large number of #set annotations, the data [1] is embedded (aka locally imported, disseminated).

[0] https://sandbox.semantic-mediawiki.org/wiki/Cat%C3%A9gorie:NFL:Data
[1] https://sandbox.semantic-mediawiki.org/wiki/MediaWiki:Issue4044/Data1.json

mwjames commented 5 years ago

Here is another example of "owning" the data: importing city information such as [0] should be straightforward and independent of the wikitext syntax.

Now, if you create a page called smw/data:geographer.data.cities.ae.json and import [0] then you could use the #data parser function on a page like Dubai to declare and import data using the JSON path expression [1]:

[[Has population::{{#data: geographer.data.cities.ae.json |field= 292223.population }}]]
[[Has latitude ::{{#data: geographer.data.cities.ae.json |field= 292223.lat }}]]
[[Has longitude::{{#data: geographer.data.cities.ae.json |field= 292223.lng }}]]

{
    "291074": {
        "ids": {
            "geonames": 291074
        },
        "long": {
            "default": "Ras al-Khaimah"
        },
        "parent": 291075,
        "lat": 25.78953,
        "lng": 55.9432,
        "country": "AE",
        "population": 115949
    },
    "292223": {
        "ids": {
            "geonames": 292223
        },
        "long": {
            "default": "Dubai"
        },
        "parent": 292224,
        "lat": 25.0657,
        "lng": 55.17128,
        "country": "AE",
        "population": 1137347
    }
}

[0] https://github.com/MenaraSolutions/geographer-data/blob/master/resources/cities/AE.json
[1] https://github.com/json-path/JsonPath
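
As a rough illustration of how a field expression such as `292223.population` could be resolved against the data page, here is a minimal Python sketch. It implements only plain dot-separated key lookup, not the full JSONPath syntax of [1], and the helper name `resolve_field` is an assumption made for this example.

```python
import json

# Excerpt of the AE.json data quoted above (two entries, some fields omitted).
CITIES_JSON = """
{
    "291074": {"long": {"default": "Ras al-Khaimah"}, "lat": 25.78953, "lng": 55.9432, "population": 115949},
    "292223": {"long": {"default": "Dubai"}, "lat": 25.0657, "lng": 55.17128, "population": 1137347}
}
"""


def resolve_field(data, path):
    """Walk a dot-separated path such as '292223.population' through nested dicts and lists."""
    node = data
    for key in path.strip().split("."):
        if isinstance(node, list):
            node = node[int(key)]  # numeric segments index into lists
        else:
            node = node[key]
    return node


cities = json.loads(CITIES_JSON)
population = resolve_field(cities, "292223.population")
name = resolve_field(cities, "292223.long.default")
```

A missing key raises `KeyError` in this sketch; how #data should report an unknown field is left open in the discussion.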

mwjames commented 5 years ago

Just to understand, would {{#data: Berlin.json|?population}} be equivalent to {{#set:Population=3520061}} on the Berlin page? The specifics are up for discussion.

The import statement could also be simplified in the following ways:

Finding, mapping, and setting annotations related to a subject that embeds the #data

{{#data: geographer.data.cities.ae.json
 |Has population=292223.population
 |Has latitude=292223.lat
 |Has longitude=292223.lng
}}

Query a data field (using ? as indicator)

[[Has population::{{#data: geographer.data.cities.ae.json |?292223.population }}]]
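
The two forms above could be told apart when the parameters are parsed. The following Python sketch assumes that `Property=path` parameters become annotations on the embedding subject and that `?path` parameters are plain field queries; the helper names `parse_data_call` and `lookup` are hypothetical, and the actual semantics are still up for discussion.

```python
def parse_data_call(params):
    """Split #data-style parameters into annotation mappings and plain field queries."""
    mappings = {}
    queries = []
    for raw in params:
        param = raw.strip()
        if param.startswith("?"):
            # '?292223.population' only returns the value.
            queries.append(param[1:].strip())
        elif "=" in param:
            # 'Has population=292223.population' maps a property to a field path.
            prop, path = param.split("=", 1)
            mappings[prop.strip()] = path.strip()
    return mappings, queries


def lookup(data, path):
    """Resolve a dot-separated path in nested dicts."""
    node = data
    for key in path.split("."):
        node = node[key]
    return node


data = {"292223": {"lat": 25.0657, "lng": 55.17128, "population": 1137347}}
mappings, queries = parse_data_call([
    "Has population=292223.population",
    "Has latitude=292223.lat",
    "Has longitude=292223.lng",
])
annotations = {prop: lookup(data, path) for prop, path in mappings.items()}
```

In this sketch `annotations` maps each property name to the value found in the data page, which is what the embedding page (for example Dubai) would store.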

jaideraf commented 5 years ago

The Rule namespace could be reused for this.

I also have a concern about namespace creation. In my experience with MW, creating several namespaces is like a "Pandora's box". I would recommend reusing an SMW namespace, if possible. "smw/schema" is already odd in terms of naming convention (initial lowercase, use of "/" in the name). An additional "smw/data" namespace raises an alarm in my head... (sorry about not being totally rational here).

But I love the whole idea. :+1:

EDIT: It would be the third namespace created within a few releases.

mwjames commented 5 years ago

I would recommend reusing an SMW namespace, if possible. "smw/schema" is already odd in terms of naming convention (initial lowercase, use of "/" in the name). An additional "smw/data"

Which one? There is no SMW namespace. smw/schema as an NS is assigned to a specific content type that comes with its own checks and validations and should not be "misused" for transactional data.

An additional "smw/data" namespace raises an alarm in my head... (sorry about not being totally rational here).

smw/data is about transactional data.

if possible. "smw/schema" is already odd in terms of naming convention

Well, the NS name "Schema" was already occupied by some other extension, and SMW/schema (and variations) was aesthetically a no-go, hence it became smw/schema, and I would keep that convention for smw/data as well. Adding smw/... in front ensures we are not bumping into any NS conflicts with other extensions that may or may not be deployed by a user.

mwjames commented 5 years ago

Adding smw/... in front ensures we are not bumping into any NS conflicts with other extensions that may or may not be deployed by a user.

While experimenting with NS names I tried smw:schema as well but that would not work with MediaWiki so I settled for smw/....

mwjames commented 5 years ago

Furthermore, in order to create a content type that controls the type of data you expect, you need a separate NS; otherwise, creating something like Foo.json in NS_MAIN would select the standard JSON content type/handler, which is not suitable for what this issue is trying to address. The new NS ensures that smw/data:Foo.json uses the SMW-specific content type/handler and is not "owned" by MediaWiki.

jaideraf commented 5 years ago

Furthermore, in order to create a content type that controls the type of data you expect, you need a separate NS; otherwise, creating something like Foo.json in NS_MAIN would select the standard JSON content type/handler, which is not suitable for what this issue is trying to address.

Yes. But the content type can also be assigned manually, via Special:ChangeContentModel, right? We do not need to reserve an entire namespace to hold just a few pages (I presume). A lot of SMW users will probably not use all these wonderful features but will have to deal with three or more empty namespaces (Rule, smw/schema, smw/data).

Wishful thinking: One SMW namespace to hold all this new stuff (rules, schemata, data...)

The organizing name conventions would be by prefixes and subpages. For example:

I18n and L10n of prefixes could be a problem, I know.

Thinking about that, Special:SemanticMediaWiki could be transformed into a dashboard summarizing all SMW-related configuration and information about the system. From there, we could get data about Properties (total, most used, issues, etc.), Concepts (total, most used, issues, etc.), Queries (total, issues, etc.), Datatypes (most used, etc.), Imported vocabularies, Rules, Schemata, Data pages (from this issue), etc., and the maintenance tasks. (I know it is not the right spot to talk about Special:SemanticMediaWiki, but I don't want to forget the idea.)

jaideraf commented 5 years ago

We would need to set $wgNamespacesWithSubpages[NS_SMW] = true; in order to manage subpages in SMW namespace.

D-Groenewegen commented 5 years ago

This is a fantastic and promising proposal and I feel sorry for not having responded with appropriate enthusiasm earlier. There are many different situations in which you might prefer standard formats like JSON/XML/CSV over wikitext, and I've certainly met some of them.

About the smw/data namespace. I'm fine with the suggested naming and appreciate that 'owning' a namespace should help reduce reliance on MediaWiki and whatever its developers do, or will not do, in the future.

What about JSON/XML/CSV pages that serve other purposes in addition to having their data extracted by SMW? What if we are already using JSON and XML pages in a wiki (as I am) and we want to apply Semantic MediaWiki's #data parser? First we'd need to copy them over to the SMW namespace, okay, but would we not risk losing functionality? If we have custom ways of representing, say, an XML page - for instance, I've started to use CETEIcean (https://github.com/TEIC/CETEIcean) for doing stuff with documents encoded in TEI XML - would we still be able to call pages from the smw/data namespace? Or do we need to keep duplicate pages in different namespaces? Maybe that's just the way it is if that helps keep the proposed scope in check, but I'm curious if anything can be said at this stage. (I realise that more examples might help.)

More to the point. What syntax and kinds of expressions should the data parser support? JSONPath for JSON and XPath or even XQuery for XML? Or do subsets first?
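
To illustrate the "subsets first" option for XML, here is a small Python sketch. The standard library's `xml.etree.ElementTree` supports only a limited XPath subset, and that subset is already enough to pick out a single field. The XML document is a hypothetical rendering of the city data from earlier in this thread, not a format proposed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML version of the city data used in the JSON example.
CITIES_XML = """
<cities>
    <city id="291074"><name>Ras al-Khaimah</name><population>115949</population></city>
    <city id="292223"><name>Dubai</name><population>1137347</population></city>
</cities>
""".strip()

root = ET.fromstring(CITIES_XML)

# ElementTree's XPath subset covers child steps and attribute predicates.
population = root.findtext("./city[@id='292223']/population")
name = root.findtext("./city[@id='292223']/name")
```

Note that values come back as strings here (`population` is text, not a number), so a #data implementation would still need to decide where type conversion happens.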

JeroenDeDauw commented 5 years ago

This is related to the idea of having dedicated pages for storing data using a structured format with a visual editing interface on top. We talked about this some years ago but I cannot find any of that.

Description

More concretely, the idea was to have a new namespace added by SMW. On pages in this NS you'd be able to set property-value pairs, subobjects, etc. via a visual interface. This would be stored as JSON in the page and in SMW itself, just as it would have been if defined via wikitext. Overall a similar setup to Wikibase items.

The same could be applied to existing SMW namespaces such as Property. Wikitext would remain supported, though one probably could not mix wikitext and JSON. (Both on one page could work, though putting wikitext in the JSON probably would not.)

Relation to current proposal

While the goal of the old proposal was mainly UX and a foundation for new capabilities and the new proposal seems to be about importing data and some UX component, their approaches are similar. Both introduce a new namespace to hold data in a standard structured format.

It would be worthwhile to try to achieve all these goals with one approach, since no one wants two slightly different data namespaces added by SMW. I'm all for incremental development and delivery, so I am not saying everything should be done at once. I am suggesting we do things in such a way that all the goals can be met later.

See also

The recent-ish-ly added GeoJson namespace in Maps was created for similar reasons. It allows easily importing data in the standard format, it separates the data from the display and makes it possible to create a fully visual editing experience for the data. At the moment there is no such visual editor yet, though it is easy to see how that would work.

https://www.semantic-mediawiki.org/wiki/Extension:Maps/GeoJSON#GeoJson_pages
https://github.com/JeroenDeDauw/Maps/issues/447

alex-mashin commented 3 years ago

This can already be done by storing XML/JSON, wrapped in a template or Lua invocation that parses the data into subobjects, in some non-main namespace such as Project:, and then querying the subobjects from the main namespace.
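
The transformation described here, from a JSON record to subobjects, can be sketched outside the wiki as follows. In the wiki this would be done by the template or Lua module itself; the Python below only shows the mapping from records to `{{#subobject:}}` calls. The helper name `to_subobject_wikitext` and the field-to-property map are assumptions for illustration.

```python
import json


def to_subobject_wikitext(entry_id, record, property_map):
    """Render one JSON record as a {{#subobject:}} call using a field-to-property map."""
    lines = ["{{#subobject:" + entry_id]
    for field, prop in property_map.items():
        if field in record:
            lines.append(f" |{prop}={record[field]}")
    lines.append("}}")
    return "\n".join(lines)


raw = '{"292223": {"lat": 25.0657, "lng": 55.17128, "population": 1137347}}'
property_map = {
    "population": "Has population",
    "lat": "Has latitude",
    "lng": "Has longitude",
}
wikitext = "\n".join(
    to_subobject_wikitext(entry_id, record, property_map)
    for entry_id, record in json.loads(raw).items()
)
```

Each top-level key of the JSON becomes the subobject identifier, so the resulting subobjects can be queried from other pages.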

krabina commented 1 year ago

With the availability of slots this should be rethought. However, the general idea is still great: store structured data in the wiki and access it with a parser function.