Create a sample of user assertions about content.

monkeypants commented 9 years ago

There are two kinds of assertions:

about the page (PageAssertion)
about chunks of information in the page (ElementAssertion)

I think PageAssertions would be sufficient for the first iteration, although ElementAssertions are potentially really interesting from an information extraction perspective.

In the RDF convention for PageAssertions, the pattern would be triples of {subject, predicate, object}. In OrientDB, would we have Subjects and Objects as subclasses of Vertex, and Predicates as subclasses of Edge?

If these are the Subjects and Objects:

Page
Service
Department
Domain

These predicates could be asserted automatically:

(Page) has_menu_link_to (Page)
(Page) is_published_at (Domain)

These predicates could be asserted by people:

(Page) is_entry_point_page_for (Service)
(Page) pertains_to_the_topic_of (Service)
(Domain) is_managed_by (Department)

Then we could establish (by inference):

(Page) is_published_by (Department)
(Department) publishes_entry_point_page_for (Service)
(Department) publishes_content_pertinent_to (Service)
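
As a rough sketch of how that inference could work, here is the same rule set over plain in-memory triples in Python. It is not tied to OrientDB; the `Triple` type, the `infer` function and the sample names (`page1`, `example.gov.au`, `DeptA`, `ServiceX`) are made up for illustration, and it assumes each page is published at only one domain.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def infer(triples: set) -> set:
    """Derive the three inferred predicates from the asserted ones."""
    # Domain -> Department, taken from (Domain) is_managed_by (Department).
    managed_by = {t.subject: t.obj for t in triples if t.predicate == "is_managed_by"}

    inferred = set()
    page_department = {}
    for t in triples:
        if t.predicate == "is_published_at" and t.obj in managed_by:
            # (Page) is_published_at (Domain) + (Domain) is_managed_by (Department)
            page_department[t.subject] = managed_by[t.obj]
            inferred.add(Triple(t.subject, "is_published_by", managed_by[t.obj]))

    # Page-level predicates that lift to the publishing Department.
    lifted = {
        "is_entry_point_page_for": "publishes_entry_point_page_for",
        "pertains_to_the_topic_of": "publishes_content_pertinent_to",
    }
    for t in triples:
        if t.predicate in lifted and t.subject in page_department:
            inferred.add(Triple(page_department[t.subject], lifted[t.predicate], t.obj))
    return inferred


# Hypothetical sample data, not real pages or departments.
asserted = {
    Triple("page1", "is_published_at", "example.gov.au"),
    Triple("example.gov.au", "is_managed_by", "DeptA"),
    Triple("page1", "is_entry_point_page_for", "ServiceX"),
}
derived = infer(asserted)
```

With the sample data above, `derived` holds `page1 is_published_by DeptA` and `DeptA publishes_entry_point_page_for ServiceX`.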

What else would be useful for the first iteration?

This comment also relates to #10 and #12. Do we need another interface for managing the Department/domain ownership problem? Can we stuff a dataset into data.gov.au and pretend there's a loosely coupled process somewhere keeping it accurate?

Fancy information extraction stuff might make educated guesses about "pertains_to_topic" predicate (for example), so the human-in-the-loop process might evolve into endorsing/refuting automatic guesses. But that's obviously for later, along with making assertions about PageElements.

markmuir87 commented 9 years ago

Agreed on starting with assertions about pages for the first pass and working our way down to discrete 'content chunks' on subsequent iterations. As discussed, this has the benefit of no wasted work, as what's true for the page should be true for the chunks it contains.

Yeah that's how I envisioned some of the semi-automated assertions working (i.e. assisting, rather than replacing, the human tagger). It also provides a natural opportunity for implicit supervision/teaching of a ML algo (which is obviously really 'far into the future' stuff), meaning the 'assistance' will become more accurate over time.

I'm going to have to read up on RDF in more detail. Reading the Wikipedia entry has me a bit confused. When they say 'object', they kind of mean 'property' right? And 'subject' is 'the thing being described'? So '[The sky] [is of the colour] [blue]' is an instance of a subject-predicate-object triple in RDF?

If so, that mapping seems to make sense, though is there any need to have instances of 'things' that are both subjects and objects (i.e. is there a need to describe any 'subjects' that are not instances of 'content', like page, content chunk etc.)?

Would it be accurate to say we'd be creating classical inheritance hierarchies like:

Vertex -> Subject -> Page
Vertex -> Subject -> Element (or would this be a subclass of 'Page'?)
Vertex -> Object -> Domain
Vertex -> Object -> Portfolio
Vertex -> Object -> Service
Edge -> Predicate

Having 'Predicate' as a subclass of 'Edge' might be redundant. If I'm interpreting you correctly, all edges will implicitly and always be predicates?

And so 'page' and 'content chunk' are 'subjects'? Domain, Department, Service are all objects? For example:

pageInstance1 - is_hosted_on - domainInstance2
pageInstance1 - is_published_by - departmentInstance4
etc.

Depending on the domain, we can also make auto-assertions based on url paths. For example, an instance of the classes [Page] - [Predicate] - [Topic] for http://www.business.gov.au/registration-and-licences/Pages/register-for-goods-and-services-tax-(GST).aspx would be:

[register-for-goods-and-services-tax-(GST).aspx] - [pertains_to_the_topic_of] - [registration-and-licences]

which is an instance of the following subclasses triple:

[Vertex -> Subject -> Page] - [Edge -> Predicate] - [Vertex -> Object -> Service]
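
A minimal sketch of that URL-path rule in Python, assuming the first path segment is the topic and the last is the page. That assumption matches the business.gov.au example above but would need checking per domain; `topic_assertion` is a made-up name.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def topic_assertion(url: str) -> Optional[Tuple[str, str, str]]:
    """Guess a (page, predicate, topic) triple from a URL path."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) < 2:
        # No separate topic segment, so make no assertion.
        return None
    # Assumption: first segment names the topic, last segment is the page.
    return (segments[-1], "pertains_to_the_topic_of", segments[0])


url = (
    "http://www.business.gov.au/registration-and-licences/Pages/"
    "register-for-goods-and-services-tax-(GST).aspx"
)
assertion = topic_assertion(url)
```

For the URL above this yields the same triple as written out by hand, and it returns `None` for a bare domain root.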

Though, to be honest, I've never understood the advantages of classical inheritance. It forces you on to a continuum where the extremes represent two bad alternatives: rigid (and opinionated) taxonomies, or violation of the DRY principle.

See: https://medium.com/javascript-scene/the-two-pillars-of-javascript-ee6f3281e7f3
“The problem with object-oriented languages is they’ve got all this implicit environment that they carry around with them. You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.” ~ Joe Armstrong
Also of interest: http://www.well.com/~doctorow/metacrap.htm#2.5 (although some of the problems outlined don't really apply in our case; it's written with a different context in mind)

My preference would be to keep things as flat as possible, though I don't feel qualified enough to hold a strong opinion. What are your views on the relative merits?

monkeypants commented 9 years ago

Thinking about Things as concrete instances of anything is potentially confusing. RDF (and metadata in general) is information about the instance, not the instance itself. So the Subject and Object are references to (web) resources (URIs), that's the R in RDF.

If we want to describe a relationship (predicate) between a Department and Service (for example), then the RDF way would be to reference a page about each. We don't need that (pages), we can have nodes for each in our Graph DB, so the predicate can be an edge, if...

...assuming we don't want to make assertions about individual relationships. For example, this would be OK with the predicate as a type of orientDB edge:

PageX pertains_to_topic_of ServiceY

However if we wanted to describe the relationship itself (predicate), the predicate would need to be a node in its own right. For example, say we wanted a pertinence quantifier:

PageX has TopicPertinenceZ
TopicPertinenceZ is_about ServiceY
TopicPertinenceZ has_strength 0.42

An RDF triple is not a graph, it's a triplestore. We don't plan to use RDF tools, but could create an RDF interface to our graph later if we needed one (RDF is only a text-based rendering of metadata content). For example, if we had only two subclasses of Edge, subject_predicate and predicate_object, then every node could be a subject, object or predicate. We could make any "triple" (metadata association) with a pair of those edges.
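
To make the two-edge idea concrete, here is a toy in-memory sketch in Python (not OrientDB code). The `Graph` class and its methods are invented for illustration; only the edge names subject_predicate and predicate_object and the pertinence example come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # node name -> properties
    edges: list = field(default_factory=list)   # (from, edge_type, to)

    def add_node(self, name, **props):
        self.nodes.setdefault(name, {}).update(props)

    def add_triple(self, subject, predicate_node, obj, **props):
        """Store one triple as a predicate node plus a pair of edges."""
        self.add_node(subject)
        self.add_node(obj)
        # The predicate is a node, so it can carry its own properties.
        self.add_node(predicate_node, **props)
        self.edges.append((subject, "subject_predicate", predicate_node))
        self.edges.append((predicate_node, "predicate_object", obj))

    def triples(self):
        """Rebuild (subject, predicate, object) by joining the edge pairs."""
        found = []
        for s, kind, p in self.edges:
            if kind != "subject_predicate":
                continue
            for p2, kind2, o in self.edges:
                if kind2 == "predicate_object" and p2 == p:
                    found.append((s, p, o))
        return found


g = Graph()
g.add_triple("PageX", "TopicPertinenceZ", "ServiceY",
             kind="pertains_to_topic_of", strength=0.42)
```

Here `g.triples()` gives back the single (PageX, TopicPertinenceZ, ServiceY) association, and the strength of 0.42 lives on the predicate node rather than on an edge.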

We have a few options, it's probably worth a whiteboard session to figure out what's simpler/easier.

- describing relationships between subject and object nodes with a pair of edges and a predicate node.
- having a whole bunch of special predicate Edge types, and expanding them out to predicate nodes etc if/when we need to.
- just reaching for a triplestore if/when we think we need one.

markmuir87 commented 9 years ago

Yeah whiteboarding sounds like a good plan. Heck, we could huddle around a computer and hammer out a few JSON schemas maybe.

I'll have to do a bit more research into OrientDB (I've been saying this a lot recently...), but I thought I read somewhere that ODB allows edges to have properties associated with them (which would be extremely handy) as they're treated as 'first class objects', whatever the heck that means.

And I'm not just talking about direction of relationship, I mean arbitrary properties. I'm totally with you on the 'predicates as edges' thing, it's a very nice fit, and if we can assign arbitrary properties to those edges it would be icing on the cake.
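
Roughly what I have in mind, as a plain Python sketch rather than anything OrientDB-specific: a flat `Edge` record with an arbitrary property bag. The `Edge` class, the `strong_targets` helper, the `asserted_by` property and `ServiceQ` are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Edge:
    source: str
    label: str                      # the predicate, e.g. pertains_to_topic_of
    target: str
    properties: Dict[str, Any] = field(default_factory=dict)


def strong_targets(edges: List[Edge], label: str, threshold: float) -> List[str]:
    """Targets of edges with the given label whose strength meets the threshold."""
    return [
        e.target
        for e in edges
        if e.label == label and e.properties.get("strength", 0.0) >= threshold
    ]


# Hypothetical sample edges.
edges = [
    Edge("PageX", "pertains_to_topic_of", "ServiceY",
         {"strength": 0.42, "asserted_by": "human"}),
    Edge("PageX", "pertains_to_topic_of", "ServiceQ",
         {"strength": 0.9, "asserted_by": "auto"}),
]
relevant = strong_targets(edges, "pertains_to_topic_of", 0.5)
```

With properties on the edge itself, the pertinence strength needs no extra predicate node, and filtering by it (as `strong_targets` does) stays a one-liner.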

My long post just boiled down to: I vote we steer clear of creating ultra-hierarchical taxonomies by using classical object inheritance. The more I think about it, the more convinced I become that it would be a very bad move...

monkeypants commented 9 years ago

When we bang it out, it should be as an OrientDB graph :)

AusDTO / disco_layer

Create a sample of user assertions about content. #15