Open nokout opened 9 years ago
Agreed on starting with assertions about pages for the first pass and working our way down to discrete 'content chunks' on subsequent iterations. As discussed, this has the benefit of no wasted work, as what's true for the page should be true for the chunks it contains.
Yeah, that's how I envisioned some of the semi-automated assertions working (i.e. assisting, rather than replacing, the human tagger). It also provides a natural opportunity for implicit supervision/teaching of an ML algo (which is obviously really 'far into the future' stuff), meaning the 'assistance' will become more accurate over time.
I'm going to have to read up on RDF in more detail. Reading the Wikipedia entry has me a bit confused. When they say 'object', they kind of mean 'property', right? And 'subject' is 'the thing being described'? So '[The sky] [is of the colour] [blue]' is an instance of a subject-predicate-object triple in RDF?
If so, that mapping seems to make sense, though is there any need to have instances of 'things' that are both subjects and objects (i.e. is there a need to describe any 'subjects' that are not instances of 'content', like page, content chunk etc.)?
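To make the question concrete, here is a minimal sketch of the sky example as a plain Python tuple. It is not RDF tooling: real RDF identifies subjects and predicates with URIs, and the `Triple` class and the second `colour` triple are made up for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the thing being described
    predicate: str  # the relationship between subject and object
    object: str     # the value, or the other thing, being related


sky = Triple("the sky", "is of the colour", "blue")

# Hypothetical second triple: the same node ("blue") is an object above
# and a subject here, so one 'thing' can play both roles.
colour = Triple("blue", "is a", "colour")
```

So nothing stops a node from being a subject in one assertion and an object in another, at least in this simplified picture.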
Would it be accurate to say we'd be creating classical inheritance hierarchies like:
Having 'Predicate' as a subclass of 'Edge' might be redundant. If I'm interpreting you correctly, all edges will implicitly and always be predicates?
And so 'page' and 'content chunk' are 'subjects'? Domain, Department, Service are all objects? For example:
Depending on the domain, we can also make auto-assertions based on url paths. For example, an instance of the classes [Page] - [Predicate] - [Topic] for http://www.business.gov.au/registration-and-licences/Pages/register-for-goods-and-services-tax-(GST).aspx would be:
[register-for-goods-and-services-tax-(GST).aspx] - [pertains_to_the_topic_of] - [registration-and-licences]
which is an instance of the following triple of subclasses:
[Vertex -> Subject -> Page] - [Edge -> Predicate] - [Vertex -> Object -> Topic]
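A rough sketch of how that URL-path auto-assertion could work. The function name is hypothetical, and the rule (first path segment is the topic, last segment is the page) is an assumption that fits the business.gov.au example but would need checking for each domain.

```python
from urllib.parse import urlparse


def topic_assertion_from_url(url):
    """Guess a (page, predicate, topic) triple from a URL path.

    Assumes the first path segment names the topic and the last one
    names the page. Returns None when the path is too short to guess.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 2:
        return None
    return (segments[-1], "pertains_to_the_topic_of", segments[0])


url = ("http://www.business.gov.au/registration-and-licences/Pages/"
       "register-for-goods-and-services-tax-(GST).aspx")
triple = topic_assertion_from_url(url)
```

For the URL above, `triple` holds the page name, the predicate and `registration-and-licences`, matching the hand-written assertion.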
Though, to be honest, I've never understood the advantages of classical inheritance. It forces you onto a continuum where the extremes represent two bad alternatives: rigid (and opinionated) taxonomies, or violation of the DRY principle.
My preference would be to keep things as flat as possible, though I don't feel qualified enough to hold a strong opinion. What are your views on the relative merits?
Thinking about Things as concrete instances of anything is potentially confusing. RDF (and metadata in general) is information about the instance, not the instance itself. So the Subject and Object are references to (web) resources (URIs), that's the R in RDF.
If we want to describe a relationship (predicate) between a Department and a Service (for example), then the RDF way would be to reference a page about each. We don't need that (pages); we can have nodes for each in our graph DB, so the predicate can be an edge, if...
...assuming we don't want to make assertions about individual relationships. For example, this would be OK with the predicate as a type of OrientDB edge:
However, if we wanted to describe the relationship itself (predicate), the predicate would need to be a node in its own right. For example, say we wanted a pertinence quantifier:
A set of RDF triples is not a graph DB, it's a triplestore. We don't plan to use RDF tools, but could create an RDF interface to our graph later if we needed one (RDF is only a text-based rendering of metadata content). For example, if we had only two subclasses of Edge, subject_predicate and predicate_object, then every node could be a subject, object or predicate. We could make any "triple" (metadata association) with a pair of those edges.
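A minimal sketch of the two-edge-class idea, using plain Python structures rather than OrientDB. The node ids, the `provides` predicate and the pertinence value are invented for illustration; only the `subject_predicate` and `predicate_object` names come from the comment above.

```python
# Every node is just an id with properties; any node may act as a
# subject, a predicate or an object.
nodes = {
    "dept": {"kind": "Department"},
    "svc": {"kind": "Service"},
    # The predicate is a node in its own right, so it can carry a
    # pertinence quantifier (hypothetical value).
    "p1": {"kind": "provides", "pertinence": 0.8},
}

# Only two edge classes, each stored as (from_node, to_node) pairs.
subject_predicate = [("dept", "p1")]
predicate_object = [("p1", "svc")]


def triples():
    """Rebuild triples by joining the two edge lists on the predicate node."""
    found = []
    for subj, pred in subject_predicate:
        for pred2, obj in predicate_object:
            if pred == pred2:
                found.append((subj, pred, obj))
    return found
```

The trade-off is one extra hop per assertion in exchange for being able to describe the relationship itself.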
We have a few options, it's probably worth a whiteboard session to figure out what's simpler/easier.
Yeah, whiteboarding sounds like a good plan. Heck, we could huddle around a computer and hammer out a few JSON schemas maybe.
I'll have to do a bit more research into OrientDB (I've been saying this a lot recently...), but I thought I read somewhere that ODB allows edges to have properties associated with them (which would be extremely handy) as they're treated as 'first class objects', whatever the heck that means.
And I'm not just talking about direction of relationship, I mean arbitrary properties. I'm totally with you on the 'predicates as edges' thing, it's a very nice fit, and if we can assign arbitrary properties to those edges it would be icing on the cake.
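Here's roughly what I mean by 'predicates as edges' with arbitrary properties, sketched as a plain Python dataclass. This is not the OrientDB API; the `Edge` class, the property names and the values are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: str   # id of the subject vertex
    label: str    # the predicate
    target: str   # id of the object vertex
    # Arbitrary key/value properties attached to the edge itself.
    properties: dict = field(default_factory=dict)


edge = Edge(
    "register-for-goods-and-services-tax-(GST).aspx",
    "pertains_to_the_topic_of",
    "registration-and-licences",
    properties={"pertinence": 0.8, "asserted_by": "url_rule"},
)
```

If edges can hold properties like this, a pertinence score would not need a separate predicate node.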
My long post just boiled down to: I vote we steer clear of creating ultra-hierarchical taxonomies by using classical object inheritance. The more I think about it, the more convinced I become that it would be a very bad move...
When we bang it out, it should be as an OrientDB graph :)
There are two kinds of assertions:
I think PageAssertions would be sufficient for the first iteration, although ElementAssertions are potentially really interesting from an information extraction perspective.
In the RDF convention for PageAssertions, the pattern would be triples of {subject, predicate, object}. In OrientDB, would we have Subjects and Objects as subclasses of Vertex, and Predicates as subclasses of Edge?
If these are the Subjects and Objects:
These predicates could be asserted automatically:
These predicates could be asserted by people:
Then we could establish (by inference):
What else would be useful for the first iteration?
This comment also relates to #10 and #12. Do we need another interface for managing the Department/domain ownership problem? Can we stuff a dataset into data.gov.au and pretend there's a loosely coupled process somewhere keeping it accurate?
Fancy information extraction stuff might make educated guesses about the "pertains_to_topic" predicate (for example), so the human-in-the-loop process might evolve into endorsing/refuting automatic guesses. But that's obviously for later, along with making assertions about PageElements.
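A small sketch of what endorsing/refuting guesses might look like as data, assuming each assertion carries a source and a review status. All names here (`Assertion`, `review`, `training_labels`, the status strings) are hypothetical, not an agreed design.

```python
from dataclasses import dataclass


@dataclass
class Assertion:
    subject: str
    predicate: str
    object: str
    source: str = "machine"   # "machine" or "human"
    status: str = "proposed"  # "proposed", "endorsed" or "refuted"


def review(assertion, accept):
    """Record a human decision on a proposed assertion."""
    assertion.status = "endorsed" if accept else "refuted"
    return assertion


def training_labels(assertions):
    """Turn reviewed assertions into (triple, label) pairs for later ML use."""
    return [
        ((a.subject, a.predicate, a.object), a.status == "endorsed")
        for a in assertions
        if a.status != "proposed"
    ]


guesses = [
    Assertion("page-a", "pertains_to_topic", "tax"),
    Assertion("page-b", "pertains_to_topic", "tax"),
    Assertion("page-c", "pertains_to_topic", "grants"),
]
review(guesses[0], accept=True)
review(guesses[1], accept=False)
labels = training_labels(guesses)
```

Each review doubles as a labelled example, which is where the implicit supervision would come from.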