Closed GoogleCodeExporter closed 8 years ago
I've seen SDMX data that handles this by having a special code for "sum/total"
in the code list used for the aggregated dimension. So if you have observations
for each country, and then want a total, you'd have a code list with "UK",
"IE", "FR", and "ALL" or "EU".
This feels a bit ugly because the meaning of this code is so context-dependent
that it's not really re-usable, while the other country codes in general feel
very re-usable. A SKOS representation of the code list could mitigate this a
bit by making the code a parent code of the summed countries.
Also, it's not easy to SPARQL against this: you need a *lot* of knowledge about
the dataset to isolate the total (to find out that "EU" is a completely
different thing from "UK"). But it seems to be the proper SDMX way of handling
this?
I think the proposal of attaching summed observations to a group is quite
un-SDMX-like. Also, whether an observation is derived should be indicated by an
attribute rather than by typing it as sdmx:DerivedObservation.
Original comment by richard....@gmail.com
on 30 Mar 2010 at 12:19
To be clear, I'm not attaching observations to a group. The summation values
are in a TimeSeries, just like normal. There is a relationship between that
TimeSeries and the group of series of which it is the sum. Although, having
just looked at my code, I've not yet materialized that relationship.
I accept that DerivedObservation isn't the way to go, and that current SDMX
attributes can be used to denote observations that aren't first-class observed
values.
Original comment by i.j.dick...@gmail.com
on 30 Mar 2010 at 12:34
> But it seems to be the proper SDMX way of handling this?
This seems to be another point where the goals of SDMX per se differ from the
goals of published linked open statistics. I think we shouldn't feel overly
constrained by faithfulness to SDMX if that in turn presents problems to LOD
users.
Original comment by i.j.dick...@gmail.com
on 30 Mar 2010 at 12:44
Ian said: “The summation values are in a TimeSeries, just like normal. There
is a relationship between that TimeSeries and the group of series of which it
is the sum.”
My suggestion would be to have that relationship not between the TimeSeries,
but between the codes in the code list (using skos:broader etc).
So, if I have <x> and want to see whether that's an aggregate, I could do:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?dimension ?part WHERE {
  <x> ?dimension ?code .
  ?part ?dimension ?subcode .
  ?subcode skos:broader ?code .
}
This is more verbose, but also tells you the dimension along which the
aggregation was performed.
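As a rough illustration of what that query pattern matches, here is a minimal Python sketch that evaluates the same three triple patterns over a hand-made set of triples. The `ex:` names, the `ex:refArea` dimension and the observations are all made up for the example; only `skos:broader` comes from the discussion.

```python
SKOS_BROADER = "skos:broader"

# Hypothetical data: two country observations plus one total, all sharing
# the (made-up) dimension ex:refArea.
TRIPLES = {
    ("ex:obsUK", "ex:refArea", "ex:UK"),
    ("ex:obsIE", "ex:refArea", "ex:IE"),
    ("ex:obsEU", "ex:refArea", "ex:EU"),
    ("ex:UK", SKOS_BROADER, "ex:EU"),
    ("ex:IE", SKOS_BROADER, "ex:EU"),
}


def find_parts(triples, x):
    """Return (dimension, part) pairs for which x is the aggregate."""
    results = set()
    for subject, dimension, code in triples:
        if subject != x:
            continue
        # Look for other resources that use the same dimension with a code
        # whose skos:broader parent is the code used by x.
        for part, predicate, subcode in triples:
            if predicate == dimension and (subcode, SKOS_BROADER, code) in triples:
                results.add((dimension, part))
    return results


parts = find_parts(TRIPLES, "ex:obsEU")
```

An empty result for a given observation would suggest it is not an aggregate along any dimension, at least as far as the asserted `skos:broader` links go.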
I created Issue 23 to track discussion on the question of faithfulness to SDMX
vs. going with simpler/more obvious solutions.
Original comment by richard....@gmail.com
on 1 Apr 2010 at 9:28
I think I've now got to the bottom of the aggregation support in SDMX, and have
a proposal for how we should proceed. As references, I'm using section 6
(p. 56) and section 8 (p. 64) of the Information Model, and section 6.3.8
(p. 92) of the Implementors Guide.
So as far as I can see, everyone** who has suggested that aggregation and other
roll-up structure, in data like the PESA table I've been working with, should
be expressed through the code lists is exactly right.
Relations between the codes express the relationship between a collection of
data and its summation, average, etc. However, there's a design choice in SDMX
that the hierarchy is represented separately from the concepts in the code
list, so in principle the same codes could appear in different roles in
different hierarchies. The codes in a code list are a flat structure; a
separate Hierarchy object collects together the CodeAssociations between parent
and child codes in a hierarchy.
SDMX defines two ways of defining hierarchies: value-based and level-based.
Value-based is a typical graph structure: each association node connects one
parent to one child, with a named association type. It's a reified triple, in
other words. Level-based declares certain groups of codes to be level1, level2,
etc.
I don't have any particular data to base this on, but my gut feeling is that
cases where one code appears in different roles in different hierarchies are
relatively rare, especially in domain-specific code lists such as the PESA ones
I've been looking at.
Proposal
--------
I suggest we support two patterns for encoding hierarchical code lists in
SDMX-RDF: simple and (more) complete.
In the simple case, we forego the option to have code list concepts play
different named roles. In this case, we can simply add skos:broader /
skos:narrower relations to the concept definitions themselves.
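A minimal sketch of the simple pattern, in Python for want of a shared toolchain: given a parent-to-children map, it emits the skos:narrower / skos:broader statements that would be added to the concept definitions. The `ex:` codes are hypothetical and the output is only Turtle-like text, not a complete document.

```python
def simple_hierarchy_triples(children_of):
    """Emit skos:narrower / skos:broader triples for a parent -> children map."""
    triples = []
    for parent, children in children_of.items():
        for child in children:
            triples.append((parent, "skos:narrower", child))
            triples.append((child, "skos:broader", parent))
    return triples


# Hypothetical code list: one aggregate code and three country codes.
triples = simple_hierarchy_triples({"ex:EU": ["ex:UK", "ex:IE", "ex:FR"]})
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

No named roles appear anywhere here, which is exactly what the simple case gives up.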
In the more complete case, we keep a flat list of concepts as currently, but
add an additional sdmx:Hierarchy definition. This would require some extra
vocabulary in sdmx.ttl:
sdmx:Hierarchy
sdmx:association
sdmx:CodeAssociation
sdmx:source
sdmx:target
sdmx:associationType
sdmx:AssociationType
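To make the shape of the more complete pattern concrete, here is a hedged Python sketch that builds the reified association nodes as plain triples. The term names follow the list above (with "association" spelled out in full). Which of sdmx:source and sdmx:target is the parent is not settled in this thread, so the sketch simply assumes source = parent and target = child. The `ex:` names and the blank-node labels are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeAssociation:
    source: str            # assumed here to be the parent code
    target: str            # assumed here to be the child code
    association_type: str  # the named role of this association


def hierarchy_triples(hierarchy, associations):
    """Build triples for one hierarchy and its reified code associations."""
    triples = [(hierarchy, "rdf:type", "sdmx:Hierarchy")]
    for i, assoc in enumerate(associations):
        node = f"_:assoc{i}"  # one blank node per association
        triples += [
            (hierarchy, "sdmx:association", node),
            (node, "rdf:type", "sdmx:CodeAssociation"),
            (node, "sdmx:source", assoc.source),
            (node, "sdmx:target", assoc.target),
            (node, "sdmx:associationType", assoc.association_type),
        ]
    return triples


complete = hierarchy_triples(
    "ex:regionHierarchy",
    [
        CodeAssociation("ex:EU", "ex:UK", "ex:memberState"),
        CodeAssociation("ex:EU", "ex:IE", "ex:memberState"),
    ],
)
```

The codes themselves stay in a flat list; only the hierarchy object knows about parents and children, so the same code could take part in a second hierarchy with a different association type.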
This is only partial support for SDMX, since it focusses on the value-based
hierarchy, and ignores the level-based hierarchy. I feel the value-based
hierarchy is a better fit for RDF's graph structure, and I believe that a
level-based hierarchy in SDMX-ML could be transformed to a value-based
hierarchy in SDMX-RDF (proof left as easy exercise for reader :). I've also
left out a couple of nodes from the UML structure in the Information Model
(e.g. CodeComposition), but I don't think they add any representational power.
By default, we would advocate the use of the simple scheme.
** That would be Jeni, Dave and Richard!
Original comment by i.j.dick...@gmail.com
on 13 Apr 2010 at 11:06
Sounds like exactly the right approach to me.
Original comment by jeni.ten...@gmail.com
on 13 Apr 2010 at 11:50
Thanks for looking into this Ian, this clarifies everything a whole lot.
I'd advocate focusing on the simple hierarchy scheme and leaving the complex
one for later, because it doesn't seem to be essential to what we are doing
right now, and we'd really need to look at some good examples of existing SDMX
usage.
I have one remaining doubt: if you have a SKOS hierarchy, how do you know the
kind of aggregation that is performed? For example, if A has three sub-codes B,
C and D, how do I know whether obsValues for A are the sum or average of B, C
and D? Is this information encoded somewhere in a complete SDMX representation?
Original comment by richard....@gmail.com
on 13 Apr 2010 at 12:22
My reading of the Information Model (section 6, p. 79, not 56) is that the term
"aggregation" is being used purely to denote summation. Though that section is
not completely clear.
In any case I think a generalized notion of aggregation is out of scope for
now.
Dave
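If aggregation is read as plain summation, a consumer could sanity-check a dataset along these lines. This is only a sketch in Python: the values, the `ex:` codes and the `narrower` map are invented, and it assumes one observed value per code.

```python
def is_sum_of_children(values, narrower, parent, tolerance=1e-9):
    """Check whether the parent's value equals the sum of its children's values."""
    children = narrower.get(parent, [])
    if not children:
        return False  # nothing to aggregate over
    total = sum(values[child] for child in children)
    return abs(values[parent] - total) <= tolerance


# Hypothetical hierarchy: A has the three sub-codes B, C and D.
narrower = {"ex:A": ["ex:B", "ex:C", "ex:D"]}
summed = {"ex:A": 60.0, "ex:B": 10.0, "ex:C": 20.0, "ex:D": 30.0}
averaged = {"ex:A": 20.0, "ex:B": 10.0, "ex:C": 20.0, "ex:D": 30.0}
```

A parent holding an average rather than a sum fails the check, which is the case the SKOS links alone cannot distinguish.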
Original comment by Dave.e.R...@gmail.com
on 13 Apr 2010 at 12:38
Ok, works for me. Thanks Dave.
Original comment by richard....@gmail.com
on 13 Apr 2010 at 1:03
On the call, Ian pointed out that it's usually the aggregation concepts that
differ from dataset to dataset, and that re-use of the base-level concepts
becomes easier if only skos:narrower relationships are asserted (not
skos:broader).
Original comment by richard....@gmail.com
on 15 Apr 2010 at 9:33
All clear now.
Original comment by richard....@gmail.com
on 22 Apr 2010 at 9:52
Original issue reported on code.google.com by
i.j.dick...@gmail.com
on 24 Mar 2010 at 6:46