Letractively / publishing-statistical-data

Automatically exported from code.google.com/p/publishing-statistical-data

Denotation for derived statistics #9

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
In the PESA data, many of the time series represent sub-totals or other
summations, rather than original observations. Discussion on the list
suggests that it would be good to include these sub-totals in the linked
data presentation, since they are part of the official .xls presentation
from the treasury. 

However, representing the three levels of sub-totalling in the dataset as
normal observations would require three extra dimension properties, to
identify the totals and distinguish them from the base-level (non-summed)
numbers.

An alternative is to just connect the time series for the summed values to
a group key denoting the summed-over values (i.e. to not give dimensions
for the sum time series). I think this would be a better design, but I
would suggest that the sdmx:Observations in this case should be
distinguished from the normal case of observations. This could be achieved
by decorating the Observation resource with a suitable SDMX attribute, but
I couldn't find one that matches the use case. Alternatively, we could
introduce a new sub-class of Observation (e.g. sdmx:DerivedObservation).

Original issue reported on code.google.com by i.j.dick...@gmail.com on 24 Mar 2010 at 6:46

GoogleCodeExporter commented 8 years ago
I've seen SDMX data that handles this by having a special code for "sum/total"
in the code list used for the aggregated dimension. So if you have observations
for each country, and then want a total, you'd have a code list with "UK",
"IE", "FR", and "ALL" or "EU".

This feels a bit ugly because the meaning of this code is so context-dependent
that it's not really re-usable, while the other country codes in general feel
very re-usable. A SKOS representation of the code list could mitigate this a
bit by making the code a parent code of the summed countries.

Also, it's not easy to SPARQL against this: you need a *lot* of knowledge about
the dataset to isolate the total (to find out that "EU" is a completely
different thing from "UK"). But it seems to be the proper SDMX way of handling
this?

I think the proposal of attaching summed observations to a group is quite
un-SDMX-like. Also, whether an observation is derived should be indicated by an
attribute rather than by typing it as sdmx:DerivedObservation.

Original comment by richard....@gmail.com on 30 Mar 2010 at 12:19

GoogleCodeExporter commented 8 years ago
To be clear, I'm not attaching observations to a group. The summation values
are in a TimeSeries, just like normal. There is a relationship between that
TimeSeries and the group of series of which it is the sum. Although, having
just looked at my code, I've not yet materialized that relationship.

I accept that DerivedObservation isn't the way to go, and that current SDMX
attributes can be used to denote observations that aren't first-class observed
values.

Original comment by i.j.dick...@gmail.com on 30 Mar 2010 at 12:34

GoogleCodeExporter commented 8 years ago
> But it seems to be the proper SDMX way of handling this?
This seems to be another point where the goals of SDMX per se differ from the
goals of published linked open statistics. I think we shouldn't feel overly
constrained by faithfulness to SDMX if that in turn presents problems to LOD
users.

Original comment by i.j.dick...@gmail.com on 30 Mar 2010 at 12:44

GoogleCodeExporter commented 8 years ago
Ian said: “The summation values are in a TimeSeries, just like normal. There is
a relationship between that TimeSeries and the group of series of which it is
the sum.”

My suggestion would be to have that relationship not between the TimeSeries,
but between the codes in the code list (using skos:broader etc).

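So, if I have <x> and want to see whether that's an aggregate, I could do:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?dimension ?part WHERE {
  <x> ?dimension ?code .        # <x> carries ?code on some dimension
  ?part ?dimension ?subcode .   # ?part uses the same dimension...
  ?subcode skos:broader ?code . # ...with a code directly below <x>'s code
}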

This is more verbose, but also tells you the dimension along which the
aggregation was performed.
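
To make the pattern concrete, here is a minimal Python sketch of the same join over an in-memory set of triples. The obs:, dim: and code: names are invented for illustration and do not come from any real dataset; a real consumer would run the SPARQL query against a store instead.

```python
# Triples are plain (subject, predicate, object) tuples; every name is made up.
BROADER = "skos:broader"

triples = {
    ("obs:eu", "dim:area", "code:EU"),   # plays the role of <x>
    ("obs:uk", "dim:area", "code:UK"),
    ("obs:fr", "dim:area", "code:FR"),
    ("code:UK", BROADER, "code:EU"),
    ("code:FR", BROADER, "code:EU"),
}


def aggregated_parts(x, triples):
    """Return (dimension, part) pairs, mirroring the SELECT query above."""
    results = set()
    for subject, dimension, code in triples:
        if subject != x:
            continue
        # Look for other subjects on the same dimension whose code sits
        # directly below the code attached to x.
        for part, predicate, subcode in triples:
            if predicate == dimension and (subcode, BROADER, code) in triples:
                results.add((dimension, part))
    return results


parts_of_eu = aggregated_parts("obs:eu", triples)
parts_of_uk = aggregated_parts("obs:uk", triples)
```

An empty result means that, as far as the code hierarchy says, <x> is not an aggregate of anything.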

I created Issue 23 to track discussion on the question of faithfulness to SDMX
vs. going with simpler/more obvious solutions.

Original comment by richard....@gmail.com on 1 Apr 2010 at 9:28

GoogleCodeExporter commented 8 years ago
I think I've now got to the bottom of the aggregation support in SDMX, and have
a proposal for how we should proceed. As references, I'm using section 6 (p. 56)
and section 8 (p. 64) of the Information Model, and section 6.3.8 (p. 92) of the
Implementors Guide.

So as far as I can see, everyone** who has suggested that aggregation and other
roll-up structure, in data like the PESA table I've been working with, should
be expressed through the code list is exactly right. Relations between the
codes express the relationship between a collection of data and its summation,
average, etc. However, there's a design choice in SDMX that the hierarchy is
represented separately from the concepts in the code list, so in principle the
same codes could appear in different roles in different hierarchies. The codes
in a code list are a flat structure; a separate Hierarchy object collects
together the CodeAssociations between parent and child codes in a hierarchy.

SDMX defines two ways of defining hierarchies: value-based and level-based.
Value-based is a typical graph structure: each association node connects one
parent to one child, with a named association type. It's a reified triple, in
other words. Level-based declares certain groups of codes to be level1, level2,
etc.

I don't have any particular data to base this on, but my gut feeling is that
cases where one code appears in different roles in different hierarchies are
relatively rare, especially in domain-specific code lists such as the PESA ones
I've been looking at.

Proposal
--------

I suggest we support two patterns for encoding hierarchical code lists in
SDMX-RDF: simple and (more) complete.

In the simple case, we forego the option to have code list concepts play
different named roles. In this case, we can simply add skos:broader /
skos:narrower relations to the concept definitions themselves.
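
As a rough illustration of the simple case, the sketch below models a three-level code list as skos:narrower triples and walks down to the base-level codes under a total. The code: names are hypothetical stand-ins, not the actual PESA codes, and the walk assumes the hierarchy has no cycles.

```python
NARROWER = "skos:narrower"

# Hypothetical three-level code list: total -> sub-totals -> base codes.
concept_triples = [
    ("code:total", NARROWER, "code:subtotal-a"),
    ("code:total", NARROWER, "code:subtotal-b"),
    ("code:subtotal-a", NARROWER, "code:a1"),
    ("code:subtotal-a", NARROWER, "code:a2"),
    ("code:subtotal-b", NARROWER, "code:b1"),
]


def children(code, triples):
    """Codes directly below `code`, in the order they were asserted."""
    return [o for s, p, o in triples if s == code and p == NARROWER]


def base_codes(code, triples):
    """Leaf codes beneath `code`; a code with no children is its own base."""
    kids = children(code, triples)
    if not kids:
        return [code]
    leaves = []
    for kid in kids:
        leaves.extend(base_codes(kid, triples))
    return leaves


total_leaves = base_codes("code:total", concept_triples)
```

Because the relations live on the concepts themselves, no extra dimension property is needed to tell a total from a base-level value: a code is an aggregate exactly when it has narrower codes.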

In the more complete case, we keep a flat list of concepts as currently, but
add an additional sdmx:Hierarchy definition. This would require some extra
vocabulary in sdmx.ttl:

sdmx:Hierarchy
sdmx:association
sdmx:CodeAssociation
sdmx:source
sdmx:target
sdmx:associationType
sdmx:AssociationType

This is only partial support for SDMX, since it focusses on the value-based
hierarchy, and ignores the level-based hierarchy. I feel the value-based
hierarchy is a better fit for RDF's graph structure, and I believe that a
level-based hierarchy in SDMX-ML could be transformed to a value-based
hierarchy in SDMX-RDF (proof left as an easy exercise for the reader :). I've
also left out a couple of nodes from the UML structure in the Information Model
(e.g. CodeComposition), but I don't think they add any representational power.
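
For the value-based hierarchy described above, a data-structure sketch might look like the following. It is only a Python analogue of the proposed vocabulary, not SDMX-RDF itself. Treating source as the parent and target as the child is an assumption made here, and the hierarchy names, the "sum-of" association type and the EUROZONE code are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CodeAssociation:
    source: str            # assumed here to be the parent code
    target: str            # assumed here to be the child code
    association_type: str  # the named role, e.g. a hypothetical "sum-of"


@dataclass
class Hierarchy:
    name: str
    associations: list = field(default_factory=list)

    def children(self, code):
        """Child codes of `code` within this hierarchy only."""
        return [a.target for a in self.associations if a.source == code]


# The code list itself stays flat; hierarchies are layered on top of it.
code_list = ["UK", "IE", "FR", "EU", "EUROZONE"]

by_union = Hierarchy("by-union", [
    CodeAssociation("EU", "UK", "sum-of"),
    CodeAssociation("EU", "IE", "sum-of"),
    CodeAssociation("EU", "FR", "sum-of"),
])
by_currency = Hierarchy("by-currency", [
    CodeAssociation("EUROZONE", "IE", "sum-of"),
    CodeAssociation("EUROZONE", "FR", "sum-of"),
])

eu_members = by_union.children("EU")
euro_members = by_currency.children("EUROZONE")
```

The same base codes ("IE", "FR") take part in two hierarchies without their own definitions changing, which is the flexibility the simple scheme gives up.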

By default, we would advocate the use of the simple scheme.

** That would be Jeni, Dave and Richard!

Original comment by i.j.dick...@gmail.com on 13 Apr 2010 at 11:06

GoogleCodeExporter commented 8 years ago
Sounds like exactly the right approach to me.

Original comment by jeni.ten...@gmail.com on 13 Apr 2010 at 11:50

GoogleCodeExporter commented 8 years ago
Thanks for looking into this Ian, this clarifies everything a whole lot.

I'd advocate focusing on the simple hierarchy scheme and leaving the complex
one for later, because it doesn't seem to be essential to what we are doing
right now, and we'd really need to look at some good examples of existing SDMX
usage.

I have one remaining doubt: If you have a SKOS hierarchy, how do you know the
kind of aggregation that is performed? For example, if A has three sub-codes B,
C and D, how do I know whether obsValues for A are the sum or average of B, C
and D? Is this information encoded somewhere in a complete SDMX representation?
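
To illustrate why this matters: without an explicit statement, a consumer could only guess the aggregation from the numbers. The sketch below shows such a guess, with made-up values and hypothetical helper names; it is not part of SDMX or of any proposal in this thread.

```python
import math


def looks_like_sum(parent_value, child_values, rel_tol=1e-9):
    """True if the parent value matches the sum of the child values."""
    return math.isclose(parent_value, sum(child_values), rel_tol=rel_tol)


def looks_like_mean(parent_value, child_values, rel_tol=1e-9):
    """True if the parent value matches the arithmetic mean of the children."""
    if not child_values:
        return False
    mean = sum(child_values) / len(child_values)
    return math.isclose(parent_value, mean, rel_tol=rel_tol)


# Made-up obsValues for sub-codes B, C and D.
b_c_d = [10.0, 20.0, 30.0]
a_is_sum = looks_like_sum(60.0, b_c_d)
a_is_mean = looks_like_mean(20.0, b_c_d)
```

The guess is fragile: with a single sub-code both checks pass, and rounding in published figures can make both fail.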

Original comment by richard....@gmail.com on 13 Apr 2010 at 12:22

GoogleCodeExporter commented 8 years ago
My reading of the Information Model (section 6, p.79 not 56) is that the term
"aggregation" is being used purely to denote summation, though that section is
not completely clear.

In any case I think a generalized notion of aggregation is out of scope for now.

Dave

Original comment by Dave.e.R...@gmail.com on 13 Apr 2010 at 12:38

GoogleCodeExporter commented 8 years ago
Ok, works for me. Thanks Dave.

Original comment by richard....@gmail.com on 13 Apr 2010 at 1:03

GoogleCodeExporter commented 8 years ago
On the call, Ian pointed out that it's usually the aggregation concepts that
differ from dataset to dataset, and that re-use of the base-level concepts
becomes easier if only skos:narrower relationships are asserted (not
skos:broader).
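
A small sketch of that idea, with invented a: and b: aggregate concepts standing in for two datasets: the shared base-level codes carry no hierarchy statements, each dataset asserts skos:narrower downwards from its own aggregate, and a consumer that wants parents reads the relation backwards (SKOS defines skos:broader as the inverse of skos:narrower).

```python
NARROWER = "skos:narrower"

# Shared base-level concepts: no hierarchy statements attached to them.
base_concepts = {"code:UK", "code:IE", "code:FR"}

# Each dataset adds its own aggregate concept, pointing downwards only.
dataset_a = [
    ("a:ALL", NARROWER, "code:UK"),
    ("a:ALL", NARROWER, "code:IE"),
    ("a:ALL", NARROWER, "code:FR"),
]
dataset_b = [
    ("b:subtotal", NARROWER, "code:UK"),
    ("b:subtotal", NARROWER, "code:IE"),
]


def broader_of(code, triples):
    """Derive parents of `code` by reading skos:narrower backwards."""
    return sorted(s for s, p, o in triples if p == NARROWER and o == code)


uk_parents_in_a = broader_of("code:UK", dataset_a)
uk_parents_merged = broader_of("code:UK", dataset_a + dataset_b)
```

Loading only dataset_a never exposes b:subtotal as a parent of code:UK, because nothing about either aggregate is written into the base concept's own description.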

Original comment by richard....@gmail.com on 15 Apr 2010 at 9:33

GoogleCodeExporter commented 8 years ago
All clear now.

Original comment by richard....@gmail.com on 22 Apr 2010 at 9:52