
Automatically exported from code.google.com/p/publishing-statistical-data

Attachment level #12


GoogleCodeExporter commented 9 years ago
SDMX allows ComponentProperties to be attached at the TimeSeries or Group level.

The current SDMX-RDF assumption is that all such values attach to each observation, for ease of query.

Should we allow at least some values (e.g. Measures) to be attached at higher levels?

Original issue reported on code.google.com by Dave.e.R...@gmail.com on 25 Mar 2010 at 10:07

GoogleCodeExporter commented 9 years ago
I think especially measures should always be at the observation level! Isn't that what defines an observation?

I think for dimensions it's extremely desirable to have as few alternative attachment levels as possible, preferably only one. Writing SPARQL queries for retrieving a specific observation is horrible if each dimension value could be found on any level.
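
For illustration, a minimal sketch of the kind of query this implies, assuming every dimension value sits directly on the observation; the namespace URIs, the sdmx:Observation class and the dimension properties (eg:refArea, eg:sex) are placeholders rather than agreed vocabulary:

    PREFIX sdmx: <http://example.org/sdmx#>
    PREFIX eg:   <http://example.org/dims#>

    SELECT ?value WHERE {
      ?obs a sdmx:Observation ;
           eg:refArea eg:UK ;
           eg:sex     eg:female ;
           sdmx:time  "2009" ;
           sdmx:obsValue ?value .
    }

If dimension values could also sit on the series or a group, each dimension pattern above would need alternative paths (e.g. via the series the observation belongs to), typically written as UNIONs.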

For attributes a bit more flexibility may not hurt, because here typical queries are more like “all attributes on time series level in cube region xyz”, and the cube region is defined by dimensions.
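
A sketch of such an attribute query, again with placeholder names (eg:refArea as a dimension attached to the series, eg:unitMult as a series-level attribute, sdmx:TimeSeries as an assumed class):

    PREFIX sdmx: <http://example.org/sdmx#>
    PREFIX eg:   <http://example.org/dims#>

    SELECT ?series ?unit WHERE {
      ?series a sdmx:TimeSeries ;
              eg:refArea ?area ;       # dimension defining the cube region
              eg:unitMult ?unit .      # a series-level attribute
      FILTER (?area IN (eg:UK, eg:FR, eg:DE))
    }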

If I'm not mistaken, attachment levels in SDMX are as follows:

For time series data sets:
- measure is always attached on the observation level
- the time pseudo-dimension is always on the observation level (it's not a declared dimension)
- all other dimensions are on the time series level
- attributes can be attached on observation, time series, group, or dataset level
- dimensions that are used in group keys are redundant; the values of those dimensions are found on the group and on the time series level (in the information model; in concrete syntaxes they might be implied by nesting on the time series level)
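
As a rough Turtle sketch of that shape (the namespace URIs, the sdmx:TimeSeries class, the dimension and attribute properties, and the sdmx:observation link from series to observation are all assumptions made for the example):

    @prefix sdmx: <http://example.org/sdmx#> .
    @prefix eg:   <http://example.org/dims#> .

    eg:series1 a sdmx:TimeSeries ;
        eg:refArea eg:UK ;             # non-time dimensions on the series
        eg:sex     eg:female ;
        eg:unitMult "0" ;              # a series-level attribute
        sdmx:observation eg:obs1 .

    eg:obs1 sdmx:time "2009" ;         # time stays on the observation
        sdmx:obsValue 42.0 .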

For cross-sectional data sets (not sure if I really understand everything):
- there is no measure, but there are usually several XSMeasures, which are referenced on the observation level (they are not handled as part of a normal dimension, but there's an optional MeasureTypeDimension pseudo-dimension mechanism that is used for mapping cross-sectional datasets to normal time series datasets)
- dimensions can be attached on the observation, section or group level
- the time dimension, if present, must be attached on the group level
- attributes can be attached on observation, section, group or dataset level

The proposed “flattened”, SCOVO-like design for SDMX-RDF syntax (usable for time series and XS datasets) would be:
- measure is always on the observation level
- dimensions are always on the observation level
- time is treated as a normal dimension like all others
- attributes can be on any level
- time series, groups and sections are treated as optional additional structure within the dataset and can be omitted if no attributes attach to them
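
In the flattened shape, the same observation would look roughly like this (placeholder names again; only sdmx:obsValue, sdmx:time and sdmx:dataset appear elsewhere in this thread):

    @prefix sdmx: <http://example.org/sdmx#> .
    @prefix eg:   <http://example.org/dims#> .

    eg:obs1 sdmx:dataset eg:dataset1 ;
        eg:refArea eg:UK ;             # every dimension on the observation
        eg:sex     eg:female ;
        sdmx:time  "2009" ;            # time treated like any other dimension
        eg:unitMult "0" ;              # attributes may also sit here
        sdmx:obsValue 42.0 .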

An alternative might be to follow the time series model and attach all dimensions but time on the time series level. This would require quite a bit of redesign. The flattened model would still be necessary to support cross-sectional datasets.

I believe that attaching dimensions anywhere else is a horrible idea and should be avoided.

Original comment by richard....@gmail.com on 30 Mar 2010 at 7:06

GoogleCodeExporter commented 9 years ago
Having done more thinking on this, I'll go out on a limb now and assert that SDMX's design for cross-sectional datasets is botched and we should ignore it as best we can.

A possible idea would be to support two kinds of datasets, sdmx:TimeSeriesDataSet and sdmx:FlatDataSet. The former has key values attached to time series; the latter has key values attached to observations, with an optional additional sdmx:time dimension. SPARQL query writers would probably write two versions of their queries, depending on the type of the dataset.
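
A sketch of the two query shapes, one per dataset type (the eg:refArea dimension property and the sdmx:observation link from series to observation are assumptions):

    # Against an sdmx:FlatDataSet: everything hangs off the observation.
    PREFIX sdmx: <http://example.org/sdmx#>
    PREFIX eg:   <http://example.org/dims#>

    SELECT ?value WHERE {
      ?obs eg:refArea eg:UK ;
           sdmx:time  "2009" ;
           sdmx:obsValue ?value .
    }

and against an sdmx:TimeSeriesDataSet, where key values sit on the series:

    PREFIX sdmx: <http://example.org/sdmx#>
    PREFIX eg:   <http://example.org/dims#>

    SELECT ?value WHERE {
      ?series eg:refArea eg:UK ;
              sdmx:observation ?obs .
      ?obs sdmx:time "2009" ;
           sdmx:obsValue ?value .
    }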

Original comment by richard....@gmail.com on 30 Mar 2010 at 11:18

GoogleCodeExporter commented 9 years ago
For the flattened design I'd be inclined to require attributes to be at the series level (though they can be replicated higher up) to reduce the number of places you have to look.

Is it worth having the separate TimeSeriesDataSet notion, or should we just have everything flat? The value of having TimeSeriesDataSet would be (a) a closer match to typical SDMX usage and (b) a space saving.

To estimate the space saving, let's suppose we have 6 dimensions (time plus 5 others), 10 values for each, 2 attributes, and a dense cube, i.e. 10^6 observations and 10^5 time series. If we put attributes at the time series level then in the flat case each observation corresponds to 10 triples (6 dims, rdf:type, sdmx:obsValue, sdmx:dataset, inv-sdmx:observation) and each time series has 4 (2 attributes, rdf:type, inv-sdmx:key), giving 10 × 10^6 + 4 × 10^5 = 10,400,000 triples plus noise. In the TimeSeriesDataSet case each observation has 5 (time, rdf:type, sdmx:obsValue, sdmx:dataset, inv-sdmx:observation) and each time series has 9, giving 5 × 10^6 + 9 × 10^5 = 5,900,000 triples. Though I'd argue that with a TimeSeriesDataSet you don't need both sdmx:dataset and inv-sdmx:observation on every observation, which would bring that down to more like 5m triples, i.e. half.

Not sure what to make of that. A 50% saving is just big enough to be useful but not big enough to be a compelling argument.

My inclination is to have both available but recommend use of sdmx:FlatDataSet for most purposes.

Original comment by Dave.e.R...@gmail.com on 1 Apr 2010 at 1:22

GoogleCodeExporter commented 9 years ago
An argument for TimeSeriesDataSet is that it makes the structure more apparent, because dimensions only appear where they vary. I would also have a minor concern that a triple set with lots of duplicated information in it could look buggy (that would be my reaction anyway!).

Original comment by i.j.dick...@gmail.com on 1 Apr 2010 at 1:47

GoogleCodeExporter commented 9 years ago
One of the possibilities that emerged in the call:

There could be a “compact publisher view” and a “convenient but high-redundancy consumer view for easy querying”. Tools could translate from the former to the latter in a very mechanical way, perhaps as easily as running a SPARQL CONSTRUCT query.

The “compact publisher view” would correspond to the time series modelling. It could be indicated by typing the dataset as sdmx:TimeSeriesDataSet. All dimensions but time would go on the TimeSeries level. Attributes go to whatever level has been declared as their attachment level in the DSD.

The “high-redundancy consumer view” would be indicated by typing the dataset as sdmx:FlatDataSet. All dimensions and attributes would go on the observation level.
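
As a sketch of the mechanical translation mentioned above, a single CONSTRUCT could copy everything attached to a series down onto its observations (the sdmx:observation link and the namespace URI are assumptions):

    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX sdmx: <http://example.org/sdmx#>

    CONSTRUCT {
      ?obs ?p ?o .
    }
    WHERE {
      ?series sdmx:observation ?obs ;
              ?p ?o .
      FILTER (?p != rdf:type && ?p != sdmx:observation)
    }

Adding the constructed triples to the original data would give roughly the flat view; in practice the dataset would presumably also be retyped as sdmx:FlatDataSet.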

A dataset could be typed as TimeSeriesDataSet and FlatDataSet at the same time, giving the advantage that you could also SPARQL directly for time series.

The narrative would explain that publishers SHOULD provide the FlatDataSet if feasible. But if not, then the transformation could also be done by the client, or the client might decide to query against the less convenient TimeSeries representation. Either way, having the two subtypes of DataSet allows a consumer to find out what they are dealing with and where to expect the dims/atts.

Original comment by richard....@gmail.com on 15 Apr 2010 at 10:35

GoogleCodeExporter commented 9 years ago

Original comment by i.j.dick...@gmail.com on 7 May 2010 at 9:06