Open GoogleCodeExporter opened 9 years ago
I think measures especially should always be at the observation level! Isn't
that what defines an observation?
I think for dimensions it's extremely desirable to have as few alternative
attachment levels as possible,
preferably only one. Writing SPARQL queries for retrieving a specific
observation is horrible if each dimension
value could be found on any level.
For attributes a bit more flexibility may not hurt, because here typical
queries are more like “all attributes on
time series level in cube region xyz”, and the cube region is defined by
dimensions.
If I'm not mistaken, attachment levels in SDMX are as follows:
For time series data sets:
- measure is always attached on the observation level
- the time pseudo-dimension is always on the observation level (it's not a
declared dimension)
- all other dimensions are on the time series level
- attributes can be attached on observation, time series, group, or dataset
level
- dimensions that are used in group keys are redundant; the values of those
dimensions are found on the
group and on the time series level (in the information model; in concrete
syntaxes they might be implied by
nesting on the time series level)
For cross-sectional data sets (not sure if I really understand everything):
- there is no measure but there are usually several XSMeasures, which are
referenced on the observation level (they are not handled as part of a normal
dimension, but there's an optional MeasureTypeDimension pseudo-dimension
mechanism that is used for mapping cross-sectional datasets to normal time
series datasets)
- dimensions can be attached on the observation, section or group level
- the time dimension, if present, must be attached on the group level
- attributes can be attached on observation, section, group or dataset level
The proposed “flattened”, SCOVO-like design for SDMX-RDF syntax (usable for
time series and XS datasets)
would be:
- measure is always on the observation level
- dimensions are always on the observation level
- time is treated as a normal dimension like all others
- attributes can be on any level
- time series, groups and sections are treated as optional additional structure
within the dataset and can be
omitted if no attributes attach to them
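
As a rough illustration of why the flattened design is easy to query, here is a
minimal sketch that uses plain Python dictionaries as a stand-in for RDF. The
dimension names (area, sex, time), the values, and the find_observations helper
are all made up for the example; none of them come from SDMX or from the
proposed vocabulary.

```python
# Hypothetical stand-in for the flattened design: every observation carries
# its measure and all of its dimension values, with time as a normal dimension.
observations = [
    {"dims": {"area": "UK", "sex": "F", "time": "2009"}, "obs_value": 10.5},
    {"dims": {"area": "UK", "sex": "F", "time": "2010"}, "obs_value": 11.0},
    {"dims": {"area": "UK", "sex": "M", "time": "2010"}, "obs_value": 9.8},
]


def find_observations(observations, **wanted):
    """Return the observations whose dimension values match all of `wanted`.

    Only the observation itself is inspected, because in the flat design
    there is no other level on which a dimension value could be attached.
    """
    return [
        obs
        for obs in observations
        if all(obs["dims"].get(dim) == value for dim, value in wanted.items())
    ]


# A fully specified key selects one observation.
exact = find_observations(observations, area="UK", sex="F", time="2010")

# A partial key selects a cube region.
region = find_observations(observations, area="UK", time="2010")
```

If dimension values could sit on the series, group or dataset level as well,
the matching step would have to check each of those levels for every dimension.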
An alternative might be to follow the time series model and attach all
dimensions but time on the time series
level. This would require quite a bit of redesign. The flattened model would
still be necessary to support
cross-sectional datasets.
I believe that attaching dimensions anywhere else is a horrible idea and should
be avoided.
Original comment by richard....@gmail.com
on 30 Mar 2010 at 7:06
Having done more thinking on this, I'll go out on a limb now and assert that
SDMX's design for cross-sectional
datasets is botched and we should ignore it as best as we can.
A possible idea would be to support two kinds of datasets,
sdmx:TimeSeriesDataSet and sdmx:FlatDataSet. The
former has key values attached to time series; the latter has key values
attached to observations, with an optional
additional sdmx:time dimension. SPARQL query writers would probably write two
versions of their queries,
depending on the type of the dataset.
Original comment by richard....@gmail.com
on 30 Mar 2010 at 11:18
For the flattened design I'd be inclined to require attributes to be at the
series level (though they can be replicated higher up) to reduce the number of
places you have to look.
Is it worth having the separate TimeSeriesDataSet notion or just have everything
flat? The value of having TimeSeriesDataSet would be (a) a closer match to
typical SDMX usage and (b) space saving.
To estimate the space saving, let's suppose we have 6 dimensions (time plus 5
others), 10 values for each, 2 attributes, and a dense cube. That gives 10^6
observations and 10^5 time series. If we put attributes at the time series
level, then in the flat case each observation corresponds to 10 triples (6 dim,
rdf:type, sdmx:obsValue, sdmx:dataset, inv-sdmx:observation) and each time
series has 4 (2 attributes, rdf:type, inv-sdmx:key): 10,400,000 triples plus
noise. In the TimeSeriesDataSet case each observation has 5 (time, rdf:type,
sdmx:obsValue, sdmx:dataset, inv-sdmx:observation) and each time series has 9
(the 4 above plus the 5 non-time dimensions): 5,900,000 triples. Though I'd
argue that with a TimeSeriesDataSet you don't need both sdmx:dataset and
inv-sdmx:observation on every observation, which would bring that down to more
like 5m triples, i.e. half.
Not sure what to make of that. A 50% saving is just big enough to be useful but
not big enough to be a compelling argument.
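
For anyone who wants to vary the assumptions, the estimate above can be
reproduced with a few lines of Python. This is only a sketch of the
back-of-envelope arithmetic: the triple_counts function and its parameter names
are invented here, and the "trimmed" figure assumes that exactly one of
sdmx:dataset / inv-sdmx:observation is dropped from each observation.

```python
def triple_counts(n_dims=6, values_per_dim=10, n_attrs=2):
    """Estimate triple counts for a dense cube under the three layouts.

    Assumes time is one of the `n_dims` dimensions and that attributes are
    attached at the time series level, as in the estimate above.
    """
    n_obs = values_per_dim ** n_dims            # one observation per cell
    n_series = values_per_dim ** (n_dims - 1)   # one series per non-time key

    # Flat: all dimensions on the observation, plus rdf:type, sdmx:obsValue,
    # sdmx:dataset and inv-sdmx:observation; the series keeps its attributes,
    # rdf:type and inv-sdmx:key.
    flat = n_obs * (n_dims + 4) + n_series * (n_attrs + 2)

    # Time series: only time stays on the observation; the other dimensions
    # move to the series.
    series_triples = (n_dims - 1) + n_attrs + 2
    time_series = n_obs * (1 + 4) + n_series * series_triples

    # Time series, trimmed: drop one of sdmx:dataset / inv-sdmx:observation
    # from every observation (an assumption about what "not both" means).
    trimmed = n_obs * (1 + 3) + n_series * series_triples

    return flat, time_series, trimmed


flat, time_series, trimmed = triple_counts()
print(f"flat: {flat:,}  time series: {time_series:,}  trimmed: {trimmed:,}")
print(f"trimmed / flat = {trimmed / flat:.2f}")
```

With the default parameters this gives 10,400,000, 5,900,000 and 4,900,000
triples, so the trimmed time series layout is a little under half the size of
the flat one.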
My inclination is to have both available but recommend use of sdmx:FlatDataSet
for most purposes.
Original comment by Dave.e.R...@gmail.com
on 1 Apr 2010 at 1:22
An argument for TimeSeriesDataSet is that it makes the structure more apparent,
because dimensions only appear where they vary. I would also have a minor
concern
that a triple set with lots of duplicated information in it could look buggy
(that
would be my reaction anyway!)
Original comment by i.j.dick...@gmail.com
on 1 Apr 2010 at 1:47
One of the possibilities that emerged in the call:
There could be a “compact publisher view” and “convenient but
high-redundancy consumer view for easy
querying”. Tools could translate from the former to the latter in a very
mechanical way, perhaps as easy as
running a SPARQL CONSTRUCT query.
The “compact publisher view” would correspond to the time series modelling.
It could be indicated by typing
the dataset as sdmx:TimeSeriesDataSet. All dimensions but time would go on the
TimeSeries level. Attributes
go to whatever level has been declared as their attachment level in the DSD.
The “high-redundancy consumer view” would be indicated by typing the
dataset as sdmx:FlatDataSet. All
dimensions and attributes would go on the observation level.
A dataset could be typed as TimeSeriesDataSet and FlatDataSet at the same time,
giving the advantage that
you could also SPARQL directly for time series.
The narrative would explain that publishers SHOULD provide the FlatDataSet if
feasible. But if not, then the
transformation could also be done by the client, or the client might decide to
query against the less
convenient TimeSeries representation. Either way, having the two subtypes of
DataSet allows a consumer to
find out what they are dealing with and where to expect the dims/atts.
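
To show how mechanical the translation from the compact view to the flat view
could be, here is a minimal sketch in plain Python, with dictionaries standing
in for RDF resources. The field names (dims, attrs, observations, obs_value) and
the sample data are hypothetical, and letting an observation-level attribute
override a series-level one is an assumption of this sketch, not something
decided in this thread.

```python
def flatten(series_list):
    """Turn a compact, time-series-shaped dataset into flat observations.

    Each flat observation receives the dimensions of its series plus its own
    time value, and the attributes of its series plus its own attributes.
    """
    flat = []
    for series in series_list:
        for obs in series["observations"]:
            flat.append({
                # Series-level dimensions are copied down; time is added as
                # a normal dimension.
                "dims": {**series["dims"], "time": obs["time"]},
                # Assumption: observation-level attributes win on conflict.
                "attrs": {**series.get("attrs", {}), **obs.get("attrs", {})},
                "obs_value": obs["obs_value"],
            })
    return flat


# Hypothetical compact publisher view: one series with two observations.
compact = [
    {
        "dims": {"area": "UK", "sex": "F"},
        "attrs": {"unit": "thousands"},
        "observations": [
            {"time": "2009", "obs_value": 10.5},
            {"time": "2010", "obs_value": 11.0, "attrs": {"status": "provisional"}},
        ],
    }
]

flat_view = flatten(compact)
```

The same copy-down step is what a SPARQL CONSTRUCT query would have to express
against the real vocabulary once the property names are settled.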
Original comment by richard....@gmail.com
on 15 Apr 2010 at 10:35
Original comment by i.j.dick...@gmail.com
on 7 May 2010 at 9:06
Original issue reported on code.google.com by
Dave.e.R...@gmail.com
on 25 Mar 2010 at 10:07