I am working on a new version of the XYMultiresolution class at the moment that is able to pluggably partition the dataset and apply pluggable aggregate functions to the X or Y values.
For example, with DatePartitionStrategy, the partitioner will group points into integral date intervals (seconds, minutes, quarter hours, hours, days, weeks, months, quarters, decades, centuries, and more in between that I haven't mentioned). These partitions are guaranteed to begin and end on integral date boundaries, so a month begins on its first day and ends at the beginning of the next month, and an hour begins at XX:00:00 and ends at YY:00:00 (YY = XX + 1 mod 24).
After the dataset is partitioned, an aggregate function is used to create the upper resolutions of the dataset. Examples of aggregates: SUM, AVERAGE, AVERAGE_AREA, EXTREMA (most outlying point in the interval), etc.
From there, the data is repartitioned again: say daily data was partitioned into weeks and a SUM applied (total per week); then a set of month partitions might be instantiated, with the next resolution level being the total per month.
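A minimal sketch of what these pluggable pieces might look like; the interface and method names below (PartitionStrategy, AggregateFunction, partitionStart, etc.) are illustrative assumptions, not the actual XYMultiresolution API:

```java
// Hypothetical sketch -- names are illustrative only, not the real API.
interface PartitionStrategy {
  // Start of the partition containing the given domain value, e.g. midnight
  // of the first day of the month for a MONTH partition.
  double partitionStart(double domainValue);

  // Start of the partition that follows the one containing domainValue.
  double nextPartitionStart(double domainValue);
}

interface AggregateFunction {
  // Folds the range values of one partition into a single value,
  // e.g. SUM, AVERAGE, or the most outlying point (EXTREMA).
  double aggregate(double[] rangeValues, int start, int end);
}
```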
The complexity is in reusing previous work and dealing with sparse data. If you just computed the average value per week, and now you're computing the average value per month, you don't need to recompute 28-31 datapoints, as you can reuse some of the values from the previous resolution level. Also, if you've got a dataset with 10 datapoints in 1970 and 1000 datapoints in 2000, with emptiness in between, you want to efficiently skip over partitions with zero data.
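As a rough illustration of the reuse idea, assuming each partition keeps a running sum and point count (the Partition class here is hypothetical, not the real data structure), a coarser average can be built from those partials rather than from the raw points, and empty partitions can be skipped outright:

```java
// Hypothetical sketch of building a coarser resolution level from the
// partial sums/counts of the level below, skipping empty partitions.
final class Partition {
  final double sum;  // sum of raw range values in this partition
  final int count;   // number of raw points in this partition

  Partition(double sum, int count) {
    this.sum = sum;
    this.count = count;
  }

  // Average per month computed from per-week partials (approximate where a
  // week straddles a month boundary), without revisiting the 28-31
  // underlying daily points.
  static double monthlyAverage(java.util.List<Partition> weeksInMonth) {
    double sum = 0;
    int count = 0;
    for (Partition week : weeksInMonth) {
      if (week.count == 0) {
        continue;  // sparse data: skip partitions with no points
      }
      sum += week.sum;
      count += week.count;
    }
    return count == 0 ? Double.NaN : sum / count;
  }
}
```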
-Ray
Original comment by cromwell...@gmail.com
on 19 Dec 2007 at 12:08
In our applications, we need to be able to distinguish between zero and missing data.
An alternative way to approach the above issue is to explicitly indicate the points being used to draw the chart (once you have zoomed in sufficiently to make this possible). Thus, in the above example, there would be a data point indicated in the trend line for 6/9/07 and a data point indicated in the trend line for 6/11/07, but no data point in the trend line for 6/10/07.
This explicitly shows the user that there was no data found for 6/10/07. The other approach, having the trend line suddenly dip to zero, represents something very different: that there was data for 6/10/07 and that its value was zero.
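One way to capture that distinction in the data itself, sketched here with illustrative values and not tied to how Chronoscope actually stores ranges, is to reserve NaN for "no data" so that zero remains a real measured value:

```java
// Sketch: Double.NaN marks "no data for this day", so zero remains a real
// measured value rather than a stand-in for "missing".
class MissingVsZero {
  public static void main(String[] args) {
    String[] days  = { "6/9/07", "6/10/07", "6/11/07" };
    double[] range = { 4.2, Double.NaN, 0.0 };  // illustrative values

    for (int i = 0; i < days.length; i++) {
      if (Double.isNaN(range[i])) {
        System.out.println(days[i] + ": no data (draw nothing, leave a gap)");
      } else {
        System.out.println(days[i] + ": value " + range[i] + " (draw a point)");
      }
    }
  }
}
```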
Original comment by philipmj...@gmail.com
on 10 Mar 2008 at 4:55
Philip,
Some charting libraries deal with that situation by using what's referred to as a 'gap threshold'. That is, you draw a connecting line segment between point P1 and point P2 so long as distance(P1, P2) < gap_threshold. For example, if you set the gap threshold to 1 day, then any gap in the dataset that spans more than 24 hours will yield a break in the trend line: there will be a line ending at 6/9/07 and one beginning again at 6/11/07, with a gap between 6/9 and 6/11.
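In code terms, the rule amounts to something like the following sketch (hypothetical method names; Chronoscope's actual renderer API may differ):

```java
// Sketch of the gap-threshold rule: connect consecutive points only when the
// domain distance between them is below the configured threshold.
final class GapThresholdSketch {
  static final double ONE_DAY_MS = 24L * 60 * 60 * 1000;

  static void renderLine(double[] domain, double[] range, double gapThreshold) {
    for (int i = 1; i < domain.length; i++) {
      if (domain[i] - domain[i - 1] < gapThreshold) {
        // drawSegment(domain[i - 1], range[i - 1], domain[i], range[i]);
        // (hypothetical renderer call)
      } else {
        // Gap exceeds the threshold: end the line at the previous point and
        // start a new one here, e.g. end at 6/9/07 and resume at 6/11/07.
      }
    }
  }
}
```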
Gap threshold is actually another RFE issue that is being worked on.
-Ray
Original comment by cromwell...@gmail.com
on 10 Mar 2008 at 5:13
For gaps, chronoscope should definitely not assume a value of zero. It should either assume "there's no data for this period" and handle it cleanly (e.g. weekend dates for stock prices) or it should throw some sort of exception.
-Jason
Original comment by jason....@gmail.com
on 15 Aug 2008 at 4:35
Why not let the range contain nulls to indicate missing data? Is it a
performance consideration?
Original comment by tom...@gmail.com
on 23 Sep 2010 at 12:18
When there are many unpredictable gaps, it's more efficient to set a gap
threshold that serves to distinguish missing data from some range of expected
intervals between data. In the case of something with a known calendar, like
stock market data with gaps every evening and weekend, it's more efficient to
set the calendar.
The situation that comes to mind where you might want to denote a known gap with nulls or NaNs would then be when that gap is less than the gap threshold, which is to say a distance that could just as well be the interval between connected points. It's probably best to denote known small gaps outside the actual data values, in a way similar to domain highlight regions marking some interval of interest.
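For the known-calendar case, a predicate describing expected gaps can supplement the raw distance check. The names below (TradingCalendar, shouldConnect) are hypothetical, and whether an expected gap is drawn connected or as a visible break is a separate rendering choice:

```java
// Sketch: a calendar predicate marks gaps that are expected (e.g. evenings
// and weekends for stock data), so only unexpected gaps are treated as
// missing data.
interface TradingCalendar {
  boolean isExpectedGap(double fromDomain, double toDomain);
}

final class GapPolicy {
  // Connect two consecutive points if the gap is small, or if the calendar
  // says the gap is expected (and therefore not "missing" data).
  static boolean shouldConnect(double d0, double d1,
                               double gapThreshold, TradingCalendar cal) {
    return (d1 - d0) < gapThreshold || cal.isExpectedGap(d0, d1);
  }
}
```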
A use case might help, if you have a specific one in mind.
Original comment by timepedia@gmail.com
on 25 Sep 2010 at 4:57
What about using another numeric flag to indicate whether a zero value is bad
or valid data?
Original comment by sholto.m...@gmail.com
on 17 Feb 2011 at 3:14
Original issue reported on code.google.com by
a.bue...@gmail.com
on 24 Nov 2007 at 8:54