codeaudit / gwt-chronoscope

Automatically exported from code.google.com/p/gwt-chronoscope
GNU Lesser General Public License v2.1

Interpolated values #1

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
One thing I realized is that the input time series cannot have gaps. The series
has to contain every date, with dates that have no explicit value set to zero.
Otherwise Chronoscope interpolates the missing values, which is obviously wrong
for some problems.

Take the case where you count accesses to some resource. Say we have the
date 06/09/07 with 12 accesses and the date 06/11/07 with 6 accesses; not
setting the 10th to 0 will result in a wrong graph, with an interpolated
value of 9 accesses for the date 06/10/07.

That is not necessarily an error, but it would be nice to have some way to
tell Chronoscope whether it should interpolate the missing values or set them to 0.
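To make the difference concrete, here is a small Python sketch of the two behaviours for the example above (Chronoscope itself is Java/GWT; `fill_gaps` is a hypothetical helper for illustration, not part of its API):

```python
from datetime import date, timedelta

def fill_gaps(series, fill=0):
    """Expand a sparse {date: value} series so every day in the covered
    range has an explicit value (hypothetical helper, not Chronoscope API)."""
    days = sorted(series)
    out = {}
    d = days[0]
    while d <= days[-1]:
        out[d] = series.get(d, fill)
        d += timedelta(days=1)
    return out

# Sparse input: 06/09/07 and 06/11/07, nothing for 06/10/07.
sparse = {date(2007, 6, 9): 12, date(2007, 6, 11): 6}

# Linear interpolation (what the chart effectively does today) would
# render 06/10/07 as (12 + 6) / 2 = 9 accesses.
# Zero-filling instead makes the gap explicit:
filled = fill_gaps(sparse)
print(filled[date(2007, 6, 10)])  # 0, not an interpolated 9
```

Either behaviour can be correct depending on the data, which is why a per-dataset switch would be useful.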

Note that Ray Cromwell already responded to that issue on my blog:
'The issue with interpolated values is a complex one due to the zoomable
nature of the chart. One of the current problems is that X axis values
should be capable of being aligned on hard date intervals (monthly, daily,
etc). The other issue is when zooming out, the lower detail levels need
more flexibility. You should be able to specify whether you want a zoomed
out view of say, daily data, to be aggregated by mean, median, sum, max,
min, etc. That way, if you're looking at say, daily web traffic, you can
specify "SUM", and the zoomed out view will display the monthly total.'

Original issue reported on code.google.com by a.bue...@gmail.com on 24 Nov 2007 at 8:54

GoogleCodeExporter commented 9 years ago

I am working on a new version of the XYMultiresolution class at the moment that
is able to pluggably partition the dataset, and apply pluggable aggregate
functions to the X or Y values.

For example, with DatePartitionStrategy, the partitioner will group points into
integral date intervals (seconds, minutes, quarter hours, hours, days, weeks,
months, quarters, decades, centuries, and more in between that I haven't
mentioned). These partitions are guaranteed to begin and end on integral dates,
so months begin on the first day and end at the beginning of the next month. An
hour begins at XX:00:00 and ends at YY:00:00 (YY = XX + 1 mod 24).
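The boundary alignment Ray describes amounts to flooring a timestamp to the start of its partition. A minimal Python sketch of that idea (function names are illustrative, not Chronoscope's):

```python
from datetime import datetime

def floor_to_hour(t):
    """Snap a timestamp down to the integral start of its hour partition,
    so the partition runs from XX:00:00 up to the next XX+1:00:00."""
    return t.replace(minute=0, second=0, microsecond=0)

def floor_to_month(t):
    """Snap a timestamp to the first day of its month partition."""
    return t.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

t = datetime(2007, 6, 10, 14, 37, 5)
print(floor_to_hour(t))   # 2007-06-10 14:00:00
print(floor_to_month(t))  # 2007-06-01 00:00:00
```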

After the dataset is partitioned, an aggregate function is used to create the
upper resolutions of the dataset. Examples of aggregates: SUM, AVERAGE,
AVERAGE_AREA, EXTREMA (most outlying point in interval), etc.

From there, the data is repartitioned again: say daily data was partitioned into
weeks and a SUM applied (total per week); then a set of month partitions might
be instantiated, with the next resolution level being total per month.
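The partition-then-aggregate step can be sketched in a few lines of Python (again illustrative names, not the actual XYMultiresolution API), here building the weekly SUM resolution from daily data:

```python
from collections import defaultdict
from datetime import date, timedelta

def partition_and_sum(points, start_of_partition):
    """Group (date, value) points by a pluggable partition function and
    apply SUM as the aggregate."""
    buckets = defaultdict(float)
    for d, v in points:
        buckets[start_of_partition(d)] += v
    return dict(buckets)

def week_start(d):
    # Monday-aligned week partition: floor each date to its week's start.
    return d - timedelta(days=d.weekday())

# Two weeks of daily data, 10 per day, starting Monday 2007-06-04.
daily = [(date(2007, 6, 4) + timedelta(days=i), 10) for i in range(14)]
weekly = partition_and_sum(daily, week_start)
print(weekly)  # two week buckets, 70.0 each
```

Swapping `week_start` for a month-flooring function would yield the next resolution level (total per month), which is what makes the strategy pluggable.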

The complexity is in reusing previous work and dealing with sparse data. If you
just computed the average value per week, and now you're computing the average
value per month, you don't need to recompute 28-31 datapoints, as you can reuse
some of the values from the previous resolution level. Also, if you've got a
dataset with 10 datapoints in 1970, and 1000 datapoints in 2000, with emptiness
in between, you want to efficiently skip over partitions with zero data.

-Ray

Original comment by cromwell...@gmail.com on 19 Dec 2007 at 12:08

GoogleCodeExporter commented 9 years ago
In our applications, we need to be able to distinguish between zero and missing
data.

An alternative way to approach the above issue is to explicitly indicate the
points being used to draw the chart (once you have zoomed in sufficiently to
make this possible).

Thus, in the above example, there would be a data point indicated in the trend
line for 6/9/07 and a data point indicated in the trend line for 6/11/07, but
no data point in the trend line for 6/10/07.

This represents explicitly to the user that there was no data found for
6/10/07. The other approach, having the trend line suddenly dip to zero,
represents something very different: that there was data for 6/10/07 and that
the value of that data was zero.

Original comment by philipmj...@gmail.com on 10 Mar 2008 at 4:55

GoogleCodeExporter commented 9 years ago
Philip,
 Some charting libraries deal with that situation by using what's referred to as
a 'gap threshold'. That is, you draw a connecting line segment between point P1
and point P2 so long as distance(P1, P2) < gap_threshold. For example, if you
set the gap threshold to be 1 day, then any gap in the dataset that spans more
than 24 hours will yield a break in the trend line. That is, there will be a
line ending at 6/9/07 and one beginning again at 6/11/07, but a gap between
6/9 and 6/11.
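The gap-threshold rule is easy to sketch: walk the sorted points and start a new polyline segment whenever two consecutive points are further apart than the threshold (Python sketch of the idea, not Chronoscope code):

```python
from datetime import date, timedelta

def split_segments(points, gap_threshold):
    """Break a sorted list of (date, value) points into polyline segments,
    connecting consecutive points only when their spacing is within the
    gap threshold."""
    segments, current = [], []
    for p in points:
        if current and (p[0] - current[-1][0]) > gap_threshold:
            segments.append(current)
            current = []
        current.append(p)
    if current:
        segments.append(current)
    return segments

points = [(date(2007, 6, 9), 12), (date(2007, 6, 11), 6), (date(2007, 6, 12), 8)]
# With a 1-day threshold, the 2-day gap between 6/9 and 6/11 breaks the line:
# one segment ends at 6/9, another runs from 6/11 to 6/12.
print(split_segments(points, timedelta(days=1)))
```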

Gap threshold is actually another RFE issue that is being worked on.

-Ray

Original comment by cromwell...@gmail.com on 10 Mar 2008 at 5:13

GoogleCodeExporter commented 9 years ago
For gaps, Chronoscope should definitely not assume a value of zero. It should
either assume "there's no data for this period" and handle it cleanly (e.g.
weekend dates for stock prices), or it should throw some sort of exception.

-Jason

Original comment by jason....@gmail.com on 15 Aug 2008 at 4:35

GoogleCodeExporter commented 9 years ago
Why not let the range contain nulls to indicate missing data? Is it a
performance consideration?

Original comment by tom...@gmail.com on 23 Sep 2010 at 12:18

GoogleCodeExporter commented 9 years ago
When there are many unpredictable gaps, it's more efficient to set a gap
threshold that serves to distinguish missing data from some range of expected
intervals between data. In the case of something with a known calendar, like
stock market data with gaps every evening and weekend, it's more efficient to
set the calendar.

The situation that comes to mind where you might want to denote a known gap
with nulls or NaNs would then be when that gap is less than the gap threshold,
which is to say a distance that could just as well be the interval between
connected points. It's probably best to denote known small gaps outside the
actual data values, in a way similar to domain highlight regions marking some
interval of interest.
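For comparison, the NaN-sentinel approach being discussed looks roughly like this in Python (a sketch of the idea only; Chronoscope does not currently define this behaviour): a NaN in the range explicitly marks "missing here", breaking the trend line without being mistaken for a zero.

```python
import math

def segments_with_nan(values):
    """Split a list of range values into trend-line segments, treating NaN
    as an explicit 'missing data' marker that breaks the line."""
    segments, current = [], []
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            if current:
                segments.append(current)
                current = []
        else:
            current.append(v)
    if current:
        segments.append(current)
    return segments

print(segments_with_nan([12, float("nan"), 6, 8]))  # [[12], [6, 8]]
```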

A use case might help, if you have a specific one in mind.

Original comment by timepedia@gmail.com on 25 Sep 2010 at 4:57

GoogleCodeExporter commented 9 years ago
What about using another numeric flag to indicate whether a zero value is bad
or valid data?

Original comment by sholto.m...@gmail.com on 17 Feb 2011 at 3:14