OpenTSDB / opentsdb

A scalable, distributed Time Series Database.
http://opentsdb.net
GNU Lesser General Public License v2.1

Vector time series support #1172

Open dixson3 opened 6 years ago

dixson3 commented 6 years ago

I am interested in using opentsdb, but I need the ability to record and process a vector value at a timestamp and not a simple scalar, e.g.:

spectra host=spectrometer01 1356998400 [10,20,30,40,...,8010]
spectra host=spectrometer01 1356998401 [9,21,30,41,...,8011]
spectra host=spectrometer01 1356998402 [10,20,32,40,...,8009]

The vector is quite large (approximately 800 32-bit unsigned integers), so it feels impractical to create a unique metric for each element of the vector.

Are there plans to support vector time series or does anyone have a similar use case that they have successfully implemented on top of opentsdb?

manolama commented 6 years ago

Hi @dixson3 What's the use case for the vectors? i.e. what do they represent?

dixson3 commented 6 years ago

The vector holds raw amplitude measurements from a spectrometer across a constant range of wavelengths. In simple terms, we measure spectra and then convert those spectra into a compositional profile.

Our current infrastructure is real-time but very "one shot": we do the model fit at the time of measurement and then record the result in a simple relational database.

What we would like to do is record the raw spectra vector in a time series DB, run models in real time and store the computed composition data in a time series DB, and also run new models against historical data.

manolama commented 6 years ago

Gotcha. Is the data for each vector always representing the same wavelengths in the same order?

The typical way to handle this is to simply create a series for each wavelength, e.g.

spectra host=spectrometer01, wavelength=380-450 1356998400 10
spectra host=spectrometer01, wavelength=450-495 1356998400 20
spectra host=spectrometer01, wavelength=495-570 1356998400 30

then you can slice and dice the data using the existing query engine.
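For illustration, here is a minimal Python sketch of that expansion, turning one vector sample into one data point per wavelength bin. The function name `vector_to_put_lines` and the bin labels are hypothetical and not part of OpenTSDB. The output follows the telnet-style `put` layout (metric, timestamp, value, then tags), so the field order differs from the shorthand lines above. With about 800 bins, each sample would produce about 800 such lines per host.

```python
def vector_to_put_lines(metric, host, timestamp, values, bin_labels):
    """Expand one vector sample into one telnet-style `put` line per element.

    `bin_labels` holds one label per vector element (hypothetical naming);
    each label becomes the value of a `wavelength` tag.
    """
    if len(values) != len(bin_labels):
        raise ValueError("values and bin_labels must have the same length")
    return [
        f"put {metric} {timestamp} {value} host={host} wavelength={label}"
        for label, value in zip(bin_labels, values)
    ]


# Example with the three bins shown in the thread.
lines = vector_to_put_lines(
    "spectra",
    "spectrometer01",
    1356998400,
    [10, 20, 30],
    ["380-450", "450-495", "495-570"],
)
for line in lines:
    print(line)
```

Each wavelength bin then becomes its own series, distinguished only by the `wavelength` tag, which is what lets the existing query engine filter and aggregate by bin.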

Alternatively, TSD 3.x supports pluggable data types so you could write functions for working over the wavelength vectors.

dixson3 commented 6 years ago

Yes it is, same wavelengths in same order.

I have thought about using tags to represent the wavelength values, but it is not my preference given the width of the vector (over 800 wavelengths).

I will look at the 3.x codebase; creating a plugin for the vector type is most probably the "right way".

manolama commented 6 years ago

Cool. What kind of aggregations would you perform on the vectors?

thinrope commented 6 years ago

This use case sounds a lot like my need to store gamma-radiation spectra in TSDB. Generally they have 2^n channels (n=8..14, usually 10) of integers, very often with lots of 0s (e.g. fast sampling with high resolution). As a specific simple example, a 2048-channel spectrum is collected every second for 6 hours, then the sum of all of them is represented as a graph and analyzed (you can have a look at some spectra on my site). In more involved experiments, the detector is in motion (on a car, in Fukushima), so a simple sum over a few hours (a drive) does not make much sense; instead, partial sums (across specific time intervals, referenced to geography) are needed.

The standard SUM (per channel, or per dimension in vector terms) is always useful (from T0 to T1). Sometimes a difference (e.g. subtracting a known constant "baseline", "background" or "noise") is also needed. Other useful operations are simple scaling (multiplying all dimensions by a scalar), the generic transformation with a polynomial (if N is the channel/dimension, calculate C = a + bN + cN^2 + ... and multiply each element by the respective C), or, effectively, the Hadamard product with another vector (e.g. calibration, energy efficiency, etc.).
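
As a rough illustration of the operations listed above, here is a minimal sketch in plain Python. All function names are hypothetical, nothing here is an OpenTSDB API, and the channel index N is assumed to start at 0.

```python
def _check_same_length(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")


def window_sum(samples, t0, t1):
    """Per-channel sum of all vectors with t0 <= timestamp <= t1.

    `samples` is a list of (timestamp, vector) pairs of equal-length vectors.
    Returns an empty list if no sample falls inside the window.
    """
    selected = [vec for ts, vec in samples if t0 <= ts <= t1]
    return [sum(channel) for channel in zip(*selected)]


def subtract(vec, baseline):
    """Element-wise difference, e.g. removing a known background."""
    _check_same_length(vec, baseline)
    return [v - b for v, b in zip(vec, baseline)]


def scale(vec, k):
    """Multiply every channel by the scalar k."""
    return [k * v for v in vec]


def hadamard(a, b):
    """Element-wise (Hadamard) product, e.g. applying a calibration vector."""
    _check_same_length(a, b)
    return [x * y for x, y in zip(a, b)]


def polynomial_weights(n_channels, coeffs):
    """Weights C(N) = coeffs[0] + coeffs[1]*N + coeffs[2]*N^2 + ... for N = 0..n_channels-1."""
    return [
        sum(c * n ** power for power, c in enumerate(coeffs))
        for n in range(n_channels)
    ]


# Three one-second samples of a 3-channel spectrum.
samples = [(0, [1, 2, 3]), (1, [4, 5, 6]), (2, [7, 8, 9])]
partial = window_sum(samples, 0, 1)             # sum from T0=0 to T1=1
cleaned = subtract(partial, [1, 1, 1])          # remove a constant baseline
weights = polynomial_weights(3, [1, 2])         # C = 1 + 2N
calibrated = hadamard(cleaned, weights)         # multiply each channel by its C
```

The polynomial transformation is written here as a Hadamard product with a precomputed weight vector, so both cases share one code path. Partial sums for a moving detector would be repeated `window_sum` calls over the time intervals of interest.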